WorldWideScience

Sample records for gpu software performance

  1. Travel Software using GPU Hardware

    Szalwinski, Chris M; Dimov, Veliko Atanasov; CERN. Geneva. ATS Department

    2015-01-01

    Travel is the main multi-particle tracking code used at CERN for beam dynamics calculations through hadron and ion linear accelerators. It uses two routines for the calculation of space-charge forces, namely rings of charges and point-to-point. This report presents studies to improve the performance of Travel using GPU hardware. The studies showed that Travel's point-to-point simulations of space-charge effects can be sped up by a factor of at least 72 using current GPU hardware. Simple recompilation of the source code with an Intel compiler improves performance by a factor of at least 4 without GPU support. The limited memory of the GPU is the bottleneck. Two algorithms were investigated to address it: repeated computation and tiling. The repeated computation algorithm is simpler and is the currently recommended solution. The tiling algorithm was more complicated and degraded performance. Both build and test instructions for the parallelized version of the software are inclu...
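    As an illustration only, the point-to-point sum and a tiled variant of it can be sketched in plain Python. This is not Travel's code: the function names, the softening term `eps`, and the unit convention (1/(4*pi*eps0) = 1) are assumptions made for the sketch. The tiled version visits the source particles one block at a time, which is the idea that lets a device with limited memory hold only one block of sources at once.

```python
import math


def _add_pair(out, pos, charge, i, j, eps):
    """Add the field of source particle j at target particle i.

    Units are chosen so that 1/(4*pi*eps0) = 1; eps is a hypothetical
    softening term that only guards against division by zero.
    """
    d = [pos[i][k] - pos[j][k] for k in range(3)]
    r2 = d[0] * d[0] + d[1] * d[1] + d[2] * d[2] + eps
    scale = charge[j] / (r2 * math.sqrt(r2))
    for k in range(3):
        out[i][k] += scale * d[k]


def field_direct(pos, charge, eps=1e-12):
    """Plain O(N^2) point-to-point sum over all pairs."""
    n = len(pos)
    out = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                _add_pair(out, pos, charge, i, j, eps)
    return out


def field_tiled(pos, charge, tile=4, eps=1e-12):
    """Same sum, but sources are processed in blocks of `tile` particles.

    Only the sources in range(start, stop) need to be resident while a
    block is processed, at the cost of sweeping the targets once per block.
    """
    n = len(pos)
    out = [[0.0, 0.0, 0.0] for _ in range(n)]
    for start in range(0, n, tile):
        stop = min(start + tile, n)
        for i in range(n):
            for j in range(start, stop):
                if i != j:
                    _add_pair(out, pos, charge, i, j, eps)
    return out
```

    Both functions return the same field; the tile size only changes how much source data has to be available at a time, not the result.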

  2. The performances of R GPU implementations of the GMRES method

    Bogdan Oancea

    2018-03-01

    Although the performance of commodity computers has improved drastically with the introduction of multicore processors and GPU computing, the standard R distribution is still based on a single-threaded model of computation, using only a small fraction of the computational power now available in most desktops and laptops. Modern statistical software packages rely on high-performance implementations of the linear algebra routines that are at the core of several important leading-edge statistical methods. In this paper we present a GPU implementation of the GMRES iterative method for solving linear systems. We compare the performance of this implementation with a pure single-threaded version on the CPU. We also investigate the performance of our implementation using different GPU packages now available for R, such as gmatrix, gputools or gpuR, which are based on the CUDA or OpenCL frameworks.
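    For readers unfamiliar with GMRES, a minimal single-threaded sketch follows. It is written in Python rather than R, runs on the CPU only, and is not the paper's implementation; it assumes a nonsingular dense matrix stored as a list of rows, a zero initial guess, and no restarts. The Krylov basis is built with the Arnoldi process (modified Gram-Schmidt), and the small least-squares problem is solved with Givens rotations.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _matvec(A, x):
    return [_dot(row, x) for row in A]


def gmres(A, b, tol=1e-10, max_iter=None):
    """Unrestarted GMRES for A x = b with x0 = 0 (A assumed nonsingular)."""
    n = len(b)
    max_iter = max_iter or n
    beta = math.sqrt(_dot(b, b))
    if beta == 0.0:
        return [0.0] * n
    V = [[bi / beta for bi in b]]   # orthonormal Krylov basis vectors
    R = []                          # R[j]: column j of the rotated Hessenberg matrix
    cs, sn = [], []                 # Givens rotation coefficients
    g = [beta]                      # rotated right-hand side
    for k in range(max_iter):
        w = _matvec(A, V[k])
        h = []
        for j in range(k + 1):      # modified Gram-Schmidt orthogonalisation
            hj = _dot(w, V[j])
            h.append(hj)
            w = [wi - hj * vi for wi, vi in zip(w, V[j])]
        h_next = math.sqrt(_dot(w, w))
        for j in range(k):          # apply the earlier rotations to the new column
            top = cs[j] * h[j] + sn[j] * h[j + 1]
            h[j + 1] = -sn[j] * h[j] + cs[j] * h[j + 1]
            h[j] = top
        r = math.hypot(h[k], h_next)  # new rotation that zeroes h_next
        c, s = h[k] / r, h_next / r
        cs.append(c)
        sn.append(s)
        h[k] = r
        g.append(-s * g[k])
        g[k] = c * g[k]
        R.append(h)
        if abs(g[k + 1]) <= tol * beta or h_next == 0.0:
            break
        V.append([wi / h_next for wi in w])
    m = len(R)
    y = [0.0] * m
    for i in range(m - 1, -1, -1):  # back substitution on the triangular system
        acc = g[i] - sum(R[j][i] * y[j] for j in range(i + 1, m))
        y[i] = acc / R[i][i]
    return [sum(y[j] * V[j][i] for j in range(m)) for i in range(n)]
```

    The dominant costs are the matrix-vector product and the dot products inside the loop, which is why these are the operations a GPU version would typically offload.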

  3. GPU Based Software Correlators - Perspectives for VLBI2010

    Hobiger, Thomas; Kimura, Moritaka; Takefuji, Kazuhiro; Oyama, Tomoaki; Koyama, Yasuhiro; Kondo, Tetsuro; Gotoh, Tadahiro; Amagai, Jun

    2010-01-01

    Caused by historical separation and driven by the requirements of the PC gaming industry, Graphics Processing Units (GPUs) have evolved into massively parallel processing systems that have entered the area of non-graphics applications. Although a single processing core on the GPU is much slower and provides less functionality than its counterpart on the CPU, the huge number of these small processing entities outperforms classical processors when the application can be parallelized. Thus, in recent years various radio astronomical projects have started to make use of this technology, either to realize the correlator on this platform or to establish the post-processing pipeline with GPUs. Therefore, the feasibility of GPUs as a choice for a VLBI correlator is being investigated, including the pros and cons of this technology. Additionally, a GPU-based software correlator will be reviewed with respect to energy consumption/GFlop/sec and cost/GFlop/sec.

  4. Software Graphics Processing Unit (sGPU) for Deep Space Applications

    McCabe, Mary; Salazar, George; Steele, Glen

    2015-01-01

    A graphics processing capability will be required for deep space missions and must include a range of applications, from safety-critical vehicle health status to telemedicine for crew health. However, preliminary radiation testing of commercial graphics processing cards suggests they cannot operate in the deep space radiation environment. Investigation into a Software Graphics Processing Unit (sGPU) comprised of commercial-equivalent radiation hardened/tolerant single board computers, field programmable gate arrays, and safety-critical display software shows promising results. Preliminary performance of approximately 30 frames per second (FPS) has been achieved. Use of multi-core processors may provide a significant increase in performance.

  5. GPU-based high-performance computing for radiation therapy

    Jia, Xun; Jiang, Steve B; Ziegenhein, Peter

    2014-01-01

    Recent developments in radiotherapy demand high computational power to solve challenging problems in a timely fashion in a clinical environment. The graphics processing unit (GPU), as an emerging high-performance computing platform, has been introduced to radiotherapy. It is particularly attractive due to its high computational power, small size, and low cost for facility deployment and maintenance. Over the past few years, GPU-based high-performance computing in radiotherapy has experienced rapid developments. A tremendous number of studies have been conducted, in which large acceleration factors compared with the conventional CPU platform have been observed. In this paper, we will first give a brief introduction to the GPU hardware structure and programming model. We will then review the current applications of GPU in major imaging-related and therapy-related problems encountered in radiotherapy. A comparison of GPU with other platforms will also be presented. (topical review)

  6. A multi-GPU real-time dose simulation software framework for lung radiotherapy.

    Santhanam, A P; Min, Y; Neelakkantan, H; Papp, N; Meeks, S L; Kupelian, P A

    2012-09-01

    Medical simulation frameworks facilitate both the preoperative and postoperative analysis of the patient's pathophysical condition. Of particular importance is the simulation of radiation dose delivery for real-time radiotherapy monitoring and retrospective analyses of the patient's treatment. In this paper, a software framework tailored for the development of simulation-based real-time radiation dose monitoring medical applications is discussed. A multi-GPU-based computational framework coupled with inter-process communication methods is introduced for simulating the radiation dose delivery on a deformable 3D volumetric lung model and its real-time visualization. The model deformation and the corresponding dose calculation are allocated among the GPUs in a task-specific manner and are performed in a pipelined manner. Radiation dose calculations are computed on two different GPU hardware architectures. The integration of this computational framework with a front-end software layer and back-end patient database repository is also discussed. Real-time simulation of the dose delivered is achieved at a rate of once every 120 ms using the proposed framework. With a linear increase in the number of GPU cores, the computational time of the simulation decreased linearly. The inter-process communication time also improved with an increase in the hardware memory. Variations in the delivered dose and computational speedup for variations in the data dimensions are investigated using D70 and D90 as well as gEUD as metrics for a set of 14 patients. Computational speed-up increased with an increase in the beam dimensions when compared with a CPU-based commercial software while the error in the dose calculation was lung model-based radiotherapy is an effective tool for performing both real-time and retrospective analyses.

  7. GPU-FS-kNN: a software tool for fast and scalable kNN computation using GPUs.

    Ahmed Shamsul Arefin

    BACKGROUND: The analysis of biological networks has become a major challenge due to the recent development of high-throughput techniques that are rapidly producing very large data sets. The exploding volumes of biological data demand extreme computational power and special computing facilities (i.e. super-computers). An inexpensive solution, such as General Purpose computation based on Graphics Processing Units (GPGPU), can be adapted to tackle this challenge, but the limitation of the device internal memory can pose a new problem of scalability. An efficient data and computational parallelism with partitioning is required to provide a fast and scalable solution to this problem. RESULTS: We propose an efficient parallel formulation of the k-Nearest Neighbour (kNN) search problem, which is a popular method for classifying objects in several fields of research, such as pattern recognition, machine learning and bioinformatics. Although the kNN search is very simple and straightforward, its performance degrades dramatically for large data sets, since the task is computationally intensive. The proposed approach is not only fast but also scalable to large-scale instances. Based on our approach, we implemented a software tool GPU-FS-kNN (GPU-based Fast and Scalable k-Nearest Neighbour) for CUDA enabled GPUs. The basic approach is simple and adaptable to other available GPU architectures. We observed speed-ups of 50-60 times compared with CPU implementation on a well-known breast microarray study and its associated data sets. CONCLUSION: Our GPU-based Fast and Scalable k-Nearest Neighbour search technique (GPU-FS-kNN) provides a significant performance improvement for nearest neighbour computation in large-scale networks. Source code and the software tool are available under GNU Public License (GPL) at https://sourceforge.net/p/gpufsknn/.
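    The partitioning idea mentioned in the abstract can be illustrated with a small brute-force kNN search that processes the reference points in chunks and merges each chunk's candidates into a running top-k list per query. This is a CPU-only Python sketch under assumed names (`knn_chunked`, `chunk`) and an assumed squared Euclidean distance; it is not the GPU-FS-kNN code.

```python
import heapq


def _sqdist(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) * (a - b) for a, b in zip(p, q))


def knn_chunked(points, k, chunk=2):
    """Return, for every point, the indices of its k nearest other points.

    Reference points are visited in blocks of `chunk` points, so only one
    block of distances exists at a time. The running best list per query
    holds at most k (distance, index) pairs.
    """
    n = len(points)
    best = [[] for _ in range(n)]
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        for q in range(n):
            candidates = best[q] + [
                (_sqdist(points[q], points[j]), j)
                for j in range(start, stop)
                if j != q
            ]
            # Keep the k smallest; ties are broken by the lower index.
            best[q] = heapq.nsmallest(k, candidates)
    return [[j for _, j in row] for row in best]
```

    Because each chunk is merged into the same running list, the result does not depend on the chunk size, which can therefore be chosen to fit whatever memory is available.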

  8. High-Performance Matrix-Vector Multiplication on the GPU

    Sørensen, Hans Henrik Brandenborg

    2012-01-01

    In this paper, we develop a high-performance GPU kernel for one of the most popular dense linear algebra operations, the matrix-vector multiplication. The target hardware is the most recent Nvidia Tesla 20-series (Fermi architecture), which is designed from the ground up for scientific computing...

  9. High performance MRI simulations of motion on multi-GPU systems.

    Xanthis, Christos G; Venetis, Ioannis E; Aletras, Anthony H

    2014-07-04

    MRI physics simulators have been developed in the past for optimizing imaging protocols and for training purposes. However, these simulators have only addressed motion within a limited scope. The purpose of this study was the incorporation of realistic motion, such as cardiac motion, respiratory motion and flow, within MRI simulations in a high performance multi-GPU environment. Three different motion models were introduced in the Magnetic Resonance Imaging SIMULator (MRISIMUL) of this study: cardiac motion, respiratory motion and flow. Simulation of a simple Gradient Echo pulse sequence and a CINE pulse sequence on the corresponding anatomical model was performed. Myocardial tagging was also investigated. In pulse sequence design, software crushers were introduced to accommodate the long execution times in order to avoid the formation of spurious echoes. The displacement of the anatomical model isochromats was calculated within the Graphics Processing Unit (GPU) kernel for every timestep of the pulse sequence. Experiments that would allow simulation of custom anatomical and motion models were also performed. Last, simulations of motion with MRISIMUL on single-node and multi-node multi-GPU systems were examined. Gradient Echo and CINE images of the three motion models were produced and motion-related artifacts were demonstrated. The temporal evolution of the contractility of the heart was presented through the application of myocardial tagging. Better simulation performance and image quality were presented through the introduction of software crushers without the need to further increase the computational load and GPU resources. Last, MRISIMUL demonstrated an almost linear scalable performance with the increasing number of available GPU cards, in both single-node and multi-node multi-GPU computer systems. MRISIMUL is the first MR physics simulator to have implemented motion with a 3D large computational load on a single computer multi-GPU configuration.
The incorporation

  10. GPU accelerated CT reconstruction for clinical use: quality driven performance

    Vaz, Michael S.; Sneyders, Yuri; McLin, Matthew; Ricker, Alan; Kimpe, Tom

    2007-03-01

    We present performance and quality analysis of GPU accelerated FDK filtered backprojection for cone beam computed tomography (CBCT) reconstruction. Our implementation of the FDK CT reconstruction algorithm does not compromise fidelity at any stage and yields a result that is within 1 HU of a reference C++ implementation. Our streaming implementation is able to perform reconstruction as the images are acquired; it addresses low latency as well as fast throughput, which are key considerations for a "real-time" design. Further, it is scalable to multiple GPUs for increased performance. The implementation does not place any constraints on image acquisition; it works effectively for arbitrary angular coverage with arbitrary angular spacing. As such, this GPU accelerated CT reconstruction solution may easily be used with scanners that are already deployed. We are able to reconstruct a 512 x 512 x 340 volume from 625 projections, each sized 1024 x 768, in less than 50 seconds. The quoted 50 second timing encompasses the entire reconstruction using bilinear interpolation and includes filtering on the CPU, uploading the filtered projections to the GPU, and also downloading the reconstructed volume from GPU memory to system RAM.

  11. Performance of Cellular Automata-based Stream Ciphers in GPU Implementation

    P. G. Klyucharev

    2016-01-01

    Earlier, the author developed methods for building high-performance symmetric ciphers based on generalized cellular automata; the resulting encryption algorithms show extremely high performance in hardware implementations. However, their implementation on conventional microprocessors lacks high performance. This is not unusual, and it delimits the scope of application of these ciphers. Nevertheless, the use of graphics processors makes it possible to achieve adequate performance for a software implementation. This article continues a series of articles that study various aspects of constructing and implementing cryptographic algorithms based on generalized cellular automata. It aims to study the possibilities of implementing the algorithms under consideration on GPUs. The implemented encryption algorithm, which acts as a key generator, comprises 2k generalized cellular automata. The graphs of the cellular automata are Ramanujan graphs. The cells of the k gamma streams produced alternate, which allows the GPU capabilities to be better used. OpenCL, as the most universal and platform-independent API, was used for the implementation. The software, written in C++, was designed so that the user can set various parameters, including the encryption key, the graph structure, the local communication function, various constants, etc. A variety of graphics processors (NVIDIA GTX 650, NVIDIA GTX 770, AMD R9 280X) were used for testing. Depending on the operating conditions and the GPU used, performance ranges from 0.47 to 6.61 Gb/s, which is comparable to the performance of counterparts. Thus, the article has demonstrated that using the GPU makes it possible to provide an efficient software implementation of stream ciphers based on generalized cellular automata. This work was supported by the RFBR, project №16-07-00542.

  12. Blaze-DEMGPU: Modular high performance DEM framework for the GPU architecture

    Nicolin Govender

    2016-01-01

    Blaze-DEMGPU is a modular GPU-based discrete element method (DEM) framework that supports polyhedral shaped particles. The high level of performance is attributed to the light weight and Single Instruction Multiple Data (SIMD) execution that the GPU architecture offers. Blaze-DEMGPU offers suitable algorithms to conduct DEM simulations on the GPU, and these algorithms can be extended and modified. Since a large number of scientific simulations are particle based, many of the algorithms and strategies for GPU implementation present in Blaze-DEMGPU can be applied to other fields. Blaze-DEMGPU will make it easier for new researchers to use high performance GPU computing as well as stimulate wider GPU research efforts by the DEM community.

  13. High performance GPU processing for inversion using uniform grid searches

    Venetis, Ioannis E.; Saltogianni, Vasso; Stiros, Stathis; Gallopoulos, Efstratios

    2017-04-01

    Many geophysical problems are described by redundant, highly non-linear systems of ordinary equations with constant terms deriving from measurements and hence representing stochastic variables. Solution (inversion) of such problems is based on numerical optimization methods, on Monte Carlo sampling or on exhaustive searches in cases of two or even three "free" unknown variables. Recently the TOPological INVersion (TOPINV) algorithm, a grid search-based technique in the R^n space, has been proposed. TOPINV is not based on the minimization of a certain cost function and involves only forward computations, hence avoiding computational errors. The basic concept is to transform observation equations into inequalities on the basis of an optimization parameter k and of their standard errors, and through repeated "scans" of n-dimensional search grids for decreasing values of k to identify the optimal clusters of gridpoints which satisfy observation inequalities and by definition contain the "true" solution. Stochastic optimal solutions and their variance-covariance matrices are then computed as first and second statistical moments. Such exhaustive uniform searches produce an excessive computational load and are extremely time consuming for common computers based on a CPU. An alternative is to use a computing platform based on a GPU, which nowadays is affordable to the research community and provides a much higher computing performance. Using the CUDA programming language to implement TOPINV allows the investigation of the attained speedup in execution time on such a high performance platform. Based on synthetic data we compared the execution time required for two typical geophysical problems, modeling magma sources and seismic faults, described with up to 18 unknown variables, on both CPU/FORTRAN and GPU/CUDA platforms.
The same problems for several different sizes of search grids (up to 10^12 gridpoints) and numbers of unknown variables were solved on
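    The grid-scan idea described in the abstract can be sketched as follows. This is a CPU-only Python illustration, not the TOPINV code: the function name, the argument layout and the exact form of the inequality, |f_i(x) - l_i| <= k * sigma_i, are assumptions made for the sketch. Every gridpoint is tested independently of the others, which is what makes the scan a natural fit for a GPU.

```python
import itertools


def grid_search_inversion(models, observations, sigmas, axes, k_values):
    """Scan a uniform grid for decreasing k and summarise the final cluster.

    models       -- forward functions f_i(x), x being a tuple of unknowns
    observations -- measured values l_i
    sigmas       -- standard errors sigma_i of the measurements
    axes         -- one list of candidate values per unknown
    k_values     -- trial values of the optimisation parameter k
    Returns (mean, covariance, k_used) for the smallest k with a non-empty
    set of accepted gridpoints.
    """
    candidates = list(itertools.product(*axes))
    accepted, k_used = [], None
    for k in sorted(k_values, reverse=True):
        hits = [
            x for x in candidates
            if all(abs(f(x) - l) <= k * s
                   for f, l, s in zip(models, observations, sigmas))
        ]
        if not hits:
            break
        # A smaller k can only shrink the set, so later scans reuse the hits.
        accepted, k_used, candidates = hits, k, hits
    if not accepted:
        raise ValueError("no gridpoint satisfies the inequalities")
    n, dim = len(accepted), len(axes)
    mean = [sum(x[d] for x in accepted) / n for d in range(dim)]
    cov = [
        [sum((x[a] - mean[a]) * (x[b] - mean[b]) for x in accepted) / n
         for b in range(dim)]
        for a in range(dim)
    ]
    return mean, cov, k_used
```

    The mean and covariance are the first and second moments of the accepted cluster; no cost function is minimised and only forward evaluations of the models are needed.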

  14. Highly-optimized TWSM software package for seismic diffraction modeling adapted for GPU-cluster

    Zyatkov, Nikolay; Ayzenberg, Alena; Aizenberg, Arkady

    2015-04-01

    Oil-producing companies seek to increase the resolution of seismic data for complex oil-and-gas bearing deposits associated with salt domes, basalt traps, reefs, lenses, etc. Known methods of seismic wave theory define the shape of hydrocarbon accumulations with insufficient resolution, since they do not account for multiple diffractions explicitly. We elaborate an alternative seismic wave theory in terms of operators of propagation in layers and reflection-transmission at curved interfaces. An approximation of this theory is realized in the seismic frequency range as the Tip-Wave Superposition Method (TWSM). TWSM, based on the operator theory, allows evaluation of the wavefield in bounded domains/layers with geometrical shadow zones (in nature these can be salt domes, basalt traps, reefs, lenses, etc.), accounting for so-called cascade diffraction. Cascade diffraction includes edge waves from sharp edges, creeping waves near concave parts of interfaces, whispering-gallery waves near convex parts of interfaces, etc. The basic algorithm of the TWSM package is based on multiplication of large matrices (hundreds of terabytes in size). We use advanced information technologies for effective realization of the numerical procedures of the TWSM. In particular, we actively use NVIDIA CUDA technology and GPU accelerators, which significantly improve the performance of the TWSM software package; this is important when using it for direct and inverse problems. The accuracy, stability and efficiency of the algorithm are justified by numerical examples with curved interfaces. The TWSM package and its separate components can be used in different modeling tasks, such as planning of acquisition systems, physical interpretation of laboratory modeling and modeling of individual waves of different types, and in some inverse tasks, such as imaging in the case of laterally inhomogeneous overburden and AVO inversion.

  15. GPU-based high performance Monte Carlo simulation in neutron transport

    Heimlich, Adino; Mol, Antonio C.A.; Pereira, Claudio M.N.A. [Instituto de Engenharia Nuclear (IEN/CNEN-RJ), Rio de Janeiro, RJ (Brazil). Lab. de Inteligencia Artificial Aplicada], e-mail: cmnap@ien.gov.br

    2009-07-01

    Graphics Processing Units (GPU) are high performance co-processors intended, originally, to improve the use and quality of computer graphics applications. Since researchers and practitioners realized the potential of using GPUs for general-purpose computing, their application has been extended to fields outside the scope of computer graphics. The main objective of this work is to evaluate the impact of using GPUs in neutron transport simulation by the Monte Carlo method. To accomplish that, GPU- and CPU-based (single and multicore) approaches were developed and applied to a simple but time-consuming problem. Comparisons demonstrated that the GPU-based approach is about 15 times faster than a parallel 8-core CPU-based approach also developed in this work. (author)

  16. GPU-based high performance Monte Carlo simulation in neutron transport

    Heimlich, Adino; Mol, Antonio C.A.; Pereira, Claudio M.N.A.

    2009-01-01

    Graphics Processing Units (GPU) are high performance co-processors intended, originally, to improve the use and quality of computer graphics applications. Since researchers and practitioners realized the potential of using GPUs for general-purpose computing, their application has been extended to fields outside the scope of computer graphics. The main objective of this work is to evaluate the impact of using GPUs in neutron transport simulation by the Monte Carlo method. To accomplish that, GPU- and CPU-based (single and multicore) approaches were developed and applied to a simple but time-consuming problem. Comparisons demonstrated that the GPU-based approach is about 15 times faster than a parallel 8-core CPU-based approach also developed in this work. (author)

  17. GPU-accelerated back-projection revisited. Squeezing performance by careful tuning

    Papenhausen, Eric; Zheng, Ziyi; Mueller, Klaus [Stony Brook Univ., NY (United States). Computer Science Dept.

    2011-07-01

    In recent years, GPUs have become an increasingly popular tool in computed tomography (CT) reconstruction. In this paper, we discuss performance optimization techniques for a GPU-based filtered-backprojection reconstruction implementation. We explore the different optimization techniques we used and explain how those techniques affected performance. Our results show a nearly 50% increase in performance when compared to the current top ranked GPU implementation. (orig.)

  18. CUDA/GPU Technology : Parallel Programming For High Performance Scientific Computing

    YUHENDRA; KUZE, Hiroaki; JOSAPHAT, Tetuko Sri Sumantyo

    2009-01-01

    Graphics processing units (GPUs), originally designed for computer video cards, have emerged as the most powerful chip in a high-performance workstation. In high-performance computation, GPUs deliver much higher performance than conventional CPUs by means of parallel processing. In 2007, the birth of Compute Unified Device Architecture (CUDA) and CUDA-enabled GPUs by NVIDIA Corporation brought a revolution in the general purpose GPU a...

  19. Celeris: A GPU-accelerated open source software with a Boussinesq-type wave solver for real-time interactive simulation and visualization

    Tavakkol, Sasan; Lynett, Patrick

    2017-08-01

    In this paper, we introduce an interactive coastal wave simulation and visualization software, called Celeris. Celeris is an open source software which needs minimum preparation to run on a Windows machine. The software solves the extended Boussinesq equations using a hybrid finite volume-finite difference method and supports moving shoreline boundaries. The simulation and visualization are performed on the GPU using Direct3D libraries, which enables the software to run faster than real-time. Celeris provides a first-of-its-kind interactive modeling platform for coastal wave applications and it supports simultaneous visualization with both photorealistic and colormapped rendering capabilities. We validate our software through comparison with three standard benchmarks for non-breaking and breaking waves.

  20. An Investigation of the Performance of the Colored Gauss-Seidel Solver on CPU and GPU

    Yoon, Jong Seon; Choi, Hyoung Gwon; Jeon, Byoung Jin

    2017-01-01

    The performance of the colored Gauss–Seidel solver on CPU and GPU was investigated for the two- and three-dimensional heat conduction problems by using different mesh sizes. The heat conduction equation was discretized by the finite difference method and finite element method. The CPU yielded good performance for small problems, but its performance deteriorated for large problems, when the total memory required for computing exceeded the cache memory. In contrast, the GPU performed better as the mesh size increased because of the latency hiding technique. Further, GPU computation by the colored Gauss–Seidel solver was approximately 7 times faster than that by the single CPU. Furthermore, the colored Gauss–Seidel solver was found to be approximately twice as fast as the Jacobi solver when parallel computing was conducted on the GPU.
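    To show what "colored" means here, the following Python sketch applies a two-color (red-black) Gauss–Seidel sweep to the steady 2D heat conduction (Laplace) equation on a uniform finite difference grid with fixed boundary values. It is a serial illustration under those assumptions, with a hypothetical function name, and is not the solver studied in the paper. Points of one color depend only on points of the other color, so all updates within a color are independent and could run in parallel.

```python
def red_black_gauss_seidel(u, sweeps):
    """Relax the interior of grid u in place with a 5-point stencil.

    u is a list of rows; the outermost rows and columns hold fixed
    (Dirichlet) boundary values. Each sweep updates all points with
    (i + j) even first, then all points with (i + j) odd.
    """
    n, m = len(u), len(u[0])
    for _ in range(sweeps):
        for color in (0, 1):
            for i in range(1, n - 1):
                for j in range(1, m - 1):
                    if (i + j) % 2 == color:
                        # Every neighbour has the opposite color, so this
                        # update never reads a value changed in this pass.
                        u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                          + u[i][j - 1] + u[i][j + 1])
    return u
```

    A Jacobi sweep would instead read only values from the previous sweep; the colored variant uses freshly updated values of the other color while keeping the same degree of parallelism within each color.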

  1. An Investigation of the Performance of the Colored Gauss-Seidel Solver on CPU and GPU

    Yoon, Jong Seon; Choi, Hyoung Gwon [Seoul Nat’l Univ. of Science and Technology, Seoul (Korea, Republic of); Jeon, Byoung Jin [Yonsei Univ., Seoul (Korea, Republic of)

    2017-02-15

    The performance of the colored Gauss–Seidel solver on CPU and GPU was investigated for the two- and three-dimensional heat conduction problems by using different mesh sizes. The heat conduction equation was discretized by the finite difference method and finite element method. The CPU yielded good performance for small problems, but its performance deteriorated for large problems, when the total memory required for computing exceeded the cache memory. In contrast, the GPU performed better as the mesh size increased because of the latency hiding technique. Further, GPU computation by the colored Gauss–Seidel solver was approximately 7 times faster than that by the single CPU. Furthermore, the colored Gauss–Seidel solver was found to be approximately twice as fast as the Jacobi solver when parallel computing was conducted on the GPU.

  2. A Generic High-performance GPU-based Library for PDE solvers

    Glimberg, Stefan Lemvig; Engsig-Karup, Allan Peter

    , the privilege of high-performance parallel computing is now in principle accessible for many scientific users, no matter their economic resources. Though being highly effective units, GPUs and parallel architectures in general, pose challenges for software developers to utilize their efficiency. Sequential...... legacy codes are not always easily parallelized and the time spent on conversion might not pay off in the end. We present a highly generic C++ library for fast assembling of partial differential equation (PDE) solvers, aiming at utilizing the computational resources of GPUs. The library requires a minimum...... of GPU computing knowledge, while still offering the possibility to customize user-specific solvers at kernel level if desired. Spatial differential operators are based on matrix free flexible order finite difference approximations. These matrix free operators minimize both memory consumption and main memory access...

  3. Modular Software Performance Monitoring

    Kruse, D F

    2011-01-01

    CPU clock frequency is not likely to be increased significantly in the coming years, and data analysis speed can be improved by using more processors or buying new machines, only if one is willing to change the paradigm to a parallel one. Therefore, performance monitoring procedures and tools are needed to help programmers to optimize existing software running on current and future hardware. Low level information from hardware performance counters is vital to spot specific performance problems slowing program execution. HEP software is often huge and complex, and existing tools are unable to give results with the required granularity. We will report on the approach we have chosen to solve this problem, which involves decomposing the application into parts and monitoring each of them separately. Both counting and sampling methods are used to allow an analysis with the required custom granularity: from the global level down to the function level. A set of tools (based on perfmon2 – a software interface to hardware co...

  4. Hybrid GPU-CPU adaptive precision ray-triangle intersection tests for robust high-performance GPU dosimetry computations

    Perrotte, Lancelot; Bodin, Bruno; Chodorge, Laurent

    2011-01-01

    Before an intervention on a nuclear site, it is essential to study different scenarios to identify the least dangerous one for the operator. Therefore, it is mandatory to have an efficient dosimetry simulation code that gives accurate results. One classical method in radiation protection is the straight-line attenuation method with build-up factors. In the case of 3D industrial scenes composed of meshes, the computation cost resides in the fast computation of all of the intersections between the rays and the triangles of the scene. Efficient GPU algorithms have already been proposed that enable dosimetry calculation for a huge scene (800000 rays, 800000 triangles) in a fraction of a second. But these algorithms are not robust: because of the rounding caused by floating-point arithmetic, the numerical results of the ray-triangle intersection tests can differ from the expected mathematical results. In the worst-case scenario, this can lead to a computed dose rate dramatically lower than the real dose rate to which the operator is exposed. In this paper, we present a hybrid GPU-CPU algorithm to manage adaptive precision floating-point arithmetic. This algorithm allows robust ray-triangle intersection tests, with very small loss of performance (less than 5 % overhead), and without any need for scene-dependent tuning. (author)
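    The general idea of adaptive precision can be sketched in Python as a two-stage test: a fast floating-point evaluation decides the clear cases, and only the cases that fall too close to a decision boundary are re-evaluated exactly. This is an illustration under stated assumptions, not the paper's algorithm: it uses the Möller–Trumbore formulation, rational arithmetic from the standard `fractions` module as the exact fallback, and a heuristic guard band (`rel_tol`) rather than a rigorous error bound. In a hybrid setting, the first stage would correspond to the GPU pass and the fallback to the CPU pass.

```python
from fractions import Fraction


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _moller_trumbore(orig, direc, v0, v1, v2):
    """Unnormalised Moller-Trumbore quantities (det, u, v, t)."""
    e1, e2 = _sub(v1, v0), _sub(v2, v0)
    p = _cross(direc, e2)
    tvec = _sub(orig, v0)
    q = _cross(tvec, e1)
    return _dot(e1, p), _dot(tvec, p), _dot(direc, q), _dot(e2, q)


def _decide(det, u, v, t):
    """Hit test on the unnormalised values, triangle boundary included."""
    if det == 0:
        return False                     # ray parallel to the triangle plane
    s = 1 if det > 0 else -1
    return u * s >= 0 and v * s >= 0 and (det - u - v) * s >= 0 and t * s >= 0


def ray_hits_triangle(orig, direc, v0, v1, v2, rel_tol=1e-10):
    """Return (hit, path) where path is 'float' or 'exact'."""
    det, u, v, t = _moller_trumbore(orig, direc, v0, v1, v2)
    scale = max(abs(c) for p in (orig, direc, v0, v1, v2) for c in p)
    guard = rel_tol * scale ** 3         # heuristic guard band, not a proven bound
    if min(abs(det), abs(u), abs(v), abs(det - u - v), abs(t)) > guard:
        return _decide(det, u, v, t), "float"
    # Too close to a decision boundary: redo the test in exact rationals.
    exact = [tuple(Fraction(c) for c in p) for p in (orig, direc, v0, v1, v2)]
    return _decide(*_moller_trumbore(*exact)), "exact"
```

    Converting a float to `Fraction` is exact, so the fallback evaluates the same inputs without rounding; the cost is paid only for the few rays that graze an edge, a vertex or the triangle plane.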

  5. ARCHERRT - a GPU-based and photon-electron coupled Monte Carlo dose computing engine for radiation therapy: software development and application to helical tomotherapy.

    Su, Lin; Yang, Youming; Bednarz, Bryan; Sterpin, Edmond; Du, Xining; Liu, Tianyu; Ji, Wei; Xu, X George

    2014-07-01

    Using the graphical processing units (GPU) hardware technology, an extremely fast Monte Carlo (MC) code ARCHERRT is developed for radiation dose calculations in radiation therapy. This paper describes the detailed software development and testing for three clinical TomoTherapy® cases: the prostate, lung, and head & neck. To obtain clinically relevant dose distributions, phase space files (PSFs) created from optimized radiation therapy treatment plan fluence maps were used as the input to ARCHERRT. Patient-specific phantoms were constructed from patient CT images. Batch simulations were employed to facilitate the time-consuming task of loading large PSFs, and to improve the estimation of statistical uncertainty. Furthermore, two different Woodcock tracking algorithms were implemented and their relative performance was compared. The dose curves of an Elekta accelerator PSF incident on a homogeneous water phantom were benchmarked against DOSXYZnrc. For each of the treatment cases, dose volume histograms and isodose maps were produced from ARCHERRT and the general-purpose code, GEANT4. The gamma index analysis was performed to evaluate the similarity of voxel doses obtained from these two codes. The hardware accelerators used in this study are one NVIDIA K20 GPU, one NVIDIA K40 GPU, and six NVIDIA M2090 GPUs. In addition, to make a fairer comparison of the CPU and GPU performance, a multithreaded CPU code was developed using OpenMP and tested on an Intel E5-2620 CPU. For the water phantom, the depth dose curve and dose profiles from ARCHERRT agree well with DOSXYZnrc. For clinical cases, results from ARCHERRT are compared with those from GEANT4 and good agreement is observed. Gamma index test is performed for voxels whose dose is greater than 10% of maximum dose. For 2%/2mm criteria, the passing rates for the prostate, lung case, and head & neck cases are 99.7%, 98.5%, and 97.2%, respectively. Due to specific architecture of GPU, modified Woodcock tracking algorithm

  6. ARCHERRT – A GPU-based and photon-electron coupled Monte Carlo dose computing engine for radiation therapy: Software development and application to helical tomotherapy

    Su, Lin; Yang, Youming; Bednarz, Bryan; Sterpin, Edmond; Du, Xining; Liu, Tianyu; Ji, Wei; Xu, X. George

    2014-01-01

    Purpose: Using the graphical processing units (GPU) hardware technology, an extremely fast Monte Carlo (MC) code ARCHERRT is developed for radiation dose calculations in radiation therapy. This paper describes the detailed software development and testing for three clinical TomoTherapy® cases: the prostate, lung, and head & neck. Methods: To obtain clinically relevant dose distributions, phase space files (PSFs) created from optimized radiation therapy treatment plan fluence maps were used as the input to ARCHERRT. Patient-specific phantoms were constructed from patient CT images. Batch simulations were employed to facilitate the time-consuming task of loading large PSFs, and to improve the estimation of statistical uncertainty. Furthermore, two different Woodcock tracking algorithms were implemented and their relative performance was compared. The dose curves of an Elekta accelerator PSF incident on a homogeneous water phantom were benchmarked against DOSXYZnrc. For each of the treatment cases, dose volume histograms and isodose maps were produced from ARCHERRT and the general-purpose code, GEANT4. The gamma index analysis was performed to evaluate the similarity of voxel doses obtained from these two codes. The hardware accelerators used in this study are one NVIDIA K20 GPU, one NVIDIA K40 GPU, and six NVIDIA M2090 GPUs. In addition, to make a fairer comparison of the CPU and GPU performance, a multithreaded CPU code was developed using OpenMP and tested on an Intel E5-2620 CPU. Results: For the water phantom, the depth dose curve and dose profiles from ARCHERRT agree well with DOSXYZnrc. For clinical cases, results from ARCHERRT are compared with those from GEANT4 and good agreement is observed. Gamma index test is performed for voxels whose dose is greater than 10% of maximum dose. For 2%/2mm criteria, the passing rates for the prostate, lung case, and head & neck cases are 99.7%, 98.5%, and 97.2%, respectively. Due to specific architecture of GPU, modified
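    The gamma index test mentioned in this abstract (2%/2mm criteria, voxels above 10% of maximum dose) can be illustrated with a minimal one-dimensional sketch. This is not the ARCHERRT or GEANT4 implementation: the function name, the brute-force search over all evaluated points, and the global normalization of the dose tolerance to the maximum reference dose are all assumptions made for illustration.

```python
def gamma_pass_rate(ref, ev, spacing_mm, dose_tol=0.02, dta_mm=2.0, threshold=0.10):
    """Fraction of reference points with gamma <= 1 on a 1D dose profile.

    ref, ev    : reference and evaluated doses on the same regular grid
    spacing_mm : grid spacing in millimetres
    dose_tol   : dose criterion as a fraction of the maximum reference dose
                 (global normalization, an assumption of this sketch)
    dta_mm     : distance-to-agreement criterion in millimetres
    threshold  : points at or below this fraction of the maximum are skipped
    """
    d_max = max(ref)
    dose_crit = dose_tol * d_max
    passed = total = 0
    for i, d_ref in enumerate(ref):
        if d_ref <= threshold * d_max:
            continue  # low-dose points are excluded from the test
        total += 1
        # Squared gamma: minimum combined distance/dose deviation over all points.
        gamma_sq = min(
            ((j - i) * spacing_mm / dta_mm) ** 2 + ((d_ev - d_ref) / dose_crit) ** 2
            for j, d_ev in enumerate(ev)
        )
        if gamma_sq <= 1.0:
            passed += 1
    return passed / total if total else float("nan")
```

    A profile shifted by one 1 mm voxel passes wherever a matching dose lies within 2 mm, which is why the test tolerates small spatial offsets in steep dose gradients.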

  7. High performance technique for database applications using a hybrid GPU/CPU platform

    Zidan, Mohammed A.

    2012-07-28

    Many database applications, such as sequence comparing, sequence searching, and sequence matching, process large database sequences. We introduce a novel and efficient technique to improve the performance of database applications by using a Hybrid GPU/CPU platform. In particular, our technique solves the problem of the low efficiency resulting from running short-length sequences in a database on a GPU. To verify our technique, we applied it to the widely used Smith-Waterman algorithm. The experimental results show that our Hybrid GPU/CPU technique improves the average performance by a factor of 2.2, and improves the peak performance by a factor of 2.8 when compared to earlier implementations. Copyright © 2011 by ASME.
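    As a rough illustration of the idea, the sketch below computes a Smith-Waterman local alignment score and splits a database by sequence length. The abstract does not say how work is divided, so the length threshold, the rule that short sequences stay on the CPU, and both function names are assumptions rather than the author's method. The scoring values (match, mismatch, linear gap penalty) are likewise illustrative.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def split_by_length(sequences, threshold):
    """Hypothetical dispatch rule: short sequences for the CPU, long ones for the GPU."""
    cpu_batch = [s for s in sequences if len(s) < threshold]
    gpu_batch = [s for s in sequences if len(s) >= threshold]
    return cpu_batch, gpu_batch
```

    In a real hybrid system the two batches would be processed concurrently; here the split only shows where such a decision could be made.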

  8. Implementation of metal-friendly EAM/FS-type semi-empirical potentials in HOOMD-blue: A GPU-accelerated molecular dynamics software

    Yang, Lin; Zhang, Feng; Wang, Cai-Zhuang; Ho, Kai-Ming; Travesset, Alex

    2018-04-01

    We present an implementation of EAM and FS interatomic potentials, which are widely used in simulating metallic systems, in HOOMD-blue, a software designed to perform classical molecular dynamics simulations using GPU accelerations. We first discuss the details of our implementation and then report extensive benchmark tests. We demonstrate that single-precision floating point operations efficiently implemented on GPUs can produce sufficient accuracy when compared against double-precision codes, as demonstrated in test simulations of calculations of the glass-transition temperature of Cu64.5Zr35.5, and pair correlation function g (r) of liquid Ni3Al. Our code scales well with the size of the simulating system on NVIDIA Tesla M40 and P100 GPUs. Compared with another popular software LAMMPS running on 32 cores of AMD Opteron 6220 processors, the GPU/CPU performance ratio can reach as high as 4.6. The source code can be accessed through the HOOMD-blue web page for free by any interested user.
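    For readers unfamiliar with EAM/FS potentials, the total energy has the general form E = sum_i F(rho_i) + sum_{i<j} phi(r_ij), where rho_i = sum_{j != i} f(r_ij) is the host electron density at atom i. The sketch below evaluates this form by brute force in plain Python. It is not HOOMD-blue code; the function name and the toy functional forms used in the usage note are assumptions chosen only to make the arithmetic easy to follow.

```python
import math
from itertools import combinations


def eam_energy(positions, pair, density, embed):
    """Total EAM-style energy: pair term plus embedding term.

    positions : list of (x, y, z) tuples
    pair      : phi(r), pair potential
    density   : f(r), density contribution of a neighbour at distance r
    embed     : F(rho), embedding energy for host density rho
    """
    rho = [0.0] * len(positions)
    e_pair = 0.0
    for i, j in combinations(range(len(positions)), 2):
        r = math.dist(positions[i], positions[j])
        e_pair += pair(r)      # each pair counted once
        rho[i] += density(r)   # neighbour j contributes to the density at i
        rho[j] += density(r)   # and vice versa
    return e_pair + sum(embed(x) for x in rho)
```

    With a Finnis-Sinclair style embedding function F(rho) = -sqrt(rho), a call such as `eam_energy([(0, 0, 0), (1, 0, 0)], lambda r: 1 / r, lambda r: 4 / r**2, lambda rho: -math.sqrt(rho))` combines a pair energy of 1 with an embedding energy of -2 per atom.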

  9. High Performance Processing and Analysis of Geospatial Data Using CUDA on GPU

    STOJANOVIC, N.

    2014-11-01

    In this paper, the high-performance processing of massive geospatial data on many-core GPU (Graphic Processing Unit) is presented. We use the CUDA (Compute Unified Device Architecture) programming framework to implement parallel processing of common Geographic Information Systems (GIS) algorithms, such as viewshed analysis and map-matching. Experimental evaluation indicates the improvement in performance with respect to CPU-based solutions and shows feasibility of using GPU and CUDA for parallel implementation of GIS algorithms over large-scale geospatial datasets.

  10. GENIE: a software package for gene-gene interaction analysis in genetic association studies using multiple GPU or CPU cores

    Wang Kai

    2011-05-01

    Background: Gene-gene interaction in genetic association studies is computationally intensive when a large number of SNPs are involved. Most of the latest Central Processing Units (CPUs) have multiple cores, whereas Graphics Processing Units (GPUs) also have hundreds of cores and have been recently used to implement faster scientific software. However, currently there are no genetic analysis software packages that allow users to fully utilize the computing power of these multi-core devices for genetic interaction analysis for binary traits. Findings: Here we present a novel software package GENIE, which utilizes the power of multiple GPU or CPU processor cores to parallelize the interaction analysis. GENIE reads an entire genetic association study dataset into memory and partitions the dataset into fragments with non-overlapping sets of SNPs. For each fragment, GENIE analyzes: 1) the interaction of SNPs within it in parallel, and 2) the interaction between the SNPs of the current fragment and other fragments in parallel. We tested GENIE on a large-scale candidate gene study on high-density lipoprotein cholesterol. Using an NVIDIA Tesla C1060 graphics card, the GPU mode of GENIE achieves a speedup of 27 times over its single-core CPU mode run. Conclusions: GENIE is open-source, economical, user-friendly, and scalable. Since the computing power and memory capacity of graphics cards are increasing rapidly while their cost is going down, we anticipate that GENIE will achieve greater speedups with faster GPU cards. Documentation, source code, and precompiled binaries can be downloaded from http://www.cceb.upenn.edu/~mli/software/GENIE/.
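    The fragment scheme described above can be sketched as a small pair enumerator. This is not GENIE's code: the function names, the contiguous fixed-size fragments, and the choice to pair each fragment only with later fragments (so that no SNP pair is visited twice) are assumptions of this sketch.

```python
from itertools import combinations, product


def partition(n_snps, fragment_size):
    """Split SNP indices 0..n_snps-1 into non-overlapping contiguous fragments."""
    return [
        list(range(start, min(start + fragment_size, n_snps)))
        for start in range(0, n_snps, fragment_size)
    ]


def pair_tasks(fragments):
    """Yield every SNP pair once: within-fragment pairs, then cross-fragment pairs."""
    for i, frag in enumerate(fragments):
        yield from combinations(frag, 2)        # step 1: pairs inside the fragment
        for other in fragments[i + 1:]:
            yield from product(frag, other)     # step 2: pairs with later fragments
```

    Each yielded pair is an independent unit of work, which is what makes the analysis straightforward to spread over many GPU or CPU cores.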

  11. Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA Tesla GPU Cluster

    Allada, Veerendra; Benjegerdes, Troy; Bode, Brett

    2009-08-31

    Commodity clusters augmented with application accelerators are evolving as competitive high performance computing systems. The Graphical Processing Unit (GPU) with a very high arithmetic density and performance per price ratio is a good platform for the scientific application acceleration. In addition to the interconnect bottlenecks among the cluster compute nodes, the cost of memory copies between the host and the GPU device has to be carefully amortized to improve the overall efficiency of the application. Scientific applications also rely on efficient implementation of the Basic Linear Algebra Subroutines (BLAS), among which the General Matrix Multiply (GEMM) is considered as the workhorse subroutine. In this paper, the authors study the performance of the memory copies and GEMM subroutines that are critical to port the computational chemistry algorithms to the GPU clusters. To that end, a benchmark based on the NetPIPE framework is developed to evaluate the latency and bandwidth of the memory copies between the host and the GPU device. The performance of the single and double precision GEMM subroutines from the NVIDIA CUBLAS 2.0 library is studied. The results have been compared with those of the BLAS routines from the Intel Math Kernel Library (MKL) to understand the computational trade-offs. The test bed is an Intel Xeon cluster equipped with NVIDIA Tesla GPUs.
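    The amortization argument can be made concrete with a simple cost model. The sketch below is an assumption-laden illustration, not the authors' benchmark: it treats each host-device copy as latency plus bytes over bandwidth, counts 2*m*n*k floating-point operations for a GEMM, and assumes three square-matrix transfers (A and B to the device, C back). All function names and parameter values are hypothetical.

```python
def gemm_flops(m, n, k):
    """Floating-point operations for C = A @ B with A (m x k) and B (k x n)."""
    return 2 * m * n * k


def transfer_time(nbytes, latency_s, bandwidth_bytes_per_s):
    """Latency-plus-bandwidth model of one host-device copy."""
    return latency_s + nbytes / bandwidth_bytes_per_s


def effective_gflops(n, kernel_gflops, latency_s, bandwidth_bytes_per_s, bytes_per_elem=8):
    """GFLOP/s of an n x n GEMM once the three matrix copies are included."""
    flops = gemm_flops(n, n, n)
    compute = flops / (kernel_gflops * 1e9)
    one_copy = transfer_time(n * n * bytes_per_elem, latency_s, bandwidth_bytes_per_s)
    copies = 3 * one_copy  # A and B host-to-device, C device-to-host
    return flops / (compute + copies) / 1e9
```

    Because the arithmetic grows as n cubed while the copied data grows as n squared, the effective rate in this model approaches the kernel rate only for large matrices.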

  12. Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA Tesla GPU Cluster

    Allada, Veerendra; Benjegerdes, Troy; Bode, Brett

    2009-01-01

    Commodity clusters augmented with application accelerators are evolving as competitive high performance computing systems. The Graphical Processing Unit (GPU) with a very high arithmetic density and performance per price ratio is a good platform for the scientific application acceleration. In addition to the interconnect bottlenecks among the cluster compute nodes, the cost of memory copies between the host and the GPU device has to be carefully amortized to improve the overall efficiency of the application. Scientific applications also rely on efficient implementation of the Basic Linear Algebra Subroutines (BLAS), among which the General Matrix Multiply (GEMM) is considered as the workhorse subroutine. In this paper, the authors study the performance of the memory copies and GEMM subroutines that are critical to port the computational chemistry algorithms to the GPU clusters. To that end, a benchmark based on the NetPIPE framework is developed to evaluate the latency and bandwidth of the memory copies between the host and the GPU device. The performance of the single and double precision GEMM subroutines from the NVIDIA CUBLAS 2.0 library is studied. The results have been compared with those of the BLAS routines from the Intel Math Kernel Library (MKL) to understand the computational trade-offs. The test bed is an Intel Xeon cluster equipped with NVIDIA Tesla GPUs.

  13. A Real-Time Capable Software-Defined Receiver Using GPU for Adaptive Anti-Jam GPS Sensors

    Seo, Jiwon; Chen, Yu-Hsuan; De Lorenzo, David S.; Lo, Sherman; Enge, Per; Akos, Dennis; Lee, Jiyun

    2011-01-01

    Due to their weak received signal power, Global Positioning System (GPS) signals are vulnerable to radio frequency interference. Adaptive beam and null steering of the gain pattern of a GPS antenna array can significantly increase the resistance of GPS sensors to signal interference and jamming. Since adaptive array processing requires intensive computational power, beamsteering GPS receivers were usually implemented using hardware such as field-programmable gate arrays (FPGAs). However, a software implementation using general-purpose processors is much more desirable because of its flexibility and cost effectiveness. This paper presents a GPS software-defined radio (SDR) with adaptive beamsteering capability for anti-jam applications. The GPS SDR design is based on an optimized desktop parallel processing architecture using a quad-core Central Processing Unit (CPU) coupled with a new generation Graphics Processing Unit (GPU) having massively parallel processors. This GPS SDR demonstrates sufficient computational capability to support a four-element antenna array and future GPS L5 signal processing in real time. After providing the details of our design and optimization schemes for future GPU-based GPS SDR developments, the jamming resistance of our GPS SDR under synthetic wideband jamming is presented. Since the GPS SDR uses commercial-off-the-shelf hardware and processors, it can be easily adopted in civil GPS applications requiring anti-jam capabilities. PMID:22164116

  14. A Real-Time Capable Software-Defined Receiver Using GPU for Adaptive Anti-Jam GPS Sensors

    Dennis Akos

    2011-09-01

    Due to their weak received signal power, Global Positioning System (GPS) signals are vulnerable to radio frequency interference. Adaptive beam and null steering of the gain pattern of a GPS antenna array can significantly increase the resistance of GPS sensors to signal interference and jamming. Since adaptive array processing requires intensive computational power, beamsteering GPS receivers were usually implemented using hardware such as field-programmable gate arrays (FPGAs). However, a software implementation using general-purpose processors is much more desirable because of its flexibility and cost effectiveness. This paper presents a GPS software-defined radio (SDR) with adaptive beamsteering capability for anti-jam applications. The GPS SDR design is based on an optimized desktop parallel processing architecture using a quad-core Central Processing Unit (CPU) coupled with a new generation Graphics Processing Unit (GPU) having massively parallel processors. This GPS SDR demonstrates sufficient computational capability to support a four-element antenna array and future GPS L5 signal processing in real time. After providing the details of our design and optimization schemes for future GPU-based GPS SDR developments, the jamming resistance of our GPS SDR under synthetic wideband jamming is presented. Since the GPS SDR uses commercial-off-the-shelf hardware and processors, it can be easily adopted in civil GPS applications requiring anti-jam capabilities.

  15. ARCHERRT – A GPU-based and photon-electron coupled Monte Carlo dose computing engine for radiation therapy: Software development and application to helical tomotherapy

    Su, Lin; Du, Xining; Liu, Tianyu; Ji, Wei; Xu, X. George; Yang, Youming; Bednarz, Bryan; Sterpin, Edmond

    2014-01-01

    Purpose: Using the graphical processing units (GPU) hardware technology, an extremely fast Monte Carlo (MC) code ARCHERRT is developed for radiation dose calculations in radiation therapy. This paper describes the detailed software development and testing for three clinical TomoTherapy® cases: the prostate, lung, and head and neck. Methods: To obtain clinically relevant dose distributions, phase space files (PSFs) created from optimized radiation therapy treatment plan fluence maps were used as the input to ARCHERRT. Patient-specific phantoms were constructed from patient CT images. Batch simulations were employed to facilitate the time-consuming task of loading large PSFs, and to improve the estimation of statistical uncertainty. Furthermore, two different Woodcock tracking algorithms were implemented and their relative performance was compared. The dose curves of an Elekta accelerator PSF incident on a homogeneous water phantom were benchmarked against DOSXYZnrc. For each of the treatment cases, dose volume histograms and isodose maps were produced from ARCHERRT and the general-purpose code, GEANT4. The gamma index analysis was performed to evaluate the similarity of voxel doses obtained from these two codes. The hardware accelerators used in this study are one NVIDIA K20 GPU, one NVIDIA K40 GPU, and six NVIDIA M2090 GPUs. In addition, to make a fairer comparison of the CPU and GPU performance, a multithreaded CPU code was developed using OpenMP and tested on an Intel E5-2620 CPU. Results: For the water phantom, the depth dose curve and dose profiles from ARCHERRT agree well with DOSXYZnrc. For clinical cases, results from ARCHERRT are compared with those from GEANT4 and good agreement is observed. Gamma index test is performed for voxels whose dose is greater than 10% of maximum dose. For 2%/2mm criteria, the passing rates for the prostate, lung case, and head and neck cases are 99.7%, 98.5%, and 97.2%, respectively. Due to specific architecture of GPU

  16. The Performance Improvement of the Lagrangian Particle Dispersion Model (LPDM) Using Graphics Processing Unit (GPU) Computing

    2017-08-01

    used for its GPU computing capability during the experiment. It has Nvidia Tesla K40 GPU accelerators containing 32 GPU nodes consisting of 1024...cores. CUDA is a parallel computing platform and application programming interface (API) model that was created and designed by Nvidia to give direct...Agricultural and Forest Meteorology. 1995:76:277–291, ISSN 0168-1923. 3. GPU vs. CPU? What is GPU computing? Santa Clara (CA): Nvidia Corporation; 2017

  17. GPU Performance and Power Consumption Analysis: A DCT based denoising application

    Pi Puig, Martín; De Giusti, Laura Cristina; Naiouf, Marcelo; De Giusti, Armando Eduardo

    2017-01-01

    It is known that energy and power consumption are becoming serious metrics in the design of high performance workstations because of heat dissipation problems. In recent years, GPU accelerators have been integrated into many of these expensive systems, even though they embed more and more transistors on their chips, producing a quick increase in power consumption requirements. This paper analyzes an image processing application, in particular a Discrete Cosine Transform denoising algorithm, i...

  18. Analysis of performance improvements for host and GPU interface of the APENet+ 3D Torus network

    Ammendola A, R; Biagioni, A; Frezza, O; Lo Cicero, F; Lonardo, A; Paolucci, P S; Rossetti, D; Simula, F; Tosoratto, L; Vicini, P

    2014-01-01

    APEnet+ is an INFN (Italian Institute for Nuclear Physics) project aiming to develop a custom 3-Dimensional torus interconnect network optimized for hybrid CPU-GPU clusters dedicated to High Performance scientific Computing. The APEnet+ interconnect fabric is built on a FPGA-based PCI-express board with 6 bi-directional off-board links showing 34 Gbps of raw bandwidth per direction, and leverages peer-to-peer capabilities of Fermi and Kepler-class NVIDIA GPUs to obtain real zero-copy, GPU-to-GPU low latency transfers. The minimization of APEnet+ transfer latency is achieved through the adoption of RDMA protocol implemented in FPGA with specialized hardware blocks tightly coupled with embedded microprocessor. This architecture provides a high performance low latency offload engine for both transmit and receive side of data transactions: preliminary results are encouraging, showing 50% of bandwidth increase for large packet size transfers. In this paper we describe the APEnet+ architecture, detailing the hardware implementation and discuss the impact of such RDMA specialized hardware on host interface latency and bandwidth.

  19. Analysis of performance improvements for host and GPU interface of the APENet+ 3D Torus network

    Ammendola A, R.; Biagioni, A.; Frezza, O.; Lo Cicero, F.; Lonardo, A.; Paolucci, P. S.; Rossetti, D.; Simula, F.; Tosoratto, L.; Vicini, P.

    2014-06-01

    APEnet+ is an INFN (Italian Institute for Nuclear Physics) project aiming to develop a custom 3-Dimensional torus interconnect network optimized for hybrid CPU-GPU clusters dedicated to High Performance scientific Computing. The APEnet+ interconnect fabric is built on a FPGA-based PCI-express board with 6 bi-directional off-board links showing 34 Gbps of raw bandwidth per direction, and leverages peer-to-peer capabilities of Fermi and Kepler-class NVIDIA GPUs to obtain real zero-copy, GPU-to-GPU low latency transfers. The minimization of APEnet+ transfer latency is achieved through the adoption of RDMA protocol implemented in FPGA with specialized hardware blocks tightly coupled with embedded microprocessor. This architecture provides a high performance low latency offload engine for both transmit and receive side of data transactions: preliminary results are encouraging, showing 50% of bandwidth increase for large packet size transfers. In this paper we describe the APEnet+ architecture, detailing the hardware implementation and discuss the impact of such RDMA specialized hardware on host interface latency and bandwidth.

  20. Analysis of performance improvements for host and GPU interface of the APENet+ 3D Torus network

    Ammendola A, R [INFN Roma II, Via della Ricerca Scientifica 1 – 00133 Roma (Italy); Biagioni, A; Frezza, O; Lo Cicero, F; Lonardo, A; Paolucci, P S; Rossetti, D; Simula, F; Tosoratto, L; Vicini, P [INFN Roma I, P.le Aldo Moro 2 – 00185 Roma (Italy)

    2014-06-06

    APEnet+ is an INFN (Italian Institute for Nuclear Physics) project aiming to develop a custom 3-Dimensional torus interconnect network optimized for hybrid CPU-GPU clusters dedicated to High Performance scientific Computing. The APEnet+ interconnect fabric is built on a FPGA-based PCI-express board with 6 bi-directional off-board links showing 34 Gbps of raw bandwidth per direction, and leverages peer-to-peer capabilities of Fermi and Kepler-class NVIDIA GPUs to obtain real zero-copy, GPU-to-GPU low latency transfers. The minimization of APEnet+ transfer latency is achieved through the adoption of RDMA protocol implemented in FPGA with specialized hardware blocks tightly coupled with embedded microprocessor. This architecture provides a high performance low latency offload engine for both transmit and receive side of data transactions: preliminary results are encouraging, showing 50% of bandwidth increase for large packet size transfers. In this paper we describe the APEnet+ architecture, detailing the hardware implementation and discuss the impact of such RDMA specialized hardware on host interface latency and bandwidth.

  1. A Performance/Cost Evaluation for a GPU-Based Drug Discovery Application on Volunteer Computing

    Guerrero, Ginés D.; Imbernón, Baldomero; García, José M.

    2014-01-01

    Bioinformatics is an interdisciplinary research field that develops tools for the analysis of large biological databases, and, thus, the use of high performance computing (HPC) platforms is mandatory for the generation of useful biological knowledge. The latest generation of graphics processing units (GPUs) has democratized the use of HPC as they push desktop computers to cluster-level performance. Many applications within this field have been developed to leverage these powerful and low-cost architectures. However, these applications still need to scale to larger GPU-based systems to enable remarkable advances in the fields of healthcare, drug discovery, genome research, etc. The inclusion of GPUs in HPC systems exacerbates power and temperature issues, increasing the total cost of ownership (TCO). This paper explores the benefits of volunteer computing to scale bioinformatics applications as an alternative to owning large GPU-based local infrastructures. We use as a benchmark a GPU-based drug discovery application called BINDSURF, whose computational requirements go beyond a single desktop machine. Volunteer computing is presented as a cheap and valid HPC system for those bioinformatics applications that need to process huge amounts of data and where the response time is not a critical factor. PMID:25025055

  2. A Performance/Cost Evaluation for a GPU-Based Drug Discovery Application on Volunteer Computing

    Ginés D. Guerrero

    2014-01-01

    Bioinformatics is an interdisciplinary research field that develops tools for the analysis of large biological databases, and, thus, the use of high performance computing (HPC) platforms is mandatory for the generation of useful biological knowledge. The latest generation of graphics processing units (GPUs) has democratized the use of HPC as they push desktop computers to cluster-level performance. Many applications within this field have been developed to leverage these powerful and low-cost architectures. However, these applications still need to scale to larger GPU-based systems to enable remarkable advances in the fields of healthcare, drug discovery, genome research, etc. The inclusion of GPUs in HPC systems exacerbates power and temperature issues, increasing the total cost of ownership (TCO). This paper explores the benefits of volunteer computing to scale bioinformatics applications as an alternative to owning large GPU-based local infrastructures. We use as a benchmark a GPU-based drug discovery application called BINDSURF, whose computational requirements go beyond a single desktop machine. Volunteer computing is presented as a cheap and valid HPC system for those bioinformatics applications that need to process huge amounts of data and where the response time is not a critical factor.

  3. GPU: the biggest key processor for AI and parallel processing

    Baji, Toru

    2017-07-01

    Two types of processors exist in the market. One is the conventional CPU and the other is the Graphics Processing Unit (GPU). A typical CPU is composed of 1 to 8 cores, while a GPU has thousands of cores. The CPU is good for sequential processing, while the GPU is good at accelerating software with heavy parallel execution. The GPU was initially dedicated to 3D graphics. However, from 2006, when GPUs started to adopt general-purpose cores, it was noticed that this architecture can be used as a general-purpose massively parallel processor. NVIDIA developed a software framework, Compute Unified Device Architecture (CUDA), that makes it possible to easily program the GPU for these applications. With CUDA, GPUs started to be used widely in workstations and supercomputers. Recently, two key technologies have been highlighted in the industry: Artificial Intelligence (AI) and autonomous driving cars. AI requires massively parallel operations to train many-layer neural networks. With CPUs alone, it was impossible to finish the training in a practical time. The latest multi-GPU system with P100 makes it possible to finish the training in a few hours. For autonomous driving cars, TOPS-class performance is required to implement perception, localization, and path planning processing, and again an SoC with an integrated GPU will play a key role there. In this paper, the evolution of the GPU, which is one of the biggest commercial devices requiring state-of-the-art fabrication technology, will be introduced. An overview of key GPU-demanding applications like the ones described above will also be introduced.

  4. GPU in Physics Computation: Case Geant4 Navigation

    Seiskari, Otto; Niemi, Tapio

    2012-01-01

    General purpose computing on graphic processing units (GPU) is a potential method of speeding up scientific computation with low cost and high energy efficiency. We experimented with the particle physics simulation toolkit Geant4 used at CERN to benchmark its geometry navigation functionality on a GPU. The goal was to find out whether Geant4 physics simulations could benefit from GPU acceleration and how difficult it is to modify Geant4 code to run on a GPU. We ported selected parts of Geant4 code to C99 & CUDA and implemented a simple gamma physics simulation utilizing this code to measure efficiency. The performance of the program was tested by running it on two different platforms: NVIDIA GeForce 470 GTX GPU and a 12-core AMD CPU system. Our conclusion was that GPUs can be a competitive alternative to multi-core computers but porting existing software in an efficient way is challenging.

  5. High performance in software development

    CERN. Geneva; Haapio, Petri; Liukkonen, Juha-Matti

    2015-01-01

    What are the ingredients of high-performing software? Software development, especially for large high-performance systems, is one of the most complex tasks mankind has ever tried. Technological change leads to huge opportunities but challenges our old ways of working. Processing large data sets, possibly in real time or with other tight computational constraints, requires an efficient solution architecture. Efficiency requirements span from the distributed storage and large-scale organization of computation and data onto the lowest level of processor and data bus behavior. Integrating performance behavior over these levels is especially important when the computation is resource-bounded, as it is in numerics: physical simulation, machine learning, estimation of statistical models, etc. For example, memory locality and utilization of vector processing are essential for harnessing the computing power of modern processor architectures due to the deep memory hierarchies of modern general-purpose computers. As a r...

  6. High performance cellular level agent-based simulation with FLAME for the GPU.

    Richmond, Paul; Walker, Dawn; Coakley, Simon; Romano, Daniela

    2010-05-01

    Driven by the availability of experimental data and ability to simulate a biological scale which is of immediate interest, the cellular scale is fast emerging as an ideal candidate for middle-out modelling. As with 'bottom-up' simulation approaches, cellular level simulations demand a high degree of computational power, which in large-scale simulations can only be achieved through parallel computing. The flexible large-scale agent modelling environment (FLAME) is a template driven framework for agent-based modelling (ABM) on parallel architectures ideally suited to the simulation of cellular systems. It is available for both high performance computing clusters (www.flame.ac.uk) and GPU hardware (www.flamegpu.com) and uses a formal specification technique that acts as a universal modelling format. This not only creates an abstraction from the underlying hardware architectures, but avoids the steep learning curve associated with programming them. In benchmarking tests and simulations of advanced cellular systems, FLAME GPU has reported massive improvement in performance over more traditional ABM frameworks. This allows the time spent in the development and testing stages of modelling to be drastically reduced and creates the possibility of real-time visualisation for simple visual face-validation.

  7. Development of a GPU-based high-performance radiative transfer model for the Infrared Atmospheric Sounding Interferometer (IASI)

    Huang, Bormin; Mielikainen, Jarno; Oh, Hyunjong; Allen Huang, Hung-Lung

    2011-01-01

    Satellite-observed radiance is a nonlinear functional of surface properties and atmospheric temperature and absorbing gas profiles as described by the radiative transfer equation (RTE). In the era of hyperspectral sounders with thousands of high-resolution channels, the computation of the radiative transfer model becomes more time-consuming. The radiative transfer model performance in operational numerical weather prediction systems still limits the number of channels we can use in hyperspectral sounders to only a few hundred. To take full advantage of such high-resolution infrared observations, a computationally efficient radiative transfer model is needed to facilitate satellite data assimilation. In recent years the programmable commodity graphics processing unit (GPU) has evolved into a highly parallel, multi-threaded, many-core processor with tremendous computational speed and very high memory bandwidth. The radiative transfer model is very suitable for the GPU implementation to take advantage of the hardware's efficiency and parallelism where radiances of many channels can be calculated in parallel in GPUs. In this paper, we develop a GPU-based high-performance radiative transfer model for the Infrared Atmospheric Sounding Interferometer (IASI) launched in 2006 onboard the first European meteorological polar-orbiting satellite, METOP-A. Each IASI spectrum has 8461 spectral channels. The IASI radiative transfer model consists of three modules. The first module for computing the regression predictors takes less than 0.004% of CPU time, while the second module for transmittance computation and the third module for radiance computation take approximately 92.5% and 7.5%, respectively. Our GPU-based IASI radiative transfer model is developed to run on a low-cost personal supercomputer with four GPUs with a total of 960 compute cores, delivering nearly 4 TFlops theoretical peak performance. By massively parallelizing the second and third modules, we reached 364x

  8. Advanced Modular Software Performance Monitoring

    CERN. Geneva

    2012-01-01

    The LHCb software is based on the Gaudi framework, on top of which are built several large and complex software applications. The LHCb experiment is now in the active phase of collecting and analyzing data and significant performance problems arise in the Gaudi based software beginning from High Level Trigger (HLT) programs and ending with data analysis frameworks (DaVinci). It’s not easy to find hot spots in the code - only special tools can help to understand where CPU or memory usage is not reasonable. There exist many performance analyzing tools, but the main problem is that they show reports in terms of class and function names and such information usually is not very useful - the majority of algorithm developers use the Gaudi framework abstractions and usually do not know about functions which lie at the lower level. We will show a new approach which adds to performance reports a higher abstraction level based on knowledge of framework architecture and run-time object properties. A set of profiling to...

  9. Advanced modular software performance monitoring

    Mazurov, A

    2012-01-01

    The LHCb software is based on the Gaudi framework, on top of which are built several large and complex software applications. As the LHCb experiment is now in the active phase of collecting and analyzing data, performance problems arise in various parts of the software, from the High Level Trigger (HLT) programs to data analysis frameworks. It is not easy to find hotspots in the code - only specialized tools can help to understand where CPU or memory usage is not reasonable. There exist many performance analyzing tools, but the main problem is that they show reports in terms of class and function names and such information usually is not very useful - the majority of algorithm developers use the Gaudi framework abstractions and usually do not know about functions which lie at the lower level. We will show a new approach which adds to performance reports a higher abstraction level based on knowledge of framework architecture and run-time object properties. A set of profiling tools (based on Intel VTune Amplif...

  10. MILC Code Performance on High End CPU and GPU Supercomputer Clusters

    DeTar, Carleton; Gottlieb, Steven; Li, Ruizi; Toussaint, Doug

    2018-03-01

    With recent developments in parallel supercomputing architecture, many core, multi-core, and GPU processors are now commonplace, resulting in more levels of parallelism, memory hierarchy, and programming complexity. It has been necessary to adapt the MILC code to these new processors starting with NVIDIA GPUs, and more recently, the Intel Xeon Phi processors. We report on our efforts to port and optimize our code for the Intel Knights Landing architecture. We consider performance of the MILC code with MPI and OpenMP, and optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on the staggered conjugate gradient and gauge force. We also consider performance on recent NVIDIA GPUs using the QUDA library.

  11. MILC Code Performance on High End CPU and GPU Supercomputer Clusters

    DeTar Carleton

    2018-01-01

    With recent developments in parallel supercomputing architecture, many core, multi-core, and GPU processors are now commonplace, resulting in more levels of parallelism, memory hierarchy, and programming complexity. It has been necessary to adapt the MILC code to these new processors starting with NVIDIA GPUs, and more recently, the Intel Xeon Phi processors. We report on our efforts to port and optimize our code for the Intel Knights Landing architecture. We consider performance of the MILC code with MPI and OpenMP, and optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on the staggered conjugate gradient and gauge force. We also consider performance on recent NVIDIA GPUs using the QUDA library.

  12. TU-AB-BRC-10: Modeling of Radiotherapy Linac Source Terms Using ARCHER Monte Carlo Code: Performance Comparison of GPU and MIC Computing Accelerators

    Liu, T; Lin, H; Xu, X; Su, L; Shi, C; Tang, X; Bednarz, B

    2016-01-01

    Purpose: (1) To perform phase space (PS) based source modeling for Tomotherapy and Varian TrueBeam 6 MV Linacs, (2) to examine the accuracy and performance of the ARCHER Monte Carlo code on a heterogeneous computing platform with Many Integrated Core coprocessors (MIC, aka Xeon Phi) and GPUs, and (3) to explore the software micro-optimization methods. Methods: The patient-specific source of Tomotherapy and Varian TrueBeam Linacs was modeled using the PS approach. For the helical Tomotherapy case, the PS data were calculated in our previous study (Su et al. 2014 41(7) Medical Physics). For the single-view Varian TrueBeam case, we analytically derived them from the raw patient-independent PS data in IAEA’s database, partial geometry information of the jaw and MLC as well as the fluence map. The phantom was generated from DICOM images. The Monte Carlo simulation was performed by ARCHER-MIC and GPU codes, which were benchmarked against a modified parallel DPM code. Software micro-optimization was systematically conducted, and was focused on SIMD vectorization of tight for-loops and data prefetch, with the ultimate goal of increasing 512-bit register utilization and reducing memory access latency. Results: Dose calculation was performed for two clinical cases, a Tomotherapy-based prostate cancer treatment and a TrueBeam-based left breast treatment. ARCHER was verified against the DPM code. The statistical uncertainty of the dose to the PTV was less than 1%. Using double-precision, the total wall time of the multithreaded CPU code on an X5650 CPU was 339 seconds for the Tomotherapy case and 131 seconds for the TrueBeam, while on 3 5110P MICs it was reduced to 79 and 59 seconds, respectively. The single-precision GPU code on a K40 GPU took 45 seconds for the Tomotherapy dose calculation. Conclusion: We have extended ARCHER, the MIC and GPU-based Monte Carlo dose engine, to Tomotherapy and TrueBeam dose calculations.
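
    The wall times quoted in the Results can be turned into speedup factors with simple division. The short Python sketch below does only that arithmetic; the dictionary names and the `speedup` helper are illustrative and do not come from the ARCHER code. The GPU figure is kept separate because the abstract reports it for single precision, while the CPU and MIC times are double precision, so that ratio is not a like-for-like comparison.

```python
# Wall times in seconds, copied from the abstract (double precision).
cpu_seconds = {"Tomotherapy": 339.0, "TrueBeam": 131.0}   # X5650 CPU
mic_seconds = {"Tomotherapy": 79.0, "TrueBeam": 59.0}     # three 5110P MICs

# Single-precision K40 GPU time, reported for the Tomotherapy case only.
gpu_seconds_single_precision = {"Tomotherapy": 45.0}


def speedup(baseline_seconds, accelerated_seconds):
    """Return how many times faster the accelerated run is."""
    return baseline_seconds / accelerated_seconds


mic_speedups = {
    case: speedup(cpu_seconds[case], mic_seconds[case]) for case in cpu_seconds
}
# Mixed-precision ratio: indicative only, not a like-for-like comparison.
gpu_ratio = speedup(
    cpu_seconds["Tomotherapy"], gpu_seconds_single_precision["Tomotherapy"]
)

for case, factor in mic_speedups.items():
    print(f"{case}: MIC speedup over CPU = {factor:.2f}x")
print(f"Tomotherapy: GPU (single) vs CPU (double) ratio = {gpu_ratio:.2f}x")
```

    Under these numbers the three MICs are roughly 4.3 times faster than the CPU for the Tomotherapy case and roughly 2.2 times faster for the TrueBeam case.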

  13. TU-AB-BRC-10: Modeling of Radiotherapy Linac Source Terms Using ARCHER Monte Carlo Code: Performance Comparison of GPU and MIC Computing Accelerators

    Liu, T; Lin, H; Xu, X [Rensselaer Polytechnic Institute, Troy, NY (United States); Su, L [Johns Hopkins University, Baltimore, MD (United States); Shi, C [Saint Vincent Medical Center, Bridgeport, CT (United States); Tang, X [Memorial Sloan Kettering Cancer Center, West Harrison, NY (United States); Bednarz, B [University of Wisconsin, Madison, WI (United States)

    2016-06-15

    Purpose: (1) To perform phase space (PS) based source modeling for Tomotherapy and Varian TrueBeam 6 MV Linacs, (2) to examine the accuracy and performance of the ARCHER Monte Carlo code on a heterogeneous computing platform with Many Integrated Core coprocessors (MIC, aka Xeon Phi) and GPUs, and (3) to explore the software micro-optimization methods. Methods: The patient-specific source of Tomotherapy and Varian TrueBeam Linacs was modeled using the PS approach. For the helical Tomotherapy case, the PS data were calculated in our previous study (Su et al. 2014 41(7) Medical Physics). For the single-view Varian TrueBeam case, we analytically derived them from the raw patient-independent PS data in IAEA’s database, partial geometry information of the jaw and MLC as well as the fluence map. The phantom was generated from DICOM images. The Monte Carlo simulation was performed by ARCHER-MIC and GPU codes, which were benchmarked against a modified parallel DPM code. Software micro-optimization was systematically conducted, and was focused on SIMD vectorization of tight for-loops and data prefetch, with the ultimate goal of increasing 512-bit register utilization and reducing memory access latency. Results: Dose calculation was performed for two clinical cases, a Tomotherapy-based prostate cancer treatment and a TrueBeam-based left breast treatment. ARCHER was verified against the DPM code. The statistical uncertainty of the dose to the PTV was less than 1%. Using double-precision, the total wall time of the multithreaded CPU code on an X5650 CPU was 339 seconds for the Tomotherapy case and 131 seconds for the TrueBeam, while on 3 5110P MICs it was reduced to 79 and 59 seconds, respectively. The single-precision GPU code on a K40 GPU took 45 seconds for the Tomotherapy dose calculation. Conclusion: We have extended ARCHER, the MIC and GPU-based Monte Carlo dose engine, to Tomotherapy and TrueBeam dose calculations.

  14. High performance technique for database applications using a hybrid GPU/CPU platform

    Zidan, Mohammed A.; Bonny, Talal; Salama, Khaled N.

    2012-01-01

    Hybrid GPU/CPU platform. In particular, our technique solves the problem of the low efficiency resulting from running short-length sequences in a database on a GPU. To verify our technique, we applied it to the widely used Smith-Waterman algorithm

  15. GPU-accelerated micromagnetic simulations using cloud computing

    Jermain, C.L.; Rowlands, G.E.; Buhrman, R.A.; Ralph, D.C.

    2016-01-01

    Highly parallel graphics processing units (GPUs) can improve the speed of micromagnetic simulations significantly as compared to conventional computing using central processing units (CPUs). We present a strategy for performing GPU-accelerated micromagnetic simulations by utilizing cost-effective GPU access offered by cloud computing services with an open-source Python-based program for running the MuMax3 micromagnetics code remotely. We analyze the scaling and cost benefits of using cloud computing for micromagnetics. - Highlights: • The benefits of cloud computing for GPU-accelerated micromagnetics are examined. • We present the MuCloud software for running simulations on cloud computing. • Simulation run times are measured to benchmark cloud computing performance. • Comparison benchmarks are analyzed between CPU and GPU based solvers.

  16. GPU-accelerated micromagnetic simulations using cloud computing

    Jermain, C.L., E-mail: clj72@cornell.edu [Cornell University, Ithaca, NY 14853 (United States); Rowlands, G.E.; Buhrman, R.A. [Cornell University, Ithaca, NY 14853 (United States); Ralph, D.C. [Cornell University, Ithaca, NY 14853 (United States); Kavli Institute at Cornell, Ithaca, NY 14853 (United States)

    2016-03-01

    Highly parallel graphics processing units (GPUs) can improve the speed of micromagnetic simulations significantly as compared to conventional computing using central processing units (CPUs). We present a strategy for performing GPU-accelerated micromagnetic simulations by utilizing cost-effective GPU access offered by cloud computing services with an open-source Python-based program for running the MuMax3 micromagnetics code remotely. We analyze the scaling and cost benefits of using cloud computing for micromagnetics. - Highlights: • The benefits of cloud computing for GPU-accelerated micromagnetics are examined. • We present the MuCloud software for running simulations on cloud computing. • Simulation run times are measured to benchmark cloud computing performance. • Comparison benchmarks are analyzed between CPU and GPU based solvers.

  17. Software Performs Complex Design Analysis

    2008-01-01

    Designers use computational fluid dynamics (CFD) to gain greater understanding of the fluid flow phenomena involved in components being designed. They also use finite element analysis (FEA) as a tool to help gain greater understanding of the structural response of components to loads, stresses and strains, and the prediction of failure modes. Automated CFD and FEA engineering design has centered on shape optimization, which has been hindered by two major problems: 1) inadequate shape parameterization algorithms, and 2) inadequate algorithms for CFD and FEA grid modification. Working with software engineers at Stennis Space Center, a NASA commercial partner, Optimal Solutions Software LLC, was able to utilize its revolutionary, one-of-a-kind arbitrary shape deformation (ASD) capability (a major advancement in solving these two aforementioned problems) to optimize the shapes of complex pipe components that transport highly sensitive fluids. The ASD technology solves the problem of inadequate shape parameterization algorithms by allowing the CFD designers to freely create their own shape parameters, therefore eliminating the restriction of only being able to use the computer-aided design (CAD) parameters. The problem of inadequate algorithms for CFD grid modification is solved by the fact that the new software performs a smooth volumetric deformation. This eliminates the extremely costly process of having to remesh the grid for every shape change desired. The program can perform a design change in a markedly reduced amount of time, a process that would traditionally involve the designer returning to the CAD model to reshape and then remesh the shapes, something that has been known to take hours, days, or even weeks or months, depending upon the size of the model.

  18. Improving Software Performance in the Compute Unified Device Architecture

    Alexandru PIRJAN

    2010-01-01

    This paper analyzes several aspects regarding the improvement of software performance for applications written in the Compute Unified Device Architecture (CUDA). We address an issue of great importance when programming a CUDA application: the Graphics Processing Unit’s (GPU’s) memory management through transpose kernels. We also benchmark and evaluate the performance for progressively optimizing a transposing matrix application in CUDA. One particular interest was to research how well the optimization techniques, applied to software applications written in CUDA, scale to the latest generation of general-purpose graphics processing units (GPGPU), like the Fermi architecture implemented in the GTX480 and the previous architecture implemented in the GTX280. Lately, there has been a lot of interest in the literature for this type of optimization analysis, but none of the works so far (to our best knowledge) tried to validate whether the optimizations can apply to a GPU from the latest Fermi architecture and how well the Fermi architecture scales to these software performance improving techniques.

  19. High Performance Biological Pairwise Sequence Alignment: FPGA versus GPU versus Cell BE versus GPP

    Khaled Benkrid

    2012-01-01

    This paper explores the pros and cons of reconfigurable computing in the form of FPGAs for high performance efficient computing. In particular, the paper presents the results of a comparative study between three different acceleration technologies, namely, Field Programmable Gate Arrays (FPGAs), Graphics Processor Units (GPUs), and IBM’s Cell Broadband Engine (Cell BE), in the design and implementation of the widely-used Smith-Waterman pairwise sequence alignment algorithm, with general purpose processors as a base reference implementation. Comparison criteria include speed, energy consumption, and purchase and development costs. The study shows that FPGAs largely outperform all other implementation platforms on performance per watt criterion and perform better than all other platforms on performance per dollar criterion, although by a much smaller margin. Cell BE and GPU come second and third, respectively, on both performance per watt and performance per dollar criteria. In general, in order to outperform other technologies on performance per dollar criterion (using currently available hardware and development tools), FPGAs need to achieve at least two orders of magnitude speed-up compared to general-purpose processors and one order of magnitude speed-up compared to domain-specific technologies such as GPUs.
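
    For readers unfamiliar with the algorithm being benchmarked, a minimal serial Python sketch of Smith-Waterman local alignment scoring follows. It is a plain reference version, not the FPGA, GPU or Cell BE implementation studied in the paper; the scoring values (`match=2`, `mismatch=-1`, linear `gap=-1`) are assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score of strings a and b.

    Uses a linear gap penalty and keeps only two rows of the
    dynamic-programming table, since only the score is needed.
    The scoring parameters are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            score = max(0, diag, up, left)  # 0 restarts the local alignment
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

    Each table cell depends only on its upper, left and upper-left neighbours, so cells on the same anti-diagonal are independent of one another; that independence is what parallel hardware implementations typically exploit.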

  20. A virtualized software based on the NVIDIA cuFFT library for image denoising: performance analysis

    Galletti, Ardelio; Marcellino, Livia; Montella, Raffaele

    2017-01-01

    Generic Virtualization Service (GVirtuS) is a new solution for enabling GPGPU on Virtual Machines or low powered devices. This paper focuses on the performance analysis that can be obtained using a GPGPU virtualized software. Recently, GVirtuS has been extended in order to support CUDA...... ancillary libraries with good results. Here, our aim is to analyze the applicability of this powerful tool to a real problem, which uses the NVIDIA cuFFT library. As a case study we consider a simple denoising algorithm, implementing a virtualized GPU-parallel software based on the convolution theorem...

  1. Balancing Energy and Performance in Dense Linear System Solvers for Hybrid ARM+GPU platforms

    Juan P. Silva

    2016-04-01

    The high performance computing community has traditionally focused solely on the reduction of execution time, though in recent years the optimization of energy consumption has become a main issue. A reduction of energy usage without a degradation of performance requires the adoption of energy-efficient hardware platforms accompanied by the development of energy-aware algorithms and computational kernels. The solution of linear systems is a key operation for many scientific and engineering problems. Its relevance has motivated a substantial amount of work, and consequently, it is possible to find high performance solvers for a wide variety of hardware platforms. In this work, we aim to develop a high performance and energy-efficient linear system solver. In particular, we develop two solvers for a low-power CPU-GPU platform, the NVIDIA Jetson TK1. These solvers implement the Gauss-Huard algorithm yielding an efficient usage of the target hardware as well as an efficient memory access. The experimental evaluation shows that the novel proposal reports important savings in both time and energy-consumption when compared with the state-of-the-art solvers of the platform.
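
    As a rough illustration of the elimination order used by the Gauss-Huard algorithm, here is a small dense Python sketch. It is unblocked, runs on the CPU and omits pivoting, so it assumes every pivot is nonzero; the abstract does not describe how the paper's Jetson TK1 solvers are organized, and this sketch makes no claim about them.

```python
def gauss_huard_solve(a, b):
    """Solve a x = b using the Gauss-Huard elimination order.

    Illustrative sketch only: no pivoting, so a zero pivot raises
    ZeroDivisionError. Inputs are lists of lists and a list.
    """
    n = len(a)
    # Augmented matrix [A | b], copied so the inputs are left untouched.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for k in range(n):
        # Row elimination: remove the contribution of rows 0..k-1 from row k.
        for i in range(k):
            factor = m[k][i]
            for j in range(k, n + 1):
                m[k][j] -= factor * m[i][j]
        pivot = m[k][k]
        if pivot == 0.0:
            raise ZeroDivisionError("zero pivot; this sketch has no pivoting")
        # Normalization: scale the rest of row k by the pivot.
        for j in range(k + 1, n + 1):
            m[k][j] /= pivot
        # Column elimination: clear column k in the rows above row k.
        for i in range(k):
            factor = m[i][k]
            for j in range(k + 1, n + 1):
                m[i][j] -= factor * m[k][j]
    # The last column of the augmented matrix now holds the solution.
    return [m[i][n] for i in range(n)]


if __name__ == "__main__":
    print(gauss_huard_solve([[2, 1, 0], [1, 3, 1], [0, 1, 4]], [4, 10, 14]))
```

    Unlike Gauss-Jordan elimination, each step touches only row k and the rows above it, which is where the method's lower operation count comes from.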

  2. Performance Analysis of GFDL's GCM Line-By-Line Radiative Transfer Model on GPU and MIC Architectures

    Menzel, R.; Paynter, D.; Jones, A. L.

    2017-12-01

    Due to their relatively low computational cost, radiative transfer models in global climate models (GCMs) run on traditional CPU architectures generally consist of shortwave and longwave parameterizations over a small number of wavelength bands. With the rise of newer GPU and MIC architectures, however, the performance of high resolution line-by-line radiative transfer models may soon approach that of the physical parameterizations currently employed in GCMs. Here we present an analysis of the current performance of a new line-by-line radiative transfer model currently under development at GFDL. Although originally designed to specifically exploit GPU architectures through the use of CUDA, the radiative transfer model has recently been extended to include OpenMP in an effort to also effectively target MIC architectures such as Intel's Xeon Phi. Using input data provided by the upcoming Radiative Forcing Model Intercomparison Project (RFMIP, as part of CMIP 6), we compare model results and performance data for various model configurations and spectral resolutions run on both GPU and Intel Knights Landing architectures to analogous runs of the standard Oxford Reference Forward Model on traditional CPUs.

  3. A high performance image processing platform based on CPU-GPU heterogeneous cluster with parallel image reconstructions for micro-CT

    Ding Yu; Qi Yujin; Zhang Xuezhu; Zhao Cuilan

    2011-01-01

    In this paper, we report the development of a high-performance image processing platform, which is based on a CPU-GPU heterogeneous cluster. Currently, it consists of Dell Precision T7500 and HP XW8600 workstations with a parallel programming and runtime environment, using the message-passing interface (MPI) and CUDA (Compute Unified Device Architecture). We succeeded in developing parallel image processing techniques for 3D image reconstruction of X-ray micro-CT imaging. The results show that a GPU is about 194 times faster than a single CPU, and the CPU-GPU cluster is about 46 times faster than the CPU cluster. These meet the requirements of rapid 3D image reconstruction and real time image display. In conclusion, the use of a CPU-GPU heterogeneous cluster is an effective way to build a high-performance image processing platform. (authors)

  4. GPU Computing For Particle Tracking

    Nishimura, Hiroshi; Song, Kai; Muriki, Krishna; Sun, Changchun; James, Susan; Qin, Yong

    2011-01-01

    This is a feasibility study of using a modern Graphics Processing Unit (GPU) to parallelize the accelerator particle tracking code. To demonstrate the massive parallelization features provided by GPU computing, a simplified TracyGPU program is developed for dynamic aperture calculation. Performances, issues, and challenges from introducing GPU are also discussed. General purpose Computation on Graphics Processing Units (GPGPU) brings massive parallel computing capabilities to numerical calculation. However, the unique architecture of the GPU requires a comprehensive understanding of the hardware and programming model in order to optimize existing applications well. In the field of accelerator physics, the dynamic aperture calculation of a storage ring, which is often the most time consuming part of the accelerator modeling and simulation, can benefit from GPU due to its embarrassingly parallel feature, which fits well with the GPU programming model. In this paper, we use the Tesla C2050 GPU, which consists of 14 multi-processors (MP) with 32 cores on each MP, therefore a total of 448 cores, to host thousands of threads dynamically. A thread is a logical execution unit of the program on the GPU. In the GPU programming model, threads are grouped into a collection of blocks. Within each block, multiple threads share the same code, and up to 48 KB of shared memory. Multiple thread blocks form a grid, which is executed as a GPU kernel. A simplified code that is a subset of Tracy++ (2) is developed to demonstrate the possibility of using GPU to speed up the dynamic aperture calculation by having each thread track a particle.
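
    The grid, block and thread hierarchy described above can be mimicked serially to show how each thread ends up with its own particle. The Python sketch below is an emulation for illustration only, not TracyGPU code: `launch`, `blocks_needed` and the `track_one` callback are hypothetical names, and the index formula is the usual one-dimensional CUDA convention (block index times block size plus thread index).

```python
# Core count quoted in the abstract for the Tesla C2050.
MULTIPROCESSORS = 14
CORES_PER_MP = 32
TOTAL_CORES = MULTIPROCESSORS * CORES_PER_MP  # 14 * 32 = 448


def global_thread_id(block_idx, block_dim, thread_idx):
    """Flat index of a thread in a one-dimensional grid of blocks."""
    return block_idx * block_dim + thread_idx


def blocks_needed(n_particles, block_dim):
    """Smallest number of blocks that covers every particle."""
    return (n_particles + block_dim - 1) // block_dim


def launch(n_particles, block_dim, track_one):
    """Serial emulation of a kernel launch with one thread per particle.

    track_one is a hypothetical per-particle function; on a GPU the
    two loops below would run concurrently rather than in sequence.
    """
    results = [None] * n_particles
    for block_idx in range(blocks_needed(n_particles, block_dim)):
        for thread_idx in range(block_dim):
            pid = global_thread_id(block_idx, block_dim, thread_idx)
            if pid < n_particles:  # threads past the last particle do nothing
                results[pid] = track_one(pid)
    return results


if __name__ == "__main__":
    print(TOTAL_CORES, blocks_needed(1000, 256))
```

    The bounds check matters because the particle count is rarely an exact multiple of the block size, so the last block usually contains idle threads.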

  5. High performance image acquisition and processing architecture for fast plant system controllers based on FPGA and GPU

    Nieto, J.; Sanz, D.; Guillén, P.; Esquembri, S.; Arcas, G. de; Ruiz, M.; Vega, J.; Castro, R.

    2016-01-01

    Highlights: • To test an image acquisition and processing system for Camera Link devices based in a FPGA, compliant with ITER fast controllers. • To move data acquired from the set NI1483-NIPXIe7966R directly to a NVIDIA GPU using NVIDIA GPUDirect RDMA technology. • To obtain a methodology to include GPUs processing in ITER Fast Plant Controllers, using EPICS integration through Nominal Device Support (NDS). - Abstract: The two dominant technologies that are being used in real time image processing are Field Programmable Gate Array (FPGA) and Graphical Processor Unit (GPU) due to their algorithm parallelization capabilities. But not much work has been done to standardize how these technologies can be integrated in data acquisition systems, where control and supervisory requirements are in place, such as ITER (International Thermonuclear Experimental Reactor). This work proposes an architecture, and a development methodology, to develop image acquisition and processing systems based on FPGAs and GPUs compliant with ITER fast controller solutions. A use case based on a Camera Link device connected to an FPGA DAQ device (National Instruments FlexRIO technology), and a NVIDIA Tesla GPU series card has been developed and tested. The architecture proposed has been designed to optimize system performance by minimizing data transfer operations and CPU intervention thanks to the use of NVIDIA GPUDirect RDMA and DMA technologies. This allows moving the data directly between the different hardware elements (FPGA DAQ-GPU-CPU) avoiding CPU intervention and therefore the use of intermediate CPU memory buffers. A special effort has been put to provide a development methodology that, maintaining the highest possible abstraction from the low level implementation details, allows obtaining solutions that conform to CODAC Core System standards by providing EPICS and Nominal Device Support.

  6. High performance image acquisition and processing architecture for fast plant system controllers based on FPGA and GPU

    Nieto, J., E-mail: jnieto@sec.upm.es [Grupo de Investigación en Instrumentación y Acústica Aplicada, Universidad Politécnica de Madrid, Crta. Valencia Km-7, Madrid 28031 (Spain); Sanz, D.; Guillén, P.; Esquembri, S.; Arcas, G. de; Ruiz, M. [Grupo de Investigación en Instrumentación y Acústica Aplicada, Universidad Politécnica de Madrid, Crta. Valencia Km-7, Madrid 28031 (Spain); Vega, J.; Castro, R. [Asociación EURATOM/CIEMAT para Fusión, Madrid (Spain)

    2016-11-15

    Highlights: • To test an image acquisition and processing system for Camera Link devices based in a FPGA, compliant with ITER fast controllers. • To move data acquired from the set NI1483-NIPXIe7966R directly to a NVIDIA GPU using NVIDIA GPUDirect RDMA technology. • To obtain a methodology to include GPUs processing in ITER Fast Plant Controllers, using EPICS integration through Nominal Device Support (NDS). - Abstract: The two dominant technologies that are being used in real time image processing are Field Programmable Gate Array (FPGA) and Graphical Processor Unit (GPU) due to their algorithm parallelization capabilities. But not much work has been done to standardize how these technologies can be integrated in data acquisition systems, where control and supervisory requirements are in place, such as ITER (International Thermonuclear Experimental Reactor). This work proposes an architecture, and a development methodology, to develop image acquisition and processing systems based on FPGAs and GPUs compliant with ITER fast controller solutions. A use case based on a Camera Link device connected to an FPGA DAQ device (National Instruments FlexRIO technology), and a NVIDIA Tesla GPU series card has been developed and tested. The architecture proposed has been designed to optimize system performance by minimizing data transfer operations and CPU intervention thanks to the use of NVIDIA GPUDirect RDMA and DMA technologies. This allows moving the data directly between the different hardware elements (FPGA DAQ-GPU-CPU) avoiding CPU intervention and therefore the use of intermediate CPU memory buffers. A special effort has been put to provide a development methodology that, maintaining the highest possible abstraction from the low level implementation details, allows obtaining solutions that conform to CODAC Core System standards by providing EPICS and Nominal Device Support.

  7. Performance analysis of the FDTD method applied to holographic volume gratings: Multi-core CPU versus GPU computing

    Francés, J.; Bleda, S.; Neipp, C.; Márquez, A.; Pascual, I.; Beléndez, A.

    2013-03-01

    The finite-difference time-domain method (FDTD) allows electromagnetic field distribution analysis as a function of time and space. The method is applied to analyze holographic volume gratings (HVGs) for the near-field distribution at optical wavelengths. Usually, this application requires the simulation of wide areas, which implies more memory and processing time. In this work, we propose a specific implementation of the FDTD method including several add-ons for a precise simulation of optical diffractive elements. Values in the near-field region are computed considering the illumination of the grating by means of a plane wave for different angles of incidence and including absorbing boundaries as well. We compare the results obtained by FDTD with those obtained using a matrix method (MM) applied to diffraction gratings. In addition, we have developed two optimized versions of the algorithm, for both CPU and GPU, in order to analyze the improvement of using the new NVIDIA Fermi GPU architecture versus a highly tuned multi-core CPU as a function of the simulation size. In particular, the optimized CPU implementation takes advantage of the arithmetic and data transfer streaming SIMD (single instruction multiple data) extensions (SSE) included explicitly in the code and also of multi-threading by means of OpenMP directives. Good agreement between the results obtained using the FDTD and MM methods is found, thus validating our methodology. Moreover, the performance of the GPU is compared to the SSE+OpenMP CPU implementation, and it is quantitatively determined that a highly optimized CPU program can be competitive for a wider range of simulation sizes, whereas GPU computing becomes more powerful for large-scale simulations.
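
    To make the time-stepping idea concrete, the following Python sketch runs a textbook one-dimensional FDTD update in free space with normalized units. It is far simpler than the implementation the abstract describes: there is no grating, no plane-wave illumination at an angle and no absorbing boundary, and the grid size, Courant number and Gaussian source parameters are all assumptions made for illustration.

```python
import math


def fdtd_1d(n_cells=200, n_steps=100, source_cell=100, courant=0.5):
    """Leapfrog update of Ez and Hy on a 1-D staggered grid.

    Illustrative free-space sketch in normalized units; the grid ends
    are left untreated (no absorbing boundary).
    """
    ez = [0.0] * n_cells
    hy = [0.0] * n_cells
    for step in range(n_steps):
        # Update the electric field from the spatial difference of H.
        for k in range(1, n_cells):
            ez[k] += courant * (hy[k - 1] - hy[k])
        # Soft source: add a Gaussian pulse in time at one cell.
        ez[source_cell] += math.exp(-0.5 * ((step - 30) / 8.0) ** 2)
        # Update the magnetic field from the spatial difference of E.
        for k in range(n_cells - 1):
            hy[k] += courant * (ez[k] - ez[k + 1])
    return ez, hy


if __name__ == "__main__":
    ez, _ = fdtd_1d()
    print(max(abs(v) for v in ez))
```

    Every cell update reads only its immediate neighbours from the previous half step, so all cells can be updated independently within a sweep; that is the property both SIMD and GPU versions rely on.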

  8. Performance evaluation for volumetric segmentation of multiple sclerosis lesions using MATLAB and computing engine in the graphical processing unit (GPU)

    Le, Anh H.; Park, Young W.; Ma, Kevin; Jacobs, Colin; Liu, Brent J.

    2010-03-01

    Multiple Sclerosis (MS) is a progressive neurological disease affecting myelin pathways in the brain. Multiple lesions in the white matter can cause paralysis and severe motor disabilities of the affected patient. To solve the issue of inconsistency and user-dependency in manual lesion measurement of MRI, we have proposed a 3-D automated lesion quantification algorithm to enable objective and efficient lesion volume tracking. The computer-aided detection (CAD) of MS, written in MATLAB, utilizes the K-Nearest Neighbors (KNN) method to compute the probability of lesions on a per-voxel basis. Despite the highly optimized image processing algorithm that is used in CAD development, MS CAD integration and evaluation in clinical workflow is technically challenging due to the requirement of high computation rates and memory bandwidth in the recursive nature of the algorithm. In this paper, we present the development and evaluation of using a computing engine in the graphical processing unit (GPU) with MATLAB for segmentation of MS lesions. The paper investigates the utilization of a high-end GPU for parallel computing of KNN in the MATLAB environment to improve algorithm performance. The integration is accomplished using NVIDIA's CUDA developmental toolkit for MATLAB. The results of this study will validate the practicality and effectiveness of the prototype MS CAD in a clinical setting. The GPU method may allow MS CAD to be rapidly integrated into an electronic patient record or any disease-centric health care system.
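
    One common way to obtain a per-voxel probability from KNN is to take the fraction of the k nearest labelled training samples that belong to the lesion class. The Python sketch below shows that idea on made-up two-dimensional feature vectors; the abstract does not say which features, distance measure or value of k the MS CAD uses, so all of those are assumptions here.

```python
import math


def knn_lesion_probability(voxel, training_features, training_labels, k=3):
    """Fraction of the k nearest training samples labelled as lesion (1).

    voxel and each training feature are equal-length numeric tuples;
    labels are 0 (normal) or 1 (lesion). Euclidean distance is an
    illustrative assumption.
    """
    by_distance = sorted(
        (math.dist(voxel, features), label)
        for features, label in zip(training_features, training_labels)
    )
    nearest = by_distance[:k]
    return sum(label for _, label in nearest) / len(nearest)


if __name__ == "__main__":
    # Hypothetical feature vectors, e.g. two intensity channels per voxel.
    features = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
    labels = [0, 0, 0, 1, 1, 1]
    print(knn_lesion_probability((5.2, 5.1), features, labels))
```

    Because each voxel is classified independently of the others, the per-voxel loop is a natural candidate for the GPU parallelization the paper investigates.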

  9. ARCHER_RT – A GPU-based and photon-electron coupled Monte Carlo dose computing engine for radiation therapy: Software development and application to helical tomotherapy

    Su, Lin; Du, Xining; Liu, Tianyu; Ji, Wei; Xu, X. George, E-mail: xug2@rpi.edu [Nuclear Engineering Program, Rensselaer Polytechnic Institute, Troy, New York 12180 (United States); Yang, Youming; Bednarz, Bryan [Medical Physics, University of Wisconsin, Madison, Wisconsin 53706 (United States); Sterpin, Edmond [Molecular Imaging, Radiotherapy and Oncology, Université catholique de Louvain, Brussels, Belgium 1348 (Belgium)

    2014-07-15

    Purpose: Using graphics processing unit (GPU) hardware technology, an extremely fast Monte Carlo (MC) code ARCHER_RT is developed for radiation dose calculations in radiation therapy. This paper describes the detailed software development and testing for three clinical TomoTherapy® cases: the prostate, lung, and head and neck. Methods: To obtain clinically relevant dose distributions, phase space files (PSFs) created from optimized radiation therapy treatment plan fluence maps were used as the input to ARCHER_RT. Patient-specific phantoms were constructed from patient CT images. Batch simulations were employed to facilitate the time-consuming task of loading large PSFs, and to improve the estimation of statistical uncertainty. Furthermore, two different Woodcock tracking algorithms were implemented and their relative performance was compared. The dose curves of an Elekta accelerator PSF incident on a homogeneous water phantom were benchmarked against DOSXYZnrc. For each of the treatment cases, dose volume histograms and isodose maps were produced from ARCHER_RT and the general-purpose code, GEANT4. The gamma index analysis was performed to evaluate the similarity of voxel doses obtained from these two codes. The hardware accelerators used in this study are one NVIDIA K20 GPU, one NVIDIA K40 GPU, and six NVIDIA M2090 GPUs. In addition, to make a fairer comparison of the CPU and GPU performance, a multithreaded CPU code was developed using OpenMP and tested on an Intel E5-2620 CPU. Results: For the water phantom, the depth dose curve and dose profiles from ARCHER_RT agree well with DOSXYZnrc. For clinical cases, results from ARCHER_RT are compared with those from GEANT4 and good agreement is observed. Gamma index test is performed for voxels whose dose is greater than 10% of maximum dose. For 2%/2mm criteria, the passing rates for the prostate, lung, and head and neck cases are 99.7%, 98.5%, and 97.2%, respectively. Due to

  10. Evaluation of high-performance computing software

    Browne, S.; Dongarra, J. [Univ. of Tennessee, Knoxville, TN (United States)]; Rowan, T. [Oak Ridge National Lab., TN (United States)]

    1996-12-31

    The absence of unbiased and up-to-date comparative evaluations of high-performance computing software complicates a user's search for the appropriate software package. The National HPCC Software Exchange (NHSE) is attacking this problem using an approach that includes independent evaluations of software, incorporation of author and user feedback into the evaluations, and Web access to the evaluations. We are applying this approach to the Parallel Tools Library (PTLIB), a new software repository for parallel systems software and tools, and HPC-Netlib, a high performance branch of the Netlib mathematical software repository. Updating the evaluations with feedback and making them available via the Web helps ensure accuracy and timeliness, and using independent reviewers produces unbiased comparative evaluations difficult to find elsewhere.

  11. Strategies employed for LHC software performance studies

    Nowak, A

    2010-01-01

    The objective of this work is to collect and assess the software performance related strategies employed by the major players in the LHC software arena: the four main experiments (ALICE, ATLAS, CMS and LHCb) and the two main software frameworks (Geant4 and ROOT). As the software used differs between the parties, so do the directions and methods in optimization, and their intensity. The common feeling shared by nearly all interviewed parties is that performance is not one of their top priorities and that maintaining it at a constant level is a satisfactory solution, given the resources at hand. In principle, despite some organized efforts, a less structured approach seems to be the dominant one, and opportunistic optimization prevails. Four out of six surveyed groups are investigating memory management related effects, deemed to be the primary cause of their performance issues. The most commonly used tools include Valgrind and homegrown software. All questioned groups expressed the desire for advanced tools, s...

  12. Effect of Functional diversity on Software Performance

    Viswanatha Rao, Balajee

    2011-01-01

    Over the past few decades, a large body of literature has been produced on functional diversity and performance. However, the relationship between functional diversity and performance in the software industry is not clearly explained, and results are found to be inconsistent. The main focus of this research is to explore the effects of functional diversity on software project performance by conducting a qualitative study. Four metrics were chosen from literature namely decision making, creativity an...

  13. The evolution of CMS software performance studies

    Kortelainen, Matti J

    2010-01-01

    CMS has had an ongoing and dedicated effort to optimize software performance for several years. Initially this effort focused primarily on the cleanup of many issues coming from basic C++ errors, namely reducing dynamic memory churn, unnecessary copies/temporaries and tools to routinely monitor these things. Over the past 1.5 years, however, the transition to 64bit, newer versions of the gcc compiler, newer tools and the enabling of techniques like vectorization have made possible more sophisticated improvements to the software performance. This presentation will cover this evolution and describe the current avenues being pursued for software performance, as well as the corresponding gains.

  14. The evolution of CMS software performance studies

    Kortelainen, M. J.; Elmer, P.; Eulisse, G.; Innocente, V.; Jones, C. D.; Tuura, L.

    2011-12-01

    CMS has had an ongoing and dedicated effort to optimize software performance for several years. Initially this effort focused primarily on the cleanup of many issues coming from basic C++ errors, namely reducing dynamic memory churn, unnecessary copies/temporaries and tools to routinely monitor these things. Over the past 1.5 years, however, the transition to 64bit, newer versions of the gcc compiler, newer tools and the enabling of techniques like vectorization have made possible more sophisticated improvements to the software performance. This presentation will cover this evolution and describe the current avenues being pursued for software performance, as well as the corresponding gains.

  15. Heterogeneous GPU&CPU Cluster For High Performance Computing In Cryptography

    Michał Marks

    2012-01-01

    This paper addresses issues associated with distributed computing systems and the application of mixed GPU&CPU technology to data encryption and decryption algorithms. We describe a heterogeneous cluster HGCC formed by two types of nodes: Intel processor with NVIDIA graphics processing unit and AMD processor with AMD graphics processing unit (formerly ATI), and a novel software framework that hides the heterogeneity of our cluster and provides tools for solving complex scientific and engineering problems. Finally, we present the results of numerical experiments. The considered case study is concerned with parallel implementations of selected cryptanalysis algorithms. The main goal of the paper is to show the wide applicability of the GPU&CPU technology to large scale computation and data processing.

  16. Personal Supercomputing for Monte Carlo Simulation Using a GPU

    Oh, Jae-Yong; Koo, Yang-Hyun; Lee, Byung-Ho [Korea Atomic Energy Research Institute, Daejeon (Korea, Republic of)]

    2008-05-15

    Since the usability, accessibility, and maintenance of a personal computer (PC) are very good, a PC is a useful computer simulation tool for researchers. With the improved performance of a PC's CPU, it has enough calculation power to simulate a small scale system. However, if a system is large or has a long time scale, we need a cluster computer or supercomputer. Recently great changes have occurred in the PC calculation environment. A graphics processing unit (GPU) on a graphics card, previously used only to calculate display data, has a superior calculation capability to a PC's CPU. This GPU calculation performance matches that of a supercomputer from 2000. Although it has such great calculation potential, it is not easy to program a simulation code for a GPU, because of the difficult programming techniques needed to convert a calculation matrix to a 3D rendering image using graphics APIs. In 2006, NVIDIA provided the Software Development Kit (SDK) for the programming environment for NVIDIA's graphics cards, which is called the Compute Unified Device Architecture (CUDA). It makes programming on the GPU easy without knowledge of the graphics APIs. This paper describes the basic architectures of NVIDIA's GPU and CUDA, and carries out a performance benchmark for the Monte Carlo simulation.

  17. Personal Supercomputing for Monte Carlo Simulation Using a GPU

    Oh, Jae-Yong; Koo, Yang-Hyun; Lee, Byung-Ho

    2008-01-01

    Since the usability, accessibility, and maintenance of a personal computer (PC) are very good, a PC is a useful computer simulation tool for researchers. With the improved performance of a PC's CPU, it has enough calculation power to simulate a small scale system. However, if a system is large or has a long time scale, we need a cluster computer or supercomputer. Recently great changes have occurred in the PC calculation environment. A graphics processing unit (GPU) on a graphics card, previously used only to calculate display data, has a superior calculation capability to a PC's CPU. This GPU calculation performance matches that of a supercomputer from 2000. Although it has such great calculation potential, it is not easy to program a simulation code for a GPU, because of the difficult programming techniques needed to convert a calculation matrix to a 3D rendering image using graphics APIs. In 2006, NVIDIA provided the Software Development Kit (SDK) for the programming environment for NVIDIA's graphics cards, which is called the Compute Unified Device Architecture (CUDA). It makes programming on the GPU easy without knowledge of the graphics APIs. This paper describes the basic architectures of NVIDIA's GPU and CUDA, and carries out a performance benchmark for the Monte Carlo simulation.

  18. Computational Approach for Securing Radiology-Diagnostic Data in Connected Health Network using High-Performance GPU-Accelerated AES.

    Adeshina, A M; Hashim, R

    2017-03-01

    Diagnostic radiology is a core and integral part of modern medicine, paving ways for the primary care physicians in the disease diagnoses, treatments and therapy managements. Obviously, all recent standard healthcare procedures have immensely benefitted from the contemporary information technology revolutions, apparently revolutionizing those approaches to acquiring, storing and sharing of diagnostic data for efficient and timely diagnosis of diseases. Connected health network was introduced as an alternative to the ageing traditional concept in healthcare system, improving hospital-physician connectivity and clinical collaborations. Undoubtedly, the modern medicinal approach has drastically improved healthcare but at the expense of high computational cost and possible breach of diagnosis privacy. Consequently, a number of cryptographical techniques are recently being applied to clinical applications, but the challenges of not being able to successfully encrypt both the image and the textual data persist. Furthermore, processing time of encryption-decryption of medical datasets, within a considerable lower computational cost without jeopardizing the required security strength of the encryption algorithm, still remains as an outstanding issue. This study proposes a secured radiology-diagnostic data framework for connected health network using high-performance GPU-accelerated Advanced Encryption Standard. The study was evaluated with radiology image datasets consisting of brain MR and CT datasets obtained from the department of Surgery, University of North Carolina, USA, and the Swedish National Infrastructure for Computing. Sample patients' notes from the University of North Carolina, School of medicine at Chapel Hill were also used to evaluate the framework for its strength in encrypting-decrypting textual data in the form of medical report. Significantly, the framework is not only able to accurately encrypt and decrypt medical image datasets, but it also

  19. Power and performance software analysis and optimization

    Kukunas, Jim

    2015-01-01

    Power and Performance: Software Analysis and Optimization is a guide to solving performance problems in modern Linux systems. Power-efficient chips are no help if the software those chips run on is inefficient. Starting with the necessary architectural background as a foundation, the book demonstrates the proper usage of performance analysis tools in order to pinpoint the cause of performance problems, and includes best practices for handling common performance issues those tools identify. Provides expert perspective from a key member of Intel's optimization team on how processors and memory

  20. Web-based, GPU-accelerated, Monte Carlo simulation and visualization of indirect radiation imaging detector performance.

    Dong, Han; Sharma, Diksha; Badano, Aldo

    2014-12-01

    Monte Carlo simulations play a vital role in the understanding of the fundamental limitations, design, and optimization of existing and emerging medical imaging systems. Efforts in this area have resulted in the development of a wide variety of open-source software packages. One such package, hybridMANTIS, uses a novel hybrid concept to model indirect scintillator detectors by balancing the computational load using dual CPU and graphics processing unit (GPU) processors, obtaining computational efficiency with reasonable accuracy. In this work, the authors describe two open-source visualization interfaces, webMANTIS and visualMANTIS, to facilitate the setup of computational experiments via hybridMANTIS. The visualization tools visualMANTIS and webMANTIS enable the user to control simulation properties through a user interface. In the case of webMANTIS, control via a web browser allows access through mobile devices such as smartphones or tablets. webMANTIS acts as a server back-end and communicates with an NVIDIA GPU computing cluster that can support multiuser environments where users can execute different experiments in parallel. The output consists of point response and pulse-height spectrum, and optical transport statistics generated by hybridMANTIS. The users can download the output images and statistics through a zip file for future reference. In addition, webMANTIS provides a visualization window that displays a few selected optical photon paths as they get transported through the detector columns and allows the user to trace the history of the optical photons. The visualization tools visualMANTIS and webMANTIS provide features such as on the fly generation of pulse-height spectra and response functions for microcolumnar x-ray imagers while allowing users to save simulation parameters and results from prior experiments. The graphical interfaces simplify the simulation setup and allow the user to go directly from specifying input parameters to receiving visual

  1. Web-based, GPU-accelerated, Monte Carlo simulation and visualization of indirect radiation imaging detector performance

    Dong, Han; Sharma, Diksha; Badano, Aldo

    2014-01-01

    Purpose: Monte Carlo simulations play a vital role in the understanding of the fundamental limitations, design, and optimization of existing and emerging medical imaging systems. Efforts in this area have resulted in the development of a wide variety of open-source software packages. One such package, hybridMANTIS, uses a novel hybrid concept to model indirect scintillator detectors by balancing the computational load using dual CPU and graphics processing unit (GPU) processors, obtaining computational efficiency with reasonable accuracy. In this work, the authors describe two open-source visualization interfaces, webMANTIS and visualMANTIS, to facilitate the setup of computational experiments via hybridMANTIS. Methods: The visualization tools visualMANTIS and webMANTIS enable the user to control simulation properties through a user interface. In the case of webMANTIS, control via a web browser allows access through mobile devices such as smartphones or tablets. webMANTIS acts as a server back-end and communicates with an NVIDIA GPU computing cluster that can support multiuser environments where users can execute different experiments in parallel. Results: The output consists of point response and pulse-height spectrum, and optical transport statistics generated by hybridMANTIS. The users can download the output images and statistics through a zip file for future reference. In addition, webMANTIS provides a visualization window that displays a few selected optical photon paths as they get transported through the detector columns and allows the user to trace the history of the optical photons. Conclusions: The visualization tools visualMANTIS and webMANTIS provide features such as on the fly generation of pulse-height spectra and response functions for microcolumnar x-ray imagers while allowing users to save simulation parameters and results from prior experiments. The graphical interfaces simplify the simulation setup and allow the user to go directly from specifying

  2. Web-based, GPU-accelerated, Monte Carlo simulation and visualization of indirect radiation imaging detector performance

    Dong, Han; Sharma, Diksha; Badano, Aldo, E-mail: aldo.badano@fda.hhs.gov [Division of Imaging, Diagnostics, and Software Reliability, Center for Devices and Radiological Health, U.S. Food and Drug Administration, Silver Spring, Maryland 20993 (United States)]

    2014-12-15

    Purpose: Monte Carlo simulations play a vital role in the understanding of the fundamental limitations, design, and optimization of existing and emerging medical imaging systems. Efforts in this area have resulted in the development of a wide variety of open-source software packages. One such package, hybridMANTIS, uses a novel hybrid concept to model indirect scintillator detectors by balancing the computational load using dual CPU and graphics processing unit (GPU) processors, obtaining computational efficiency with reasonable accuracy. In this work, the authors describe two open-source visualization interfaces, webMANTIS and visualMANTIS, to facilitate the setup of computational experiments via hybridMANTIS. Methods: The visualization tools visualMANTIS and webMANTIS enable the user to control simulation properties through a user interface. In the case of webMANTIS, control via a web browser allows access through mobile devices such as smartphones or tablets. webMANTIS acts as a server back-end and communicates with an NVIDIA GPU computing cluster that can support multiuser environments where users can execute different experiments in parallel. Results: The output consists of point response and pulse-height spectrum, and optical transport statistics generated by hybridMANTIS. The users can download the output images and statistics through a zip file for future reference. In addition, webMANTIS provides a visualization window that displays a few selected optical photon paths as they get transported through the detector columns and allows the user to trace the history of the optical photons. Conclusions: The visualization tools visualMANTIS and webMANTIS provide features such as on the fly generation of pulse-height spectra and response functions for microcolumnar x-ray imagers while allowing users to save simulation parameters and results from prior experiments. The graphical interfaces simplify the simulation setup and allow the user to go directly from specifying

  3. GPU Computing Gems Emerald Edition

    Hwu, Wen-mei W

    2011-01-01

    ".the perfect companion to Programming Massively Parallel Processors by Hwu & Kirk." -Nicolas Pinto, Research Scientist at Harvard & MIT, NVIDIA Fellow 2009-2010 Graphics processing units (GPUs) can do much more than render graphics. Scientists and researchers increasingly look to GPUs to improve the efficiency and performance of computationally-intensive experiments across a range of disciplines. GPU Computing Gems: Emerald Edition brings their techniques to you, showcasing GPU-based solutions including: Black hole simulations with CUDA GPU-accelerated computation and interactive display of

  4. Modeling of Radiotherapy Linac Source Terms Using ARCHER Monte Carlo Code: Performance Comparison for GPU and MIC Parallel Computing Devices

    Lin, Hui; Liu, Tianyu; Su, Lin; Bednarz, Bryan; Caracappa, Peter; Xu, X. George

    2017-09-01

    Monte Carlo (MC) simulation is well recognized as the most accurate method for radiation dose calculations. For radiotherapy applications, accurate modelling of the source term, i.e. the clinical linear accelerator is critical to the simulation. The purpose of this paper is to perform source modelling and examine the accuracy and performance of the models on Intel Many Integrated Core coprocessors (aka Xeon Phi) and Nvidia GPU using ARCHER and explore the potential optimization methods. Phase Space-based source modelling for has been implemented. Good agreements were found in a tomotherapy prostate patient case and a TrueBeam breast case. From the aspect of performance, the whole simulation for prostate plan and breast plan cost about 173s and 73s with 1% statistical error.

  5. Modeling of Radiotherapy Linac Source Terms Using ARCHER Monte Carlo Code: Performance Comparison for GPU and MIC Parallel Computing Devices

    Lin Hui

    2017-01-01

    Monte Carlo (MC) simulation is well recognized as the most accurate method for radiation dose calculations. For radiotherapy applications, accurate modelling of the source term, i.e. the clinical linear accelerator is critical to the simulation. The purpose of this paper is to perform source modelling and examine the accuracy and performance of the models on Intel Many Integrated Core coprocessors (aka Xeon Phi) and Nvidia GPU using ARCHER and explore the potential optimization methods. Phase Space-based source modelling for has been implemented. Good agreements were found in a tomotherapy prostate patient case and a TrueBeam breast case. From the aspect of performance, the whole simulation for prostate plan and breast plan cost about 173s and 73s with 1% statistical error.

  6. Simulation of reaction diffusion processes over biologically relevant size and time scales using multi-GPU workstations.

    Hallock, Michael J; Stone, John E; Roberts, Elijah; Fry, Corey; Luthey-Schulten, Zaida

    2014-05-01

    Simulation of in vivo cellular processes with the reaction-diffusion master equation (RDME) is a computationally expensive task. Our previous software enabled simulation of inhomogeneous biochemical systems for small bacteria over long time scales using the MPD-RDME method on a single GPU. Simulations of larger eukaryotic systems exceed the on-board memory capacity of individual GPUs, and long time simulations of modest-sized cells such as yeast are impractical on a single GPU. We present a new multi-GPU parallel implementation of the MPD-RDME method based on a spatial decomposition approach that supports dynamic load balancing for workstations containing GPUs of varying performance and memory capacity. We take advantage of high-performance features of CUDA for peer-to-peer GPU memory transfers and evaluate the performance of our algorithms on state-of-the-art GPU devices. We present parallel efficiency and performance results for simulations using multiple GPUs as system size, particle counts, and number of reactions grow. We also demonstrate multi-GPU performance in simulations of the Min protein system in E. coli. Moreover, our multi-GPU decomposition and load balancing approach can be generalized to other lattice-based problems.
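
    The idea of a spatial decomposition with dynamic load balancing across unequal GPUs can be sketched as follows. This is not the authors' algorithm, whose details the abstract does not give: it is a minimal illustration that assumes the lattice is cut into contiguous slabs of planes along one axis and that each device's throughput is estimated from its last measured step time. The names `partition_lattice` and `rebalance` are hypothetical.

```python
def partition_lattice(n_planes, device_rates):
    """Split n_planes lattice planes into contiguous slabs, one per device.

    Slab sizes are proportional to device_rates (relative throughput),
    rounded with the largest-remainder method so they sum to n_planes.
    Returns a list of (start, stop) half-open plane ranges.
    """
    total = sum(device_rates)
    ideal = [n_planes * rate / total for rate in device_rates]
    counts = [int(x) for x in ideal]
    leftover = n_planes - sum(counts)
    # Hand the remaining planes to the devices with the largest remainders.
    by_remainder = sorted(range(len(ideal)),
                          key=lambda k: ideal[k] - counts[k], reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1
    slabs = []
    start = 0
    for count in counts:
        slabs.append((start, start + count))
        start += count
    return slabs


def rebalance(n_planes, slabs, step_times):
    """Re-partition using the throughput observed in the last time step.

    step_times[k] is the (positive) time device k needed for its slab.
    """
    rates = [(stop - start) / t for (start, stop), t in zip(slabs, step_times)]
    return partition_lattice(n_planes, rates)
```

    For example, two devices that start with equal slabs of 50 planes, where the first takes twice as long per step, end up with 33 and 67 planes after one call to `rebalance`. A real implementation would also have to respect each GPU's memory capacity and exchange boundary planes between neighbouring devices.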

  7. GPU applications for data processing

    Vladymyrov, Mykhailo, E-mail: mykhailo.vladymyrov@cern.ch [LPI - Lebedev Physical Institute of the Russian Academy of Sciences, RUS-119991 Moscow (Russian Federation)]; Aleksandrov, Andrey [LPI - Lebedev Physical Institute of the Russian Academy of Sciences, RUS-119991 Moscow (Russian Federation); INFN sezione di Napoli, I-80125 Napoli (Italy)]; Tioukov, Valeri [INFN sezione di Napoli, I-80125 Napoli (Italy)]

    2015-12-31

    Modern experiments that use nuclear photoemulsion require fast and efficient data acquisition from the emulsion. New approaches to developing scanning systems require real-time processing of large amounts of data. Methods that use Graphical Processing Unit (GPU) computing power for emulsion data processing are presented here. It is shown how GPU-accelerated emulsion processing helped us to raise the scanning speed by a factor of nine.

  8. ATLAS Offline Software Performance Monitoring and Optimization

    Chauhan, N; Kittelmann, T; Langenberg, R; Mandrysch, R; Salzburger, A; Seuster, R; Ritsch, E; Stewart, G; van Eldik, N; Vitillo, R

    2014-01-01

    In a complex multi-developer, multi-package software environment, such as the ATLAS offline Athena framework, tracking the performance of the code can be a non-trivial task in itself. In this paper we describe improvements in the instrumentation of ATLAS offline software that have given considerable insight into the performance of the code and helped to guide optimisation. Code can be instrumented firstly using the PAPI tool, which is a programming interface for accessing hardware performance counters. PAPI events can count floating point operations, cycles and instructions and cache accesses. Triggering PAPI to start/stop counting for each algorithm and processed event gives a good understanding of the whole algorithm level performance of ATLAS code. Further data can be obtained using pin, a dynamic binary instrumentation tool. Pintools can be used to obtain statistics similar to those from PAPI, but advantageously without requiring recompilation of the code. Fine grained routine and instruction level instrumentation is...
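
    The start/stop counting pattern described here, applied per algorithm and per processed event, can be sketched in Python. This is only an illustration of the bookkeeping: `CounterLog`, `measure`, and `read_counter` are hypothetical names, and the default counter is the wall clock (`time.perf_counter_ns`) standing in for a hardware counter. It does not use PAPI or the ATLAS instrumentation.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class CounterLog:
    """Accumulates counter deltas per algorithm and per (algorithm, event)."""

    def __init__(self, read_counter=time.perf_counter_ns):
        # read_counter: any zero-argument callable returning a
        # monotonically increasing integer (assumption of this sketch).
        self.read_counter = read_counter
        self.totals = defaultdict(int)   # algorithm -> summed delta
        self.per_event = {}              # (algorithm, event) -> delta

    @contextmanager
    def measure(self, algorithm, event):
        start = self.read_counter()      # "start counting"
        try:
            yield
        finally:
            delta = self.read_counter() - start   # "stop counting"
            self.totals[algorithm] += delta
            self.per_event[(algorithm, event)] = delta
```

    Wrapping each algorithm call in `with log.measure(name, event_number):` inside the event loop then yields both a per-algorithm total and a per-event breakdown.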

  9. ATLAS Offline Software Performance Monitoring and Optimization

    Chauhan, N; The ATLAS collaboration; Kittelmann, T; Langenberg, R; Mandrysch, R; Salzburger, A; Seuster, R; Ritsch, E; Stewart, G; van Eldik, N; Vitillo, R

    2013-01-01

    In a complex multi-developer, multi-package software environment, such as the ATLAS offline Athena framework, tracking the performance of the code can be a non-trivial task in itself. In this paper we describe improvements in the instrumentation of ATLAS offline software that have given considerable insight into the performance of the code and helped to guide optimisation. Code can be instrumented firstly using the PAPI tool, which is a programming interface for accessing hardware performance counters. PAPI events can count floating point operations, cycles and instructions and cache accesses. Triggering PAPI to start/stop counting for each algorithm and processed event gives a good understanding of the whole algorithm level performance of ATLAS code. Further data can be obtained using pin, a dynamic binary instrumentation tool. Pintools can be used to obtain statistics similar to those from PAPI, but advantageously without requiring recompilation of the code. Fine grained routine and instruction level instrumentation is...

  10. Adaptation of quantum chemistry software for the electronic structure calculations on GPU for solid-state systems

    Gusakov, V.E.; Bel'ko, V.I.; Dorozhkin, N.N.

    2015-01-01

    We report on the adaptation of quantum chemistry software (Quantum Espresso and LASTO) for electronic structure calculations of complex solid-state systems on GeForce series GPUs using the nVIDIA CUDA technology. Specifically, protective coverings based on transition metal nitrides are considered. (authors)

  11. GPU-Based High-performance Imaging for Mingantu Spectral RadioHeliograph

    Mei, Ying; Wang, Feng; Wang, Wei; Chen, Linjie; Liu, Yingbo; Deng, Hui; Dai, Wei; Liu, Cuiyin; Yan, Yihua

    2018-01-01

    As a dedicated solar radio interferometer, the MingantU SpEctral RadioHeliograph (MUSER) generates massive observational data in the frequency range of 400 MHz-15 GHz. High-performance imaging forms a significantly important aspect of MUSER’s massive data processing requirements. In this study, we implement a practical high-performance imaging pipeline for MUSER data processing. At first, the specifications of the MUSER are introduced and its imaging requirements are analyzed. Referring to the most commonly used radio astronomy software such as CASA and MIRIAD, we then implement a high-performance imaging pipeline based on the Graphics Processing Unit technology with respect to the current operational status of the MUSER. A series of critical algorithms and their pseudo codes, i.e., detection of the solar disk and sky brightness, automatic centering of the solar disk and estimation of the number of iterations for clean algorithms, are proposed in detail. The preliminary experimental results indicate that the proposed imaging approach significantly increases the processing performance of MUSER and generates images with high-quality, which can meet the requirements of the MUSER data processing. Supported by the National Key Research and Development Program of China (2016YFE0100300), the Joint Research Fund in Astronomy (No. U1531132, U1631129, U1231205) under cooperative agreement between the National Natural Science Foundation of China (NSFC) and the Chinese Academy of Sciences (CAS), the National Natural Science Foundation of China (Nos. 11403009 and 11463003).

  12. Asset management: integrated software optimizes production performance

    Polczer, S.

    1998-06-01

    Two new multi-dimensional databases, which expand the 'row and column' concept of spreadsheets into multiple categories of data called dimensions, are described. These integrated software packages provide the foundation for industry players such as Poco Petroleum Ltd and Numac Energy Inc to gain a competitive advantage, by overhauling their respective data collection and retrieval systems to allow for timely cost analysis and financial reporting. Energy Warehouse, an on-line analytical processing product marketed by SysGold Ltd, is one of the software products described. It gathers various sources of information, allows advanced searches and generates reports previously unavailable in other conventional financial accounting systems. The second product discussed - the Canadian Upstream Energy System (CUES) - is an on-line analytical processing system developed by Oracle Corporation and Calgary-based Applied Terravision Systems (ATS) Inc. CUES combines Oracle's universal data server and software development tools with ATS's upstream financial, land, geotechnical and production applications. The software also allows for optimization of facilities, analysis of production efficiencies and comparison of performance against industry standards.

  13. Antecedents and Moderators of Software Professionals’ Performance

    Shiva Prasad H. C.

    2014-02-01

    Software professionals’ (SPs') performance is often understood narrowly in terms of input–output productivity. This study approaches performance from a broader perspective and examines whether the emotional intelligence competencies (EICs) of SPs, the leadership style of team leaders, social capital among team members, and human resource management (HRM) practices of software firms affect performance of SPs. It also tests whether the value of and opportunities for knowledge sharing moderate such relationships. Data were collected from 441 Indian SPs in a questionnaire survey. Fifty-five team leaders assessed the performance of SPs, and SPs assessed the other constructs. Results revealed that EICs, transformational leadership style, social capital, and HRM practices positively affect performance. EICs are the most important predictors of performance. Under high (low) value of and high (low) opportunities for knowledge sharing, the antecedents influencing performance are strengthened (attenuated or nullified). The value of and opportunities for knowledge sharing are quasi-moderators. These findings have significant implications for organizing effective work teams.

  14. Performance analysis and acceleration of explicit integration for large kinetic networks using batched GPU computations

    Shyles, Daniel [University of Tennessee (UT)]; Dongarra, Jack J. [University of Tennessee, Knoxville (UTK)]; Guidry, Mike W. [ORNL]; Tomov, Stanimire Z. [ORNL]; Billings, Jay Jay [ORNL]; Brock, Benjamin A. [ORNL]; Haidar Ahmad, Azzam A. [ORNL]

    2016-09-01

    We demonstrate the systematic implementation of recently developed fast explicit kinetic integration algorithms that efficiently solve N coupled ordinary differential equations (subject to initial conditions) on modern GPUs. We take representative test cases (Type Ia supernova explosions) and demonstrate two or more orders of magnitude increase in efficiency for solving such systems (of realistic thermonuclear networks coupled to fluid dynamics). This implies that important coupled, multiphysics problems in various scientific and technical disciplines that were intractable, or could be simulated only with highly schematic kinetic networks, are now computationally feasible. As examples of such applications we present the computational techniques developed for our ongoing deployment of these new methods on modern GPU accelerators. We show that, similarly to many other scientific applications, ranging from national security to medical advances, the computation can be split into many independent computational tasks, each of relatively small size. As the size of each individual task does not provide sufficient parallelism for the underlying hardware, especially for accelerators, these tasks must be computed concurrently as a single routine, which we call a batched routine, in order to saturate the hardware with enough work.

  15. Software performance and scalability a quantitative approach

    Liu, Henry H

    2009-01-01

    Praise from the Reviewers: "The practicality of the subject in a real-world situation distinguishes this book from others available on the market." - Professor Behrouz Far, University of Calgary. "This book could replace the computer organization texts now in use that every CS and CpE student must take. . . . It is much needed, well written, and thoughtful." - Professor Larry Bernstein, Stevens Institute of Technology. A distinctive, educational text on software performance and scalability. This is the first book to take a quantitative approach to the subject of software performance and scalability

  16. Ramses-GPU: Second order MUSCL-Handcock finite volume fluid solver

    Kestener, Pierre

    2017-10-01

    RamsesGPU is a reimplementation of RAMSES (ascl:1011.007) which drops the adaptive mesh refinement (AMR) features to optimize 3D uniform grid algorithms for modern graphics processor units (GPU) to provide an efficient software package for astrophysics applications that do not need AMR features but do require a very large number of integration time steps. RamsesGPU provides a very efficient C++/CUDA/MPI software implementation of a second order MUSCL-Handcock finite volume fluid solver for compressible hydrodynamics as well as a magnetohydrodynamics solver based on the constraint transport technique. Other useful modules include static gravity, dissipative terms (viscosity, resistivity), and a forcing source term for turbulence studies. Special care was taken to enhance parallel input/output performance by using state-of-the-art libraries such as HDF5 and parallel-netcdf.

  17. Performance Engineering Technology for Scientific Component Software

    Malony, Allen D.

    2007-05-08

    Large-scale, complex scientific applications are beginning to benefit from the use of component software design methodology and technology for software development. Integral to the success of component-based applications is the ability to achieve high-performing code solutions through the use of performance engineering tools for both intra-component and inter-component analysis and optimization. Our work on this project aimed to develop performance engineering technology for scientific component software in association with the DOE CCTTSS SciDAC project (active during the contract period) and the broader Common Component Architecture (CCA) community. Our specific implementation objectives were to extend the TAU performance system and Program Database Toolkit (PDT) to support performance instrumentation, measurement, and analysis of CCA components and frameworks, and to develop performance measurement and monitoring infrastructure that could be integrated in CCA applications. These objectives have been met in the completion of all project milestones and in the transfer of the technology into the continuing CCA activities as part of the DOE TASCS SciDAC2 effort. In addition to these achievements, over the past three years, we have been an active member of the CCA Forum, attending all meetings and serving in several working groups, such as the CCA Toolkit working group, the CQoS working group, and the Tutorial working group. We have contributed significantly to CCA tutorials since SC'04, hosted two CCA meetings, participated in the annual ACTS workshops, and were co-authors on the recent CCA journal paper [24]. There are four main areas where our project has delivered results: component performance instrumentation and measurement, component performance modeling and optimization, performance database and data mining, and online performance monitoring. This final report outlines the achievements in these areas for the entire project period. The submitted progress

  18. The COMPASS Tokamak Plasma Control Software Performance

    Valcarcel, Daniel F.; Neto, André; Carvalho, Ivo S.; Carvalho, Bernardo B.; Fernandes, Horácio; Sousa, Jorge; Janky, Filip; Havlicek, Josef; Beno, Radek; Horacek, Jan; Hron, Martin; Panek, Radomir

    2011-08-01

    The COMPASS tokamak began operation at the IPP Prague in December 2008. A new control system has been built using an ATCA-based real-time system developed at IST Lisbon. The control software is implemented on top of the MARTe real-time framework, attaining control cycles as short as 50 μs with a jitter of less than 1 μs. The controlled parameters, important for the plasma performance, are the plasma current, position of the plasma current center, boundary shape and horizontal and vertical velocities. These are divided into two control cycles: slow at 500 μs and fast at 50 μs. The project has two phases. First, the software implements a digital controller, similar to the analog one used during the COMPASS-D operation in Culham. In the slow cycle, the plasma current and position are measured and controlled with PID and feedforward controllers, respectively; the shaping magnetic field is preprogrammed. The vertical instability and horizontal equilibrium are controlled with the faster 50-μs cycle PID controllers. The second phase will implement a plasma-shape reconstruction algorithm and controller, aiming at optimized plasma performance. The system was designed to be as modular as possible by breaking the functional requirements of the control system into several independent and specialized modules. This splitting made it possible to tune the execution of each system part and to use the modules in a variety of applications with different time constraints. This paper presents the design and overall performance of the COMPASS control software.

  19. GPU-accelerated adjoint algorithmic differentiation

    Gremse, Felix; Höfter, Andreas; Razik, Lukas; Kiessling, Fabian; Naumann, Uwe

    2016-03-01

    Many scientific problems such as classifier training or medical image reconstruction can be expressed as minimization of differentiable real-valued cost functions and solved with iterative gradient-based methods. Adjoint algorithmic differentiation (AAD) enables automated computation of gradients of such cost functions implemented as computer programs. To backpropagate adjoint derivatives, excessive memory is potentially required to store the intermediate partial derivatives on a dedicated data structure, referred to as the "tape". Parallelization is difficult because threads need to synchronize their accesses during taping and backpropagation. This situation is aggravated for many-core architectures, such as Graphics Processing Units (GPUs), because of the large number of light-weight threads and the limited memory size in general as well as per thread. We show how these limitations can be mediated if the cost function is expressed using GPU-accelerated vector and matrix operations which are recognized as intrinsic functions by our AAD software. We compare this approach with naive and vectorized implementations for CPUs. We use four increasingly complex cost functions to evaluate the performance with respect to memory consumption and gradient computation times. Using vectorization, CPU and GPU memory consumption could be substantially reduced compared to the naive reference implementation, in some cases even by an order of complexity. The vectorization allowed usage of optimized parallel libraries during forward and reverse passes which resulted in high speedups for the vectorized CPU version compared to the naive reference implementation. The GPU version achieved an additional speedup of 7.5 ± 4.4, showing that the processing power of GPUs can be utilized for AAD using this concept. Furthermore, we show how this software can be systematically extended for more complex problems such as nonlinear absorption reconstruction for fluorescence-mediated tomography.
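
    To make the role of the tape concrete, here is a minimal scalar reverse-mode sketch in Python. It is an illustration under stated assumptions, not the authors' software: the class names Tape and Var and the helper functions sin and gradient are hypothetical, and only addition, multiplication and sine are recorded. Every operation appends its partial derivatives to the tape, which is why a naive scalar tape grows with the number of operations.

```python
import math


class Tape:
    """Stores, for every recorded variable, a list of (parent index, partial)."""

    def __init__(self):
        self.entries = []

    def new_var(self, value, parents=()):
        self.entries.append(list(parents))
        return Var(self, len(self.entries) - 1, value)


class Var:
    """A value together with its position on the tape."""

    def __init__(self, tape, index, value):
        self.tape, self.index, self.value = tape, index, value

    def __add__(self, other):
        return self.tape.new_var(
            self.value + other.value,
            [(self.index, 1.0), (other.index, 1.0)],
        )

    def __mul__(self, other):
        return self.tape.new_var(
            self.value * other.value,
            [(self.index, other.value), (other.index, self.value)],
        )


def sin(x):
    """Record sin(x) with its partial derivative cos(x)."""
    return x.tape.new_var(math.sin(x.value), [(x.index, math.cos(x.value))])


def gradient(output):
    """Backpropagate adjoints by walking the tape in reverse order."""
    adj = [0.0] * len(output.tape.entries)
    adj[output.index] = 1.0
    for i in range(output.index, -1, -1):
        for parent, partial in output.tape.entries[i]:
            adj[parent] += partial * adj[i]
    return adj
```

    For f(x, y) = x*y + sin(x), recording f = x * y + sin(x) and calling gradient(f) yields the adjoints y + cos(x) at x.index and x at y.index. Treating whole vector or matrix operations as single tape entries, as the abstract describes, would replace many scalar entries with one.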

  20. Acceleration of PIC simulation with GPU

    Suzuki, Junya; Shimazu, Hironori; Fukazawa, Keiichiro; Den, Mitsue

    2011-01-01

    Particle-in-cell (PIC) is a simulation technique for plasma physics. The large number of particles in high-resolution plasma simulation increases the volume of computation required, making it vital to increase computation speed. In this study, we attempt to accelerate computation on graphics processing units (GPUs) using KEMPO, a PIC simulation code package. We perform two benchmark tests, with small and large grid sizes. In these tests, we run the KEMPO1 code using a CPU only, both a CPU and a GPU, and a GPU only. The results showed that performance using only a GPU was twice that of using a CPU alone. The execution time using both a CPU and a GPU, however, was comparable to that of the tests with a CPU alone, because of the significant bottleneck in communication between the CPU and GPU. (author)

  1. Physical modeling and high-performance GPU computing for characterization, interception, and disruption of hazardous near-Earth objects

    Kaplinger, Brian Douglas

    For the past few decades, both the scientific community and the general public have been becoming more aware that the Earth lives in a shooting gallery of small objects. We classify all of these asteroids and comets, known or unknown, that cross Earth's orbit as near-Earth objects (NEOs). A look at our geologic history tells us that NEOs have collided with Earth in the past, and we expect that they will continue to do so. With thousands of known NEOs crossing the orbit of Earth, there has been significant scientific interest in developing the capability to deflect an NEO from an impacting trajectory. This thesis applies the ideas of Smoothed Particle Hydrodynamics (SPH) theory to the NEO disruption problem. A simulation package was designed that allows efficacy simulation to be integrated into the mission planning and design process. This is done by applying ideas in high-performance computing (HPC) on the computer graphics processing unit (GPU). Rather than prove a concept through large standalone simulations on a supercomputer, a highly parallel structure allows for flexible, target dependent questions to be resolved. Built around nonclassified data and analysis, this computer package will allow academic institutions to better tackle the issue of NEO mitigation effectiveness.

  2. R-GPU : A reconfigurable GPU architecture

    van den Braak, G.J.; Corporaal, H.

    2016-01-01

    Over the last decade, Graphics Processing Unit (GPU) architectures have evolved from a fixed-function graphics pipeline to a programmable, energy-efficient compute accelerator for massively parallel applications. The compute power arises from the GPU's Single Instruction/Multiple Threads

  3. TH-A-19A-04: Latent Uncertainties and Performance of a GPU-Implemented Pre-Calculated Track Monte Carlo Method

    Renaud, M; Seuntjens, J; Roberge, D

    2014-01-01

    Purpose: Assessing the performance and uncertainty of a pre-calculated Monte Carlo (PMC) algorithm for proton and electron transport running on graphics processing units (GPU). While PMC methods have been described in the past, an explicit quantification of the latent uncertainty arising from recycling a limited number of tracks in the pre-generated track bank is missing from the literature. With a proper uncertainty analysis, an optimal pre-generated track bank size can be selected for a desired dose calculation uncertainty. Methods: Particle tracks were pre-generated for electrons and protons using EGSnrc and GEANT4, respectively. The PMC algorithm for track transport was implemented on the CUDA programming framework. GPU-PMC dose distributions were compared to benchmark dose distributions simulated using general-purpose MC codes in the same conditions. A latent uncertainty analysis was performed by comparing GPU-PMC dose values to a “ground truth” benchmark while varying the track bank size and primary particle histories. Results: GPU-PMC dose distributions and benchmark doses were within 1% of each other in voxels with dose greater than 50% of Dmax. In proton calculations, a submillimeter distance-to-agreement error was observed at the Bragg Peak. Latent uncertainty followed a Poisson distribution with the number of tracks per energy (TPE) and a track bank of 20,000 TPE produced a latent uncertainty of approximately 1%. Efficiency analysis showed a 937× and 508× gain over a single processor core running DOSXYZnrc for 16 MeV electrons in water and bone, respectively. Conclusion: The GPU-PMC method can calculate dose distributions for electrons and protons to a statistical uncertainty below 1%. The track bank size necessary to achieve an optimal efficiency can be tuned based on the desired uncertainty. Coupled with a model to calculate dose contributions from uncharged particles, GPU-PMC is a candidate for inverse planning of modulated electron radiotherapy

  4. Distributed GPU Computing in GIScience

    Jiang, Y.; Yang, C.; Huang, Q.; Li, J.; Sun, M.

    2013-12-01

    Geoscientists strive to discover principles and patterns hidden inside ever-growing Big Data for scientific discoveries. To better achieve this objective, more capable computing resources are required to process, analyze and visualize Big Data (Ferreira et al., 2003; Li et al., 2013). Current CPU-based computing techniques cannot promptly meet the computing challenges caused by the increasing number of datasets from different domains, such as social media, earth observation, and environmental sensing (Li et al., 2013). Meanwhile, CPU-based computing resources structured as clusters or supercomputers are costly. In the past several years, as GPU-based technology has matured in both capability and performance, GPU-based computing has emerged as a new computing paradigm. Compared to the traditional microprocessor, the modern GPU, as a compelling alternative microprocessor, offers outstanding parallel processing capability with cost-effectiveness and efficiency (Owens et al., 2008), although it was initially designed for graphical rendering in the visualization pipeline. This presentation reports a distributed GPU computing framework for integrating GPU-based computing within a distributed environment. Within this framework, 1) for each single computer, both GPU-based and CPU-based computing resources can be fully utilized to improve the performance of visualizing and processing Big Data; 2) within a network environment, a variety of computers can be used to build up a virtual supercomputer to support CPU-based and GPU-based computing in a distributed computing environment; 3) GPUs, as graphics-targeted devices, are used to greatly improve the rendering efficiency in distributed geo-visualization, especially for 3D/4D visualization. Key words: Geovisualization, GIScience, Spatiotemporal Studies. Reference: 1. Ferreira de Oliveira, M. C., & Levkowitz, H. (2003). From visual data exploration to visual data mining: A survey. Visualization and Computer Graphics, IEEE

  5. GPU Accelerated Chemical Similarity Calculation for Compound Library Comparison

    Ma, Chao; Wang, Lirong; Xie, Xiang-Qun

    2012-01-01

    Chemical similarity calculation plays an important role in compound library design, virtual screening, and “lead” optimization. In this manuscript, we present a novel GPU-accelerated algorithm for all-vs-all Tanimoto matrix calculation and nearest neighbor search. By taking advantage of multi-core GPU architecture and CUDA parallel programming technology, the algorithm is up to 39 times faster than the existing commercial software that runs on CPUs. Because it uses intrinsic GPU instructions, this approach is nearly 10 times faster than the existing GPU-accelerated sparse vector algorithm when Unity fingerprints are used for Tanimoto calculation. The GPU program that implements this new method takes about 20 minutes to complete the calculation of Tanimoto coefficients between 32M PubChem compounds and 10K Active Probes compounds, i.e., 324G Tanimoto coefficients, on a 128-CUDA-core GPU. PMID:21692447
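
    The quantity being computed can be sketched in a few lines of Python. This is an illustrative CPU version, assuming fingerprints are stored as Python integers used as bit sets; the function names are hypothetical and nothing here reproduces the authors' CUDA kernels. The Tanimoto coefficient of two fingerprints is the number of bits set in both divided by the number of bits set in either, so it reduces to two population counts, which is the kind of operation a GPU population-count instruction can accelerate.

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def tanimoto(a, b):
    """Tanimoto coefficient of two bit fingerprints stored as integers."""
    union = popcount(a | b)
    if union == 0:
        return 1.0  # convention assumed here for two empty fingerprints
    return popcount(a & b) / union


def all_vs_all(queries, library):
    """Full similarity matrix: one row per query, one column per library entry."""
    return [[tanimoto(q, m) for m in library] for q in queries]


def nearest_neighbors(queries, library):
    """Index of the most similar library entry for each query."""
    result = []
    for row in all_vs_all(queries, library):
        result.append(max(range(len(row)), key=row.__getitem__))
    return result
```

    For example, tanimoto(0b1100, 0b1010) shares one bit out of three set in total, giving 1/3. Every entry of the matrix is independent of the others, which is what makes the all-vs-all calculation a natural fit for massively parallel hardware.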

  6. Computing treewidth on the GPU

    Van Der Zanden, Tom C.; Bodlaender, Hans L.

    2018-01-01

    We present a parallel algorithm for computing the treewidth of a graph on a GPU. We implement this algorithm in OpenCL, and experimentally evaluate its performance. Our algorithm is based on an O*(2^n)-time algorithm that explores the elimination orderings of the graph using a Held-Karp like dynamic

  7. Computing treewidth on the GPU

    van der Zanden, T.C.; Bodlaender, Hans L.

    2017-01-01

    We present a parallel algorithm for computing the treewidth of a graph on a GPU. We implement this algorithm in OpenCL, and experimentally evaluate its performance. Our algorithm is based on an $O^*(2^{n})$-time algorithm that explores the elimination orderings of the graph using a Held-Karp like
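
    A subset dynamic program of the Held-Karp kind can be sketched sequentially in Python. The recurrence used below is the standard one over sets of already eliminated vertices and is an assumption on our part; the abstract is truncated and the paper's GPU formulation may differ. The names q_value and treewidth are hypothetical, and the graph is given as an adjacency list indexed from 0.

```python
def q_value(adj, subset, v):
    """Count vertices outside subset and different from v that are reachable
    from v by a path whose internal vertices all lie in subset (a bitmask)."""
    seen = 1 << v
    stack = [v]
    count = 0
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if (seen >> w) & 1:
                continue
            seen |= 1 << w
            if (subset >> w) & 1:
                stack.append(w)  # keep walking inside the eliminated set
            else:
                count += 1       # boundary vertex: counted, not expanded
    return count


def treewidth(adj):
    """Treewidth via dynamic programming over all 2^n vertex subsets.

    tw[S] is the smallest width achievable by any ordering that eliminates
    exactly the vertices of S first.
    """
    n = len(adj)
    tw = [0] * (1 << n)
    for subset in range(1, 1 << n):
        best = n
        for v in range(n):
            if (subset >> v) & 1:
                rest = subset & ~(1 << v)
                best = min(best, max(tw[rest], q_value(adj, rest, v)))
        tw[subset] = best
    return tw[(1 << n) - 1]
```

    On the 4-cycle [[1, 3], [0, 2], [1, 3], [0, 2]] this returns 2. The table has 2^n entries, so the sketch is only practical for small graphs; all subsets of the same size are independent of one another, which is the parallelism a GPU implementation can exploit.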

  8. Towards a Theory of Affect and Software Developers' Performance

    Graziotin, Daniel

    2016-01-01

    For more than thirty years, it has been claimed that a way to improve software developers' productivity and software quality is to focus on people. The underlying assumption seems to be that "happy and satisfied software developers perform better". More specifically, affects-emotions and moods-have an impact on cognitive activities and the working performance of individuals. Development tasks are undertaken heavily through cognitive processes, yet software engineering research (SE) lacks theo...

  9. High Performance Multi-GPU SpMV for Multi-component PDE-Based Applications

    Abdelfattah, Ahmad; Ltaief, Hatem; Keyes, David E.

    2015-01-01

    -block structure. While these optimizations are important for high performance dense kernel executions, they are even more critical when dealing with sparse linear algebra operations. The most time-consuming phase of many multicomponent applications, such as models

  10. Parallelized computation for computer simulation of electrocardiograms using personal computers with multi-core CPU and general-purpose GPU.

    Shen, Wenfeng; Wei, Daming; Xu, Weimin; Zhu, Xin; Yuan, Shizhong

    2010-10-01

    Biological computations like electrocardiological modelling and simulation usually require high-performance computing environments. This paper introduces an implementation of parallel computation for computer simulation of electrocardiograms (ECGs) in a personal computer environment with an Intel Core 2 Quad Q6600 CPU and a GeForce 8800GT GPU, with software support by OpenMP and CUDA. It was tested in three parallelization device setups: (a) a four-core CPU without a general-purpose GPU, (b) a general-purpose GPU plus 1 core of CPU, and (c) a four-core CPU plus a general-purpose GPU. To effectively take advantage of a multi-core CPU and a general-purpose GPU, an algorithm based on load-prediction dynamic scheduling was developed and applied to setting (c). In the simulation with 1600 time steps, the speedup of the parallel computation as compared to the serial computation was 3.9 in setting (a), 16.8 in setting (b), and 20.0 in setting (c). This study demonstrates that a current PC with a multi-core CPU and a general-purpose GPU provides a good environment for parallel computations in biological modelling and simulation studies. Copyright 2010 Elsevier Ireland Ltd. All rights reserved.

  11. GPU Lossless Hyperspectral Data Compression System for Space Applications

    Keymeulen, Didier; Aranki, Nazeeh; Hopson, Ben; Kiely, Aaron; Klimesh, Matthew; Benkrid, Khaled

    2012-01-01

    On-board lossless hyperspectral data compression reduces data volume in order to meet NASA and DoD limited downlink capabilities. At JPL, a novel, adaptive and predictive technique for lossless compression of hyperspectral data, named the Fast Lossless (FL) algorithm, was recently developed. This technique uses an adaptive filtering method and achieves state-of-the-art performance in both compression effectiveness and low complexity. Because of its outstanding performance and suitability for real-time onboard hardware implementation, the FL compressor is being formalized as the emerging CCSDS Standard for Lossless Multispectral & Hyperspectral image compression. The FL compressor is well-suited for parallel hardware implementation. A GPU hardware implementation was developed for FL targeting the current state-of-the-art GPUs from NVIDIA. The GPU implementation on an NVIDIA GeForce GTX 580 achieves a throughput performance of 583.08 Mbits/sec (44.85 MSamples/sec) and an acceleration of at least 6 times a software implementation running on a 3.47 GHz single core Intel Xeon processor. This paper describes the design and implementation of the FL algorithm on the GPU. The massively parallel implementation will provide in the future a fast and practical real-time solution for airborne and space applications.

  12. Performance evaluation of H.264/AVC decoding and visualization using the GPU

    Pieters, Bart; Van Rijsselbergen, Dieter; De Neve, Wesley; Van de Walle, Rik

    2007-01-01

    The coding efficiency of the H.264/AVC standard makes the decoding process computationally demanding. This has limited the availability of cost-effective, high-performance solutions. Modern computers are typically equipped with powerful yet cost-effective Graphics Processing Units (GPUs) to accelerate graphics operations. These GPUs can be addressed by means of a 3-D graphics API such as Microsoft Direct3D or OpenGL, using programmable shaders as generic processing units for vector data. The ...

  13. GPU computing and applications

    See, Simon

    2015-01-01

    This book presents a collection of state of the art research on GPU Computing and Application. The major part of this book is selected from the work presented at the 2013 Symposium on GPU Computing and Applications held in Nanyang Technological University, Singapore (Oct 9, 2013). Three major domains of GPU application are covered in the book including (1) Engineering design and simulation; (2) Biomedical Sciences; and (3) Interactive & Digital Media. The book also addresses the fundamental issues in GPU computing with a focus on big data processing. Researchers and developers in GPU Computing and Applications will benefit from this book. Training professionals and educators can also benefit from this book to learn the possible application of GPU technology in various areas.

  14. High Performance Multi-GPU SpMV for Multi-component PDE-Based Applications

    Abdelfattah, Ahmad

    2015-07-25

    Leveraging optimization techniques (e.g., register blocking and double buffering) introduced in the context of KBLAS, a Level 2 BLAS high performance library on GPUs, the authors implement dense matrix-vector multiplications within a sparse-block structure. While these optimizations are important for high performance dense kernel executions, they are even more critical when dealing with sparse linear algebra operations. The most time-consuming phase of many multicomponent applications, such as models of reacting flows or petroleum reservoirs, is the solution at each implicit time step of large, sparse spatially structured or unstructured linear systems. The standard method is a preconditioned Krylov solver. The Sparse Matrix-Vector multiplication (SpMV) is, in turn, one of the most time-consuming operations in such solvers. Because there is no data reuse of the elements of the matrix within a single SpMV, kernel performance is limited by the speed at which data can be transferred from memory to registers, making the bus bandwidth the major bottleneck. On the other hand, in case of a multi-species model, the resulting Jacobian has a dense block structure. For contemporary petroleum reservoir simulations, the block size typically ranges from three to a few dozen among different models, and still larger blocks are relevant within adaptively model-refined regions of the domain, though generally the size of the blocks, related to the number of conserved species, is constant over large regions within a given model. This structure can be exploited beyond the convenience of a block compressed row data format, because it offers opportunities to hide the data motion with useful computations. The new SpMV kernel outperforms existing state-of-the-art implementations on single and multi-GPUs using matrices with dense block structure representative of porous media applications with both structured and unstructured multi-component grids.
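
    The block structure described here can be illustrated with a plain Python SpMV over a block compressed row layout. This is a sequential sketch under our own assumptions (the array names row_ptr, col_idx and blocks and the function name bsr_spmv are hypothetical), not the KBLAS-based kernel: it only shows that each stored block contributes a small dense matrix-vector product, which is the unit of work the authors optimize.

```python
def bsr_spmv(row_ptr, col_idx, blocks, x, b):
    """Compute y = A * x for a matrix stored in block compressed row form.

    row_ptr[i]:row_ptr[i + 1] indexes the blocks of block-row i,
    col_idx[k] is the block-column of block k, and blocks[k] is a dense
    b-by-b block given as a list of rows.
    """
    n_block_rows = len(row_ptr) - 1
    y = [0.0] * (n_block_rows * b)
    for i in range(n_block_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            block = blocks[k]
            # Dense b-by-b matrix-vector product on one block.
            for r in range(b):
                acc = 0.0
                for c in range(b):
                    acc += block[r][c] * x[j * b + c]
                y[i * b + r] += acc
    return y
```

    Compared with a scalar compressed row format, one column index is stored per block rather than per nonzero, and the b entries of x needed by a block are reused across its b rows, so less index data moves over the memory bus per floating-point operation.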

  15. Study on the BES Ⅲ offline software performance

    Zhang Xiaomei; Sun Gongxing

    2011-01-01

    Performance monitoring and analysis of the BESⅢ offline software system is very useful for software optimization and for improving CPU and memory usage. The paper presents a feasible performance monitoring service based on GAUDI and reports performance tests and analysis of the BESⅢ simulation and reconstruction carried out with this service. (authors)

  16. Performance Analysis of FEM Algorithms on GPU and Many-Core Architectures

    Khurram, Rooh

    2015-04-27

    The roadmaps of the leading supercomputer manufacturers are based on hybrid systems, which consist of a mix of conventional processors and accelerators. This trend is mainly due to the fact that the power consumption cost of future CPU-only Exascale systems will be unsustainable; thus accelerators such as graphic processing units (GPUs) and many-integrated-core (MIC) processors will likely be an integral part of the TOP500 (http://www.top500.org/) supercomputers beyond 2020. The emerging supercomputer architecture will bring new challenges for code developers. Continuum mechanics codes will be particularly affected, because the traditional synchronous implicit solvers will probably not scale on hybrid Exascale machines. In the previous study [1], we reported on the performance of a conjugate gradient based mesh motion algorithm [2] on Sandy Bridge, Xeon Phi, and K20c. In the present study we report on a comparative study of finite element codes, using PETSC and AmgX solvers on CPU and GPUs, respectively [3,4]. We believe this study will be a good starting point for FEM code developers who are contemplating a CPU to accelerator transition.

  17. RGCA: A Reliable GPU Cluster Architecture for Large-Scale Internet of Things Computing Based on Effective Performance-Energy Optimization.

    Fang, Yuling; Chen, Qingkui; Xiong, Neal N; Zhao, Deyu; Wang, Jingjuan

    2017-08-04

    This paper aims to develop a low-cost, high-performance and high-reliability computing system to process large-scale data using common data mining algorithms in the Internet of Things (IoT) computing environment. Considering the characteristics of IoT data processing, similar to mainstream high performance computing, we use a GPU (Graphics Processing Unit) cluster to achieve better IoT services. Firstly, we present an energy consumption calculation method (ECCM) based on WSNs. Then, using the CUDA (Compute Unified Device Architecture) programming model, we propose a Two-level Parallel Optimization Model (TLPOM) which exploits reasonable resource planning and common compiler optimization techniques to obtain the best blocks and threads configuration considering the resource constraints of each node. The key to this part is dynamically coupling Thread-Level Parallelism (TLP) and Instruction-Level Parallelism (ILP) to improve the performance of the algorithms without additional energy consumption. Finally, combining the ECCM and the TLPOM, we use the Reliable GPU Cluster Architecture (RGCA) to obtain a high-reliability computing system considering the nodes' diversity, algorithm characteristics, etc. The results show that the performance of the algorithms increased significantly, by 34.1%, 33.96% and 24.07% on average for Fermi, Kepler and Maxwell, respectively, with TLPOM, and that the RGCA ensures that our IoT computing system provides low-cost and high-reliability services.

  18. Performance Evaluation of Software Routers with VPN Features

    H. Redžović

    2017-11-01

    This paper presents the implementation and analysis of a VPN software router based on the Quagga and strongSwan open-source software tools. We validated the functionalities of strongSwan and Quagga in a realistic environment that includes scenarios with link failures. We also measured and analyzed the performance of the encryption and hash algorithms supported by strongSwan, in order to recommend an optimal VPN configuration that provides the best performance.

  19. Application of GPU to computational multiphase fluid dynamics

    Nagatake, T; Kunugi, T

    2010-01-01

    The MARS (Multi-interfaces Advection and Reconstruction Solver) [1] is one of the surface volume tracking methods for multi-phase flows. Nowadays, the performance of the GPU (Graphics Processing Unit) is much higher than that of the CPU (Central Processing Unit). In this study, the GPU was applied to the MARS in order to accelerate the computation of multi-phase flows (GPU-MARS), and the performance of the GPU-MARS was discussed. For the interface tracking method applied to a one-directional advection problem, computation on the GPU (a single GTX280) was around 4 times faster than on the CPU (Xeon 5040, parallelized over 4 threads). For the Poisson solver using the algorithm developed in this study, the GPU was around 30 times faster than the CPU. Finally, it is confirmed that the GPU greatly accelerated the fluid flow computation (GPU-MARS) compared to the CPU. However, it is also found that the double-precision computation of the GPU must perform with very high precision.

  20. NMF-mGPU: non-negative matrix factorization on multi-GPU systems.

    Mejía-Roa, Edgardo; Tabas-Madrid, Daniel; Setoain, Javier; García, Carlos; Tirado, Francisco; Pascual-Montano, Alberto

    2015-02-13

    In the last few years, the Non-negative Matrix Factorization (NMF) technique has gained great interest among the Bioinformatics community, since it is able to extract interpretable parts from high-dimensional datasets. However, the computing time required to process large data matrices may become impractical, even for a parallel application running on a multiprocessor cluster. In this paper, we present NMF-mGPU, an efficient and easy-to-use implementation of the NMF algorithm that takes advantage of the high computing performance delivered by Graphics-Processing Units (GPUs). Driven by the ever-growing demands from the video-games industry, graphics cards usually provided in PCs and laptops have evolved from simple graphics-drawing platforms into high-performance programmable systems that can be used as coprocessors for linear-algebra operations. However, these devices may have a limited amount of on-board memory, which is not considered by other NMF implementations on GPU. NMF-mGPU is based on CUDA (Compute Unified Device Architecture), NVIDIA's framework for GPU computing. On devices with low memory available, large input matrices are blockwise transferred from the system's main memory to the GPU's memory, and processed accordingly. In addition, NMF-mGPU has been explicitly optimized for the different CUDA architectures. Finally, platforms with multiple GPUs can be synchronized through MPI (Message Passing Interface). In a four-GPU system, this implementation is about 120 times faster than a single conventional processor, and more than four times faster than a single GPU device (i.e., a super-linear speedup). Applications of GPUs in Bioinformatics are getting more and more attention due to their outstanding performance when compared to traditional processors. In addition, their relatively low price represents a highly cost-effective alternative to conventional clusters. In life sciences, this results in an excellent opportunity to facilitate the

  1. The COMPASS Tokamak Plasma Control Software Performance

    Valcárcel, D.F.; Neto, A.; Carvalho, I.S.; Carvalho, B.B.; Fernandes, H.; Sousa, J.; Janky, F.; Havlíček, Josef; Beňo, R.; Horáček, Jan; Hron, Martin; Pánek, Radomír

    2011-01-01

    Vol. 58, No. 4 (2011), pp. 1490-1496. ISSN 0018-9499. [Real Time Conference, RT10/17th./. Lisboa, 24.05.2010-28.05.2010] R&D Projects: GA MŠk 7G09042; GA ČR GD202/08/H057. Institutional research plan: CEZ:AV0Z20430508. Keywords: Real-Time * ATCA * Data Acquisition * Plasma Control Software. Subject RIV: BL - Plasma and Gas Discharge Physics. Impact factor: 1.447, year: 2011. http://dx.doi.org/10.1109/TNS.2011.2143726

  2. The CMS software performance at the start of data taking

    Benelli, Gabriele

    2009-01-01

    The CMS software framework (CMSSW) is a complex project evolving very rapidly as the first LHC colliding beams approach. The computing requirements constrain performance in terms of CPU time, memory footprint and event size on disk to allow for planning and managing the computing infrastructure necessary to handle the needs of the experiment. A performance suite of tools has been developed to track all aspects of code performance, through the software release cycles, allowing for regression and guiding code development for optimization. In this talk, we describe the CMSSW performance suite tools used and present some sample performance results from the release integration process for the CMS software.

  3. Peregrine Software Toolchains | High-Performance Computing | NREL

    Group (PGI) C/C++ and Fortran (partially supported) The PGI Accelerator compilers include NVIDIA GPU support via the directive-based OpenACC 2.5 programming model, as well as full support for NVIDIA CUDA C

  4. Haptic Feedback for the GPU-based Surgical Simulator

    Sørensen, Thomas Sangild; Mosegaard, Jesper

    2006-01-01

    The GPU has proven to be a powerful processor to compute spring-mass based surgical simulations. It has not previously been shown however, how to effectively implement haptic interaction with a simulation running entirely on the GPU. This paper describes a method to calculate haptic feedback...... with limited performance cost. It allows easy balancing of the GPU workload between calculations of simulation, visualisation, and the haptic feedback....

  5. Asset management -- Integrated software optimizes production performance

    Polczer, S.

    1998-01-01

    Developments in data collection and retrieval systems to allow timely cost analysis, financial reporting and production management are discussed. One of the most important new OLAP (on-line analytical processing) products is Energy Warehouse which gathers field information from various sources, allows advanced searches, and generates reports previously unavailable in other conventional financial accounting systems. Another OLAP-based system, the Canadian Upstream Energy System (CUES), was developed by the Oracle Corporation and the Calgary-based Applied Terravision Systems (ATS) Inc. CUES combines Oracle's universal data server software development tools with ATS's upstream financial, land, geotechnical and production applications. ATS also developed a product called IDPMARS (Integrated Daily Production Management Accounting Reporting System). It interfaces with CUES to link working interests, government royalties, administration, facility charges, lifting costs, transportation tooling, and customers by integrating field data collection systems with financial accounting

  6. Asset management -- Integrated software optimizes production performance

    Polczer, S.

    1998-10-01

    Developments in data collection and retrieval systems to allow timely cost analysis, financial reporting and production management are discussed. One of the most important new OLAP (on-line analytical processing) products is Energy Warehouse which gathers field information from various sources, allows advanced searches, and generates reports previously unavailable in other conventional financial accounting systems. Another OLAP-based system, the Canadian Upstream Energy System (CUES), was developed by the Oracle Corporation and the Calgary-based Applied Terravision Systems (ATS) Inc. CUES combines Oracle's universal data server software development tools with ATS's upstream financial, land, geotechnical and production applications. ATS also developed a product called IDPMARS (Integrated Daily Production Management Accounting Reporting System). It interfaces with CUES to link working interests, government royalties, administration, facility charges, lifting costs, transportation tooling, and customers by integrating field data collection systems with financial accounting.

  7. Seismic Shot Processing on GPU

    Johansen, Owe

    2009-01-01

    Today's petroleum industry demands an ever-increasing amount of computational resources. Seismic processing applications in use by these types of companies have generally been using large clusters of compute nodes, whose only computing resource has been the CPU. However, using Graphics Processing Units (GPU) for general purpose programming is these days becoming increasingly popular in the high performance computing area. In 2007, NVIDIA corporation launched their framework for develo...

  8. Benchmarking and Evaluating Unified Memory for OpenMP GPU Offloading

    Mishra, Alok [Stony Brook Univ., Stony Brook, NY (United States); Li, Lingda [Brookhaven National Lab. (BNL), Upton, NY (United States); Kong, Martin [Brookhaven National Lab. (BNL), Upton, NY (United States); Finkel, Hal [Argonne National Lab. (ANL), Argonne, IL (United States); Chapman, Barbara [Stony Brook Univ., Stony Brook, NY (United States); Brookhaven National Lab. (BNL), Upton, NY (United States)

    2017-01-01

    Here, the latest OpenMP standard offers automatic device offloading capabilities which facilitate GPU programming. Despite this, there remain many challenges. One of these is the unified memory feature introduced in recent GPUs. GPUs in current and future HPC systems have enhanced support for unified memory space. In such systems, CPU and GPU can access each other's memory transparently, that is, the data movement is managed automatically by the underlying system software and hardware. Memory oversubscription is also possible in these systems. However, there is a significant lack of knowledge about how this mechanism will perform, and how programmers should use it. We have modified several benchmark codes in the Rodinia benchmark suite to study the behavior of OpenMP accelerator extensions and have used them to explore the impact of unified memory in an OpenMP context. We moreover modified the open source LLVM compiler to allow OpenMP programs to exploit unified memory. The results of our evaluation reveal that, while the performance of unified memory is comparable with that of normal GPU offloading for benchmarks with little data reuse, it suffers from significant overhead when GPU memory is oversubscribed for benchmarks with a large amount of data reuse. Based on these results, we provide several guidelines for programmers to achieve better performance with unified memory.

  9. CPU/GPU Computing for an Implicit Multi-Block Compressible Navier-Stokes Solver on Heterogeneous Platform

    Deng, Liang; Bai, Hanli; Wang, Fang; Xu, Qingxin

    2016-06-01

    CPU/GPU computing allows scientists to tremendously accelerate their numerical codes. In this paper, we port and optimize a double precision alternating direction implicit (ADI) solver for three-dimensional compressible Navier-Stokes equations from our in-house Computational Fluid Dynamics (CFD) software on a heterogeneous platform. First, we implement a full GPU version of the ADI solver to remove a lot of redundant data transfers between CPU and GPU, and then design two fine-grain schemes, namely “one-thread-one-point” and “one-thread-one-line”, to maximize the performance. Second, we present a dual-level parallelization scheme using the CPU/GPU collaborative model to exploit the computational resources of both multi-core CPUs and many-core GPUs within the heterogeneous platform. Finally, considering the fact that memory on a single node becomes inadequate when the simulation size grows, we present a tri-level hybrid programming pattern MPI-OpenMP-CUDA that merges fine-grain parallelism using OpenMP and CUDA threads with coarse-grain parallelism using MPI for inter-node communication. We also propose a strategy to overlap the computation with communication using the advanced features of CUDA and MPI programming. We obtain speedups of 6.0 for the ADI solver on one Tesla M2050 GPU in contrast to two Xeon X5670 CPUs. Scalability tests show that our implementation can offer significant performance improvement on a heterogeneous platform.
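
    As a rough illustration of the work a "one-thread-one-line" scheme could hand to each thread, the sketch below solves a single scalar tridiagonal system with the Thomas algorithm, the kind of system an ADI sweep typically produces along each grid line. It is a plain Python simplification under stated assumptions, not the authors' CUDA code; the function name and argument layout are hypothetical.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm.

    lower[i], diag[i], upper[i] are the sub-, main- and super-diagonal
    entries of row i; lower[0] and upper[-1] are ignored.
    Assumes the system is well conditioned (no pivoting is done).
    """
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    # Forward elimination along the line
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    # Back substitution
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x
```

    Each grid line yields an independent system of this form, which is why lines can be distributed over threads; the solve along a single line remains sequential.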

  10. GPU Accelerated Vector Median Filter

    Aras, Rifat; Shen, Yuzhong

    2011-01-01

    Noise reduction is an important step for most image processing tasks. For three channel color images, a widely used technique is the vector median filter, in which color values of pixels are treated as 3-component vectors. Vector median filters are computationally expensive; for a window size of n x n, each of the n^2 vectors has to be compared with the other n^2 - 1 vectors in distance. General purpose computation on graphics processing units (GPUs) is the paradigm of utilizing high-performance many-core GPU architectures for computation tasks that are normally handled by CPUs. In this work, NVIDIA's Compute Unified Device Architecture (CUDA) paradigm is used to accelerate vector median filtering, which has to the best of our knowledge never been done before. The performance of the GPU accelerated vector median filter is compared to that of the CPU and MPI-based versions for different image and window sizes. Initial findings of the study showed a 100x improvement in performance of the vector median filter implementation on GPUs over CPU implementations, and further speed-up is expected after more extensive optimizations of the GPU algorithm.
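
    The pairwise comparison described above can be sketched in a few lines of plain Python. This is a minimal CPU illustration, assuming Euclidean distance between color vectors and a window clipped at the image border (the abstract specifies neither); the function names are hypothetical and this is not the CUDA implementation from the record.

```python
import math


def vector_median(window):
    """Return the vector whose summed distance to all others is smallest."""
    best_pixel, best_cost = None, math.inf
    for p in window:
        # Assumption: Euclidean (L2) distance between color vectors
        cost = sum(math.dist(p, q) for q in window)
        if cost < best_cost:
            best_pixel, best_cost = p, cost
    return best_pixel


def vector_median_filter(image, n=3):
    """Apply an n x n vector median filter to a 2-D list of (r, g, b) tuples.

    Border pixels use only the part of the window inside the image.
    """
    h, w = len(image), len(image[0])
    r = n // 2
    out = [[None] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [image[j][i]
                      for j in range(max(0, y - r), min(h, y + r + 1))
                      for i in range(max(0, x - r), min(w, x + r + 1))]
            out[y][x] = vector_median(window)
    return out
```

    Every output pixel depends only on its own window, so the two outer loops are the natural place to parallelize, for instance one output pixel per GPU thread.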

  11. A proposal for performing software safety hazard analysis

    Lawrence, J.D.; Gallagher, J.M.

    1997-01-01

    Techniques for analyzing the safety and reliability of analog-based electronic protection systems that serve to mitigate hazards in process control systems have been developed over many years, and are reasonably understood. An example is the protection system in a nuclear power plant. The extension of these techniques to systems which include digital computers is not well developed, and there is little consensus among software engineering experts and safety experts on how to analyze such systems. One possible technique is to extend hazard analysis to include digital computer-based systems. Software is frequently overlooked during system hazard analyses, but this is unacceptable when the software is in control of a potentially hazardous operation. In such cases, hazard analysis should be extended to fully cover the software. A method for performing software hazard analysis is proposed in this paper. The method concentrates on finding hazards during the early stages of the software life cycle, using an extension of HAZOP

  12. Simulating spin models on GPU

    Weigel, Martin

    2011-09-01

    Over the last couple of years it has been realized that the vast computational power of graphics processing units (GPUs) could be harvested for purposes other than the video game industry. This power, which at least nominally exceeds that of current CPUs by large factors, results from the relative simplicity of the GPU architectures as compared to CPUs, combined with a large number of parallel processing units on a single chip. To benefit from this setup for general computing purposes, the problems at hand need to be prepared in a way to profit from the inherent parallelism and hierarchical structure of memory accesses. In this contribution I discuss the performance potential for simulating spin models, such as the Ising model, on GPU as compared to conventional simulations on CPU.
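
    One common way to expose the parallelism mentioned above is a checkerboard decomposition of the lattice, sketched here for the 2-D Ising model with Metropolis updates. This is a sequential Python illustration under stated assumptions (coupling J = 1, periodic boundaries, even lattice size), not the author's GPU code; the function name is hypothetical.

```python
import math
import random


def checkerboard_sweep(spins, beta, rng):
    """One Metropolis sweep of a square Ising lattice, updated in place.

    Sites are visited one colour at a time. Sites of the same colour share
    no bonds (for even lattice size), so each half-sweep could be executed
    in parallel, for example one site per GPU thread.
    """
    size = len(spins)
    for colour in (0, 1):
        for i in range(size):
            for j in range(size):
                if (i + j) % 2 != colour:
                    continue
                # Sum over the four nearest neighbours, periodic boundaries
                nn = (spins[(i + 1) % size][j] + spins[(i - 1) % size][j]
                      + spins[i][(j + 1) % size] + spins[i][(j - 1) % size])
                d_energy = 2 * spins[i][j] * nn  # energy change if flipped
                if d_energy <= 0 or rng.random() < math.exp(-beta * d_energy):
                    spins[i][j] = -spins[i][j]
```

    A call such as `checkerboard_sweep(spins, 0.44, random.Random(1))` performs one sweep; the random number generator is passed in explicitly so that runs are reproducible.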

  13. GPU-accelerated Lattice Boltzmann method for anatomical extraction in patient-specific computational hemodynamics

    Yu, H.; Wang, Z.; Zhang, C.; Chen, N.; Zhao, Y.; Sawchuk, A. P.; Dalsing, M. C.; Teague, S. D.; Cheng, Y.

    2014-11-01

    Existing research on patient-specific computational hemodynamics (PSCH) heavily relies on software for anatomical extraction of blood arteries. Data reconstruction and mesh generation have to be done using existing commercial software due to the gap between medical image processing and CFD, which increases the computational burden and introduces inaccuracy during data transformation, thus limiting the medical applications of PSCH. We use the lattice Boltzmann method (LBM) to solve the level-set equation over an Eulerian distance field and implicitly and dynamically segment the artery surfaces from radiological CT/MRI imaging data. The segments seamlessly feed into the LBM based CFD computation of PSCH, thus explicit mesh construction and extra data management are avoided. The LBM is ideally suited for GPU (graphic processing unit)-based parallel computing. The parallel acceleration over GPU achieves excellent performance in PSCH computation. An application study will be presented which segments an aortic artery from a chest CT dataset and models PSCH of the segmented artery.

  14. Length-Bounded Hybrid CPU/GPU Pattern Matching Algorithm for Deep Packet Inspection

    Yi-Shan Lin

    2017-01-01

    Since frequent communication between applications takes place in high speed networks, deep packet inspection (DPI) plays an important role in network application awareness. The signature-based network intrusion detection system (NIDS) contains a DPI technique that examines the incoming packet payloads by employing a pattern matching algorithm that dominates the overall inspection performance. Existing studies focused on implementing efficient pattern matching algorithms by parallel programming on software platforms because of the advantages of lower cost and higher scalability. Either the central processing unit (CPU) or the graphic processing unit (GPU) was involved. Our studies focused on designing a pattern matching algorithm based on the cooperation between both CPU and GPU. In this paper, we present an enhanced design for our previous work, a length-bounded hybrid CPU/GPU pattern matching algorithm (LHPMA). In the preliminary experiment, the performance and comparison with the previous work are displayed, and the experimental results show that the LHPMA can achieve not only effective CPU/GPU cooperation but also higher throughput than the previous method.

  15. Teaching Software Developers to Perform UX Tasks

    Øvad, Tina; Bornoe, Nis; Larsen, Lars Bo

    2015-01-01

    . This is done via an action research study where the developers were provided with material concerning a modified AB usability test, by training them in performing this type of work, and by using their feedback to improve the method and the material. The overall result of the study is positive and it is found...... that by using the developers' feedback in the modification process, the method has truly become applicable within an agile, industrial setting. In combination with a guideline and template this has induced the developers to feel confident in independently performing this type of work....

  16. A survey and measurement study of GPU DVFS on energy conservation

    Xinxin Mei

    2017-05-01

    Energy efficiency has become one of the top design criteria for current computing systems. Dynamic voltage and frequency scaling (DVFS) has been widely adopted by laptop computers, servers, and mobile devices to conserve energy, while GPU DVFS is still at an early stage. This paper aims at exploring the impact of GPU DVFS on application performance and power consumption, and furthermore, on energy conservation. We survey the state-of-the-art GPU DVFS characterizations, and then summarize recent research works on GPU power and performance models. We also conduct real GPU DVFS experiments on NVIDIA Fermi and Maxwell GPUs. According to our experimental results, GPU DVFS has significant potential for energy saving. The effect of scaling core voltage/frequency and memory voltage/frequency depends not only on the GPU architectures, but also on the characteristics of GPU applications.

  17. Validation of GPU based TomoTherapy dose calculation engine.

    Chen, Quan; Lu, Weiguo; Chen, Yu; Chen, Mingli; Henderson, Douglas; Sterpin, Edmond

    2012-04-01

    The graphic processing unit (GPU) based TomoTherapy convolution/superposition(C/S) dose engine (GPU dose engine) achieves a dramatic performance improvement over the traditional CPU-cluster based TomoTherapy dose engine (CPU dose engine). Besides the architecture difference between the GPU and CPU, there are several algorithm changes from the CPU dose engine to the GPU dose engine. These changes made the GPU dose slightly different from the CPU-cluster dose. In order for the commercial release of the GPU dose engine, its accuracy has to be validated. Thirty eight TomoTherapy phantom plans and 19 patient plans were calculated with both dose engines to evaluate the equivalency between the two dose engines. Gamma indices (Γ) were used for the equivalency evaluation. The GPU dose was further verified with the absolute point dose measurement with ion chamber and film measurements for phantom plans. Monte Carlo calculation was used as a reference for both dose engines in the accuracy evaluation in heterogeneous phantom and actual patients. The GPU dose engine showed excellent agreement with the current CPU dose engine. The majority of cases had over 99.99% of voxels with Γ(1%, 1 mm) engine also showed similar degree of accuracy in heterogeneous media as the current TomoTherapy dose engine. It is verified and validated that the ultrafast TomoTherapy GPU dose engine can safely replace the existing TomoTherapy cluster based dose engine without degradation in dose accuracy.

  18. Building quality into performance and safety assessment software

    Wojciechowski, L.C.

    2011-01-01

    Quality assurance is integrated throughout the development lifecycle for performance and safety assessment software. The software used in the performance and safety assessment of a Canadian deep geological repository (DGR) follows the CSA quality assurance standard CSA-N286.7 [1], Quality Assurance of Analytical, Scientific and Design Computer Programs for Nuclear Power Plants. Quality assurance activities in this standard include tasks such as verification and inspection; however, much more is involved in producing a quality software computer program. The types of errors found with different verification methods are described. The integrated quality process ensures that defects are found and corrected as early as possible. (author)

  19. Semi-automatic tool to ease the creation and optimization of GPU programs

    Jepsen, Jacob

    2014-01-01

    We present a tool that reduces the development time of GPU-executable code. We implement a catalogue of common optimizations specific to the GPU architecture. Through the tool, the programmer can semi-automatically transform a computationally-intensive code section into GPU-executable form...... of the transformations can be performed automatically, which makes the tool usable for both novices and experts in GPU programming....

  20. Generating Billion-Edge Scale-Free Networks in Seconds: Performance Study of a Novel GPU-based Preferential Attachment Model

    Perumalla, Kalyan S. [Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Alam, Maksudul [Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)

    2017-10-01

    A novel parallel algorithm is presented for generating random scale-free networks using the preferential-attachment model. The algorithm, named cuPPA, is custom-designed for single instruction multiple data (SIMD) style of parallel processing supported by modern processors such as graphical processing units (GPUs). To the best of our knowledge, our algorithm is the first to exploit GPUs, and also the fastest implementation available today, to generate scale free networks using the preferential attachment model. A detailed performance study is presented to understand the scalability and runtime characteristics of the cuPPA algorithm. In one of the best cases, when executed on an NVidia GeForce 1080 GPU, cuPPA generates a scale free network of a billion edges in less than 2 seconds.
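
    For orientation, the preferential-attachment model itself can be written sequentially in a few lines. The sketch below is the textbook one-edge-per-node version in plain Python, not the cuPPA algorithm, whose parallel formulation is not described in this abstract; the function name and edge representation are hypothetical.

```python
import random


def preferential_attachment(num_nodes, rng):
    """Grow a graph node by node and return its edge list.

    Each new node attaches to one existing node chosen with probability
    proportional to that node's current degree. Requires num_nodes >= 2.
    """
    edges = [(1, 0)]
    # Every node appears in this list once per incident edge, so a uniform
    # draw from it is a degree-proportional draw over nodes.
    endpoints = [1, 0]
    for new in range(2, num_nodes):
        target = rng.choice(endpoints)
        edges.append((new, target))
        endpoints.extend((new, target))
    return edges
```

    The dependence of each draw on all earlier edges is what makes this model hard to parallelize directly.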

  1. High Performance Computing Software Applications for Space Situational Awareness

    Giuliano, C.; Schumacher, P.; Matson, C.; Chun, F.; Duncan, B.; Borelli, K.; Desonia, R.; Gusciora, G.; Roe, K.

    The High Performance Computing Software Applications Institute for Space Situational Awareness (HSAI-SSA) has completed its first full year of applications development. The emphasis of our work in this first year was on improving space surveillance sensor models and image enhancement software. These applications are the Space Surveillance Network Analysis Model (SSNAM), the Air Force Space Fence simulation (SimFence), and the physically constrained iterative de-convolution (PCID) image enhancement software tool. Specifically, we have demonstrated order of magnitude speed-up in those codes running on the latest Cray XD-1 Linux supercomputer (Hoku) at the Maui High Performance Computing Center. The software application improvements that HSAI-SSA has made have had a significant impact on the warfighter and have fundamentally changed the role of high performance computing in SSA.

  2. Radio-science performance analysis software

    Morabito, D. D.; Asmar, S. W.

    1995-02-01

    The Radio Science Systems Group (RSSG) provides various support functions for several flight project radio-science teams. Among these support functions are uplink and sequence planning, real-time operations monitoring and support, data validation, archiving and distribution functions, and data processing and analysis. This article describes the support functions that encompass radio-science data performance analysis. The primary tool used by the RSSG to fulfill this support function is the STBLTY program set. STBLTY is used to reconstruct observable frequencies and calculate model frequencies, frequency residuals, frequency stability in terms of Allan deviation, reconstructed phase, frequency and phase power spectral density, and frequency drift rates. In the case of one-way data, using an ultrastable oscillator (USO) as a frequency reference, the program set computes the spacecraft transmitted frequency and maintains a database containing the in-flight history of the USO measurements. The program set also produces graphical displays. Some examples and discussions on operating the program set on Galileo and Ulysses data will be presented.
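
    Of the quantities listed above, the Allan deviation has a compact definition that is easy to sketch. The Python function below computes the standard non-overlapping estimator from a series of fractional-frequency averages taken over consecutive, equal-length intervals; it is an illustration only and does not reproduce how STBLTY computes it. The function name is hypothetical.

```python
import math


def allan_deviation(y):
    """Non-overlapping Allan deviation of fractional-frequency averages.

    y[k] is the average fractional frequency over the k-th interval of
    length tau; the result is sigma_y(tau) for that same tau.
    """
    if len(y) < 2:
        raise ValueError("need at least two frequency averages")
    # Squared first differences of adjacent averages
    diffs = [(b - a) ** 2 for a, b in zip(y, y[1:])]
    return math.sqrt(sum(diffs) / (2 * len(diffs)))
```

    A perfectly stable oscillator gives zero; averaging the input over longer intervals before calling the function yields the deviation at larger tau.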

  3. The Ettention software package

    Dahmen, Tim; Marsalek, Lukas; Marniok, Nico; Turoňová, Beata; Bogachev, Sviatoslav; Trampert, Patrick; Nickels, Stefan; Slusallek, Philipp

    2016-01-01

    We present a novel software package for the problem “reconstruction from projections” in electron microscopy. The Ettention framework consists of a set of modular building-blocks for tomographic reconstruction algorithms. The well-known block iterative reconstruction method based on Kaczmarz algorithm is implemented using these building-blocks, including adaptations specific to electron tomography. Ettention simultaneously features (1) a modular, object-oriented software design, (2) optimized access to high-performance computing (HPC) platforms such as graphic processing units (GPU) or many-core architectures like Xeon Phi, and (3) accessibility to microscopy end-users via integration in the IMOD package and eTomo user interface. We also provide developers with a clean and well-structured application programming interface (API) that allows for extending the software easily and thus makes it an ideal platform for algorithmic research while hiding most of the technical details of high-performance computing. - Highlights: • Novel software package for “reconstruction from projections” in electron microscopy. • Support for high-resolution reconstructions on iterative reconstruction algorithms. • Support for CPU, GPU and Xeon Phi. • Integration in the IMOD software. • Platform for algorithm researchers: object oriented, modular design.
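
    The Kaczmarz iteration underlying the block iterative method mentioned above can be sketched as follows. This is the classic row-by-row form in plain Python for a small dense system, assuming a consistent system and a zero starting vector; it is not Ettention's block variant or its API, and the function name is hypothetical.

```python
def kaczmarz(A, b, sweeps=50):
    """Approximately solve A x = b by cyclic Kaczmarz projections.

    Each step projects the current estimate onto the hyperplane defined
    by one row of A (in tomography, one measured ray).
    """
    x = [0.0] * len(A[0])
    for _ in range(sweeps):
        for row, rhs in zip(A, b):
            norm2 = sum(a * a for a in row)
            if norm2 == 0.0:
                continue  # skip empty rows
            residual = rhs - sum(a * xi for a, xi in zip(row, x))
            step = residual / norm2
            x = [xi + step * a for xi, a in zip(x, row)]
    return x
```

    Block variants process several rows (for example all rays of one projection image) per update, which maps better onto GPU hardware than single-row updates.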

  4. The Ettention software package

    Dahmen, Tim, E-mail: Tim.Dahmen@dfki.de [German Research Center for Artificial Intelligence GmbH (DFKI), 66123 Saarbrücken (Germany); Saarland University, 66123 Saarbrücken (Germany); Marsalek, Lukas [Eyen SE, Na Nivách 1043/16, 141 00 Praha 4 (Czech Republic); Saarland University, 66123 Saarbrücken (Germany); Marniok, Nico [Saarland University, 66123 Saarbrücken (Germany); Turoňová, Beata [Saarland University, 66123 Saarbrücken (Germany); IMPRS-CS, Max-Planck Institute for Informatics, Campus E 1.4, 66123 Saarbrücken (Germany); Bogachev, Sviatoslav [Saarland University, 66123 Saarbrücken (Germany); Trampert, Patrick; Nickels, Stefan [German Research Center for Artificial Intelligence GmbH (DFKI), 66123 Saarbrücken (Germany); Slusallek, Philipp [German Research Center for Artificial Intelligence GmbH (DFKI), 66123 Saarbrücken (Germany); Saarland University, 66123 Saarbrücken (Germany)

    2016-02-15

    We present a novel software package for the problem “reconstruction from projections” in electron microscopy. The Ettention framework consists of a set of modular building-blocks for tomographic reconstruction algorithms. The well-known block iterative reconstruction method based on Kaczmarz algorithm is implemented using these building-blocks, including adaptations specific to electron tomography. Ettention simultaneously features (1) a modular, object-oriented software design, (2) optimized access to high-performance computing (HPC) platforms such as graphic processing units (GPU) or many-core architectures like Xeon Phi, and (3) accessibility to microscopy end-users via integration in the IMOD package and eTomo user interface. We also provide developers with a clean and well-structured application programming interface (API) that allows for extending the software easily and thus makes it an ideal platform for algorithmic research while hiding most of the technical details of high-performance computing. - Highlights: • Novel software package for “reconstruction from projections” in electron microscopy. • Support for high-resolution reconstructions on iterative reconstruction algorithms. • Support for CPU, GPU and Xeon Phi. • Integration in the IMOD software. • Platform for algorithm researchers: object oriented, modular design.

  5. Performance testing of 3D point cloud software

    Varela-González, M.; González-Jorge, H.; Riveiro, B.; Arias, P.

    2013-10-01

    LiDAR systems have been widely used in recent years for many applications in the engineering field: civil engineering, cultural heritage, mining, industry and environmental engineering. One of the most important limitations of this technology is the large computational requirements involved in data processing, especially for large mobile LiDAR datasets. Several software solutions for data management are available on the market, including open source suites; however, users often do not know of methodologies to verify their performance properly. In this work a methodology for LiDAR software performance testing is presented and four different suites are studied: QT Modeler, VR Mesh, AutoCAD 3D Civil and the Point Cloud Library running in software developed at the University of Vigo (SITEGI). The software based on the Point Cloud Library shows better results in the loading time of the point clouds and CPU usage. However, it is not as strong as commercial suites in working set and commit size tests.

  6. Performance comparison between ISCSI and other hardware and software solutions

    Gug, M

    2003-01-01

    We report on our investigations of some technologies that can be used to build disk servers and networks of disk servers using commodity hardware and software solutions. The report focuses on the performance that can be achieved by these systems and gives measured figures for different configurations. It is divided into two parts: iSCSI and other technologies, and hardware and software RAID solutions. The first part studies different technologies that can be used by clients to access disk servers using a gigabit ethernet network. It covers block access technologies (iSCSI, hyperSCSI, ENBD). Experimental figures are given for different numbers of clients and servers. The second part compares a system based on 3ware hardware RAID controllers, a system using Linux software RAID and IDE cards, and a system mixing both hardware RAID and software RAID. Performance measurements for reading and writing are given for different RAID levels.

  7. The impact of new accelerator control software on LEP performance

    Bailey, R.; Belk, A.; Collier, P.; Lamont, M.; Rigk, G. de; Tarrant, M.

    1993-01-01

    After the first year of running LEP, it became apparent that a new generation of application software would be required for efficient long term exploitation of the accelerator. In response to this need, a suite of accelerator control software has been developed, which is new both in style and functionality. During 1992 this software has been extensively used for driving LEP in many different operational modes, which include several different optics, polarisation runs at different energies and 8 bunch operation with Pretzels. The software has performed well and has undoubtedly enhanced the efficiency of accelerator operations. In particular the turnaround time has been significantly reduced, giving an increase of around 20% in the integrated luminosity for the year. Furthermore the software has made the accelerator accessible to less experienced operators. After outlining the development strategy, the overall functionality and performance of the software is discussed, with particular emphasis on improvements in operating efficiency. Some evaluation of the performance and reliability of ORACLE as an on-line database is also given

  8. New software for improving performance in wind farm operations

    Collins, Mark [Ekho for Wind (Canada)

    2011-07-01

    The performance of wind farms depends on multiple field and business systems. This makes operational planning difficult because of so many data being in separate systems, duplication of data and the impossibility of gathering all relevant data together in one place. The aim of this paper is to present a new software, Ekho for Wind, which helps improve performance in wind farm operations by providing features such as high level views, performance analysis, downtime tracking, quality data management and forecast generation. This new software provides operational intelligence which offers incentives for continuous improvement. Ekho for Wind can bring such benefits as maximization of generation, increased lifetime of assets, minimization of costs and increased profitability. This presentation introduced a new software for improving the performance of wind farms and the lifetime of assets, resulting in significant payback.

  9. GPU-Accelerated Parallel FDTD on Distributed Heterogeneous Platform

    Ronglin Jiang

    2014-01-01

    This paper introduces a finite-difference time-domain (FDTD) code written in Fortran and CUDA for realistic electromagnetic calculations with parallelization methods of Message Passing Interface (MPI) and Open Multiprocessing (OpenMP). Since both Central Processing Unit (CPU) and Graphics Processing Unit (GPU) resources are utilized, a faster execution speed can be reached compared to a traditional pure GPU code. In our experiments, 64 NVIDIA TESLA K20m GPUs and 64 INTEL XEON E5-2670 CPUs are used to carry out the pure CPU, pure GPU, and CPU + GPU tests. Relative to the pure CPU calculations for the same problems, the speedup ratio achieved by CPU + GPU calculations is around 14. Compared to the pure GPU calculations for the same problems, the CPU + GPU calculations have 7.6%–13.2% performance improvement. Because of the small memory size of GPUs, the FDTD problem size is usually very small. However, this code can enlarge the maximum problem size by 25% without reducing the performance of traditional pure GPU code. Finally, using this code, a microstrip antenna array with 16×18 elements is calculated and the radiation patterns are compared with the ones of MoM. Results show that there is good agreement between them.
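
    To show the kind of stencil update an FDTD code repeats at every time step, here is a minimal one-dimensional sketch in plain Python. It uses normalised units, a Courant number of 0.5, a soft Gaussian source and fixed zero fields at both ends; all of these are assumptions made for illustration, and the function is hypothetical rather than part of the Fortran/CUDA code described above.

```python
import math


def fdtd_1d(num_cells, num_steps, source_cell):
    """Leapfrog update of Ez and Hy on a staggered 1-D grid.

    Returns the electric and magnetic field arrays after num_steps steps.
    """
    ez = [0.0] * num_cells
    hy = [0.0] * num_cells
    for t in range(num_steps):
        # Electric field update from the spatial difference of H
        for k in range(1, num_cells):
            ez[k] += 0.5 * (hy[k - 1] - hy[k])
        # Soft source: a Gaussian pulse added to one cell (assumed shape)
        ez[source_cell] += math.exp(-0.5 * ((t - 30) / 10) ** 2)
        # Magnetic field update from the spatial difference of E
        for k in range(num_cells - 1):
            hy[k] += 0.5 * (ez[k] - ez[k + 1])
    return ez, hy
```

    Within each of the two inner loops every cell reads only its immediate neighbours from the other field, so the cells can be updated independently; this is the property that domain decomposition across MPI ranks and GPU threads relies on.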

  10. GASPRNG: GPU accelerated scalable parallel random number generator library

    Gao, Shuang; Peterson, Gregory D.

    2013-04-01

    workstation with NVIDIA GPU (Tested on Fermi GTX480, Tesla C1060, Tesla M2070). Operating system: Linux with CUDA version 4.0 or later. Should also run on MacOS, Windows, or UNIX. Has the code been vectorized or parallelized?: Yes. Parallelized using MPI directives. RAM: 512 MB to 732 MB (main memory on host CPU, depending on the data type of random numbers) / 512 MB (GPU global memory). Classification: 4.13, 6.5. Nature of problem: Many computational science applications are able to consume large numbers of random numbers. For example, Monte Carlo simulations are able to consume limitless random numbers for the computation as long as resources for the computing are supported. Moreover, parallel computational science applications require independent streams of random numbers to attain statistically significant results. The SPRNG library provides this capability, but at a significant computational cost. The GASPRNG library presented here accelerates the generators of independent streams of random numbers using graphical processing units (GPUs). Solution method: Multiple copies of random number generators in GPUs allow a computational science application to consume large numbers of random numbers from independent, parallel streams. GASPRNG is a random number generator library that allows a computational science application to employ multiple copies of random number generators to boost performance. Users can interface GASPRNG with software code executing on microprocessors and/or GPUs. Running time: The tests provided take a few minutes to run.
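
    The idea of giving each parallel worker its own random number stream can be illustrated with a minimal sketch. It uses only Python's standard library, not the GASPRNG or SPRNG interfaces; the helper names `make_streams` and `estimate_pi` and the seed-offset scheme are assumptions made for illustration. Seeding by offset does not carry the independence guarantees that a parameterized generator library is designed to provide.

```python
import math
import random


def make_streams(base_seed, count):
    """Create one generator per worker (hypothetical helper).

    Assumption: distinct seeds are used as a stand-in for properly
    parameterized independent streams.
    """
    return [random.Random(base_seed + i) for i in range(count)]


def estimate_pi(stream, samples):
    """Monte Carlo estimate of pi drawing only from the given stream."""
    hits = 0
    for _ in range(samples):
        x, y = stream.random(), stream.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return 4.0 * hits / samples


# Each "worker" consumes its own stream; the partial results are averaged.
streams = make_streams(2013, 4)
partials = [estimate_pi(s, 20000) for s in streams]
pi_estimate = sum(partials) / len(partials)
```

    In a real parallel run each stream would live on a separate process or GPU thread; here the workers are evaluated one after another to keep the sketch self-contained.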

  11. Validation of geotechnical software for repository performance assessment

    LeGore, T.; Hoover, J.D.; Khaleel, R.; Thornton, E.C.; Anantatmula, R.P.; Lanigan, D.C.

    1989-01-01

    An important step in the characterization of a high level nuclear waste repository is to demonstrate that geotechnical software, used in performance assessment, correctly models validation. There is another type of validation, called software validation. It is based on meeting the requirements of specifications documents (e.g. IEEE specifications) and does not directly address the correctness of the specifications. The process of comparing physical experimental results with the predicted results should incorporate an objective measure of the level of confidence regarding correctness. This paper reports on a methodology developed that allows the experimental uncertainties to be explicitly included in the comparison process. The methodology also allows objective confidence levels to be associated with the software. In the event of a poor comparison, the method also lays the foundation for improving the software

  12. Computing Low-Rank Approximation of a Dense Matrix on Multicore CPUs with a GPU and Its Application to Solving a Hierarchically Semiseparable Linear System of Equations

    Ichitaro Yamazaki

    2015-01-01

    of their low-rank properties. To compute a low-rank approximation of a dense matrix, in this paper, we study the performance of QR factorization with column pivoting or with restricted pivoting on multicore CPUs with a GPU. We first propose several techniques to reduce the postprocessing time, which is required for restricted pivoting, on a modern CPU. We then examine the potential of using a GPU to accelerate the factorization process with both column and restricted pivoting. Our performance results on two eight-core Intel Sandy Bridge CPUs with one NVIDIA Kepler GPU demonstrate that using the GPU, the factorization time can be reduced by a factor of more than two. In addition, to study the performance of our implementations in practice, we integrate them into a recently developed software StruMF which algebraically exploits such low-rank structures for solving a general sparse linear system of equations. Our performance results for solving Poisson's equations demonstrate that the proposed techniques can significantly reduce the preconditioner construction time of StruMF on the CPUs, and the construction time can be further reduced by 10%–50% using the GPU.

  13. Moving-Target Position Estimation Using GPU-Based Particle Filter for IoT Sensing Applications

    Seongseop Kim

    2017-11-01

    A particle filter (PF) has been introduced for effective position estimation of moving targets for non-Gaussian and nonlinear systems. The time difference of arrival (TDOA) method using an acoustic sensor array has normally been used for estimation by concealing the location of a moving target, especially underwater. In this paper, we propose a GPU-based acceleration of target position estimation using a PF and propose an efficient system and software architecture. The proposed graphics processing unit (GPU)-based algorithm has more advantages in applying PF signal processing to a target system that consists of large-scale Internet of Things (IoT)-driven sensors, because the parallelization is scalable. For the TDOA measurement from the acoustic sensor array, we use the generalized cross correlation phase transform (GCC-PHAT) method to obtain the correlation coefficient of the signal using the Fast Fourier Transform (FFT), and we try to accelerate the calculations of GCC-PHAT based TDOA measurements using FFT with GPU compute unified device architecture (CUDA). The proposed approach utilizes a parallelization method in the target position estimation algorithm using GPU-based PF processing. In addition, it could efficiently estimate sudden movement changes of the target using GPU-based parallel computing, which can also be used for multiple target tracking. It also provides scalability in extending the detection algorithm as the number of sensors increases. Therefore, the proposed architecture can be applied in IoT sensing applications with a large number of sensors. The target estimation algorithm was verified using MATLAB and implemented using GPU CUDA. We implemented the proposed signal processing acceleration system on the target GPU to analyze it in terms of execution time. The execution time of the algorithm is reduced by 55% relative to the CPU standalone operation on the target embedded board, an NVIDIA Jetson TX1. Also, to apply large
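
    The GCC-PHAT step mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: a plain O(n^2) discrete Fourier transform stands in for the FFT/CUDA routines, the test signal is synthetic, and the delay is applied as a circular shift. The function names `dft` and `gcc_phat_delay` are assumptions.

```python
import cmath
import random


def dft(x, inverse=False):
    """Naive discrete Fourier transform (stand-in for an FFT library)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat_delay(ref, sig):
    """Estimate the delay (in samples) of sig relative to ref."""
    n = len(ref)
    spec_ref = dft(ref)
    spec_sig = dft(sig)
    # Cross-power spectrum, then PHAT weighting: keep phase, drop magnitude.
    cross = [b * a.conjugate() for a, b in zip(spec_ref, spec_sig)]
    whitened = [c / (abs(c) + 1e-12) for c in cross]
    corr = [v.real for v in dft(whitened, inverse=True)]
    peak = max(range(n), key=lambda i: corr[i])
    # Map the upper half of the circular lags to negative delays.
    return peak if peak <= n // 2 else peak - n


rng = random.Random(0)
ref = [rng.uniform(-1.0, 1.0) for _ in range(64)]
true_delay = 5
sig = ref[-true_delay:] + ref[:-true_delay]  # sig[t] = ref[t - 5], circular
estimated = gcc_phat_delay(ref, sig)
```

    Because the whitening discards magnitude, the correlation collapses to a sharp peak at the lag, which is what makes the method attractive for TDOA measurement.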

  14. Fully 3D GPU PET reconstruction

    Herraiz, J.L., E-mail: joaquin@nuclear.fis.ucm.es [Grupo de Fisica Nuclear, Departmento Fisica Atomica, Molecular y Nuclear, Universidad Complutense de Madrid (Spain); Espana, S. [Department of Radiation Oncology, Massachusetts General Hospital and Harvard Medical School, Boston, MA (United States); Cal-Gonzalez, J. [Grupo de Fisica Nuclear, Departmento Fisica Atomica, Molecular y Nuclear, Universidad Complutense de Madrid (Spain); Vaquero, J.J. [Departmento de Bioingenieria e Ingenieria Espacial, Universidad Carlos III, Madrid (Spain); Desco, M. [Departmento de Bioingenieria e Ingenieria Espacial, Universidad Carlos III, Madrid (Spain); Unidad de Medicina y Cirugia Experimental, Hospital General Universitario Gregorio Maranon, Madrid (Spain); Udias, J.M. [Grupo de Fisica Nuclear, Departmento Fisica Atomica, Molecular y Nuclear, Universidad Complutense de Madrid (Spain)

    2011-08-21

    Fully 3D iterative tomographic image reconstruction is computationally very demanding. Graphics Processing Units (GPUs) have been proposed for many years as potential accelerators for complex scientific problems, but only with the recent advances in GPU programmability have the best available reconstruction codes started to be implemented to run on GPUs. This work presents a GPU-based fully 3D PET iterative reconstruction software. This new code may reconstruct sinogram data from several commercially available PET scanners. The most important and time-consuming parts of the code, the forward and backward projection operations, are based on an accurate model of the scanner obtained with the Monte Carlo code PeneloPET and they have been massively parallelized on the GPU. For the PET scanners considered, the GPU-based code is more than 70 times faster than a similar code running on a single core of a fast CPU, obtaining in both cases the same images. The code has been designed to be easily adapted to reconstruct sinograms from any other PET scanner, including scanner prototypes.

  15. Fully 3D GPU PET reconstruction

    Herraiz, J.L.; Espana, S.; Cal-Gonzalez, J.; Vaquero, J.J.; Desco, M.; Udias, J.M.

    2011-01-01

    Fully 3D iterative tomographic image reconstruction is computationally very demanding. Graphics Processing Units (GPUs) have been proposed for many years as potential accelerators for complex scientific problems, but only with the recent advances in GPU programmability have the best available reconstruction codes started to be implemented to run on GPUs. This work presents a GPU-based fully 3D PET iterative reconstruction software. This new code may reconstruct sinogram data from several commercially available PET scanners. The most important and time-consuming parts of the code, the forward and backward projection operations, are based on an accurate model of the scanner obtained with the Monte Carlo code PeneloPET and they have been massively parallelized on the GPU. For the PET scanners considered, the GPU-based code is more than 70 times faster than a similar code running on a single core of a fast CPU, obtaining in both cases the same images. The code has been designed to be easily adapted to reconstruct sinograms from any other PET scanner, including scanner prototypes.

  16. Performance evaluation software moving object detection and tracking in videos

    Karasulu, Bahadir

    2013-01-01

    Performance Evaluation Software: Moving Object Detection and Tracking in Videos introduces a software approach for the real-time evaluation and performance comparison of the methods specializing in moving object detection and/or tracking (D&T) in video processing. Digital video content analysis is an important item for multimedia content-based indexing (MCBI), content-based video retrieval (CBVR) and visual surveillance systems. There are some frequently-used generic algorithms for video object D&T in the literature, such as Background Subtraction (BS), Continuously Adaptive Mean-shift (CMS),

  17. Performance testing of LiDAR exploitation software

    Varela-González, M.; González-Jorge, H.; Riveiro, B.; Arias, P.

    2013-04-01

    Mobile LiDAR systems have been widely used in recent years for many applications in the field of geoscience. One of the most important limitations of this technology is the large computational requirements involved in data processing. Several software solutions for data processing are available in the market, but users often do not know the methodologies to verify their performance accurately. In this work a methodology for LiDAR software performance testing is presented and six different suites are studied: QT Modeler, AutoCAD Civil 3D, Mars 7, Fledermaus, Carlson and TopoDOT (all of them in x64). Results show that QT Modeler, TopoDOT and AutoCAD Civil 3D allow the loading of large datasets, while Fledermaus, Mars 7 and Carlson do not achieve this level of performance. AutoCAD Civil 3D needs a long loading time in comparison with the most powerful software packages such as QT Modeler and TopoDOT. The Carlson suite shows the poorest results among all the software under study: point clouds larger than 5 million points cannot be loaded, and loading time is very long in comparison with the other suites even for the smaller datasets. AutoCAD Civil 3D, Carlson and TopoDOT show more threads than other software such as QT Modeler, Mars 7 and Fledermaus.

  18. GPU-accelerated computation of electron transfer.

    Höfinger, Siegfried; Acocella, Angela; Pop, Sergiu C; Narumi, Tetsu; Yasuoka, Kenji; Beu, Titus; Zerbetto, Francesco

    2012-11-05

    Electron transfer is a fundamental process that can be studied with the help of computer simulation. The underlying quantum mechanical description renders the problem a computationally intensive application. In this study, we probe the graphics processing unit (GPU) for suitability to this type of problem. Time-critical components are identified via profiling of an existing implementation and several different variants are tested involving the GPU at increasing levels of abstraction. A publicly available library supporting basic linear algebra operations on the GPU turns out to accelerate the computation approximately 50-fold with minor dependence on actual problem size. The performance gain does not compromise numerical accuracy and is of significant value for practical purposes. Copyright © 2012 Wiley Periodicals, Inc.

  19. Parallel GPU implementation of PWR reactor burnup

    Heimlich, A.; Silva, F.C.; Martinez, A.S.

    2016-01-01

    Highlights: • Three GPU algorithms used to evaluate the burn-up in a PWR reactor. • Exhibit speed improvement exceeding 200 times over the sequential solver. • The C++ container is expandable to accept new nuclide chains. - Abstract: This paper surveys three methods, implemented for multi-core CPU and graphics processing unit (GPU), to evaluate the fuel burn-up in a pressurized light water nuclear reactor (PWR) using the solutions of a large system of coupled ordinary differential equations. The reactor physics simulation of a PWR reactor spends a long execution time on burn-up calculations, so performance improvement using GPU can lead to better core design and thus an extended fuel life cycle. The results of this study exhibit speed improvement exceeding 200 times over the sequential solver, within 1% accuracy.

  20. High performance pseudo-analytical simulation of multi-object adaptive optics over multi-GPU systems

    Abdelfattah, Ahmad; Gendron, Éric; Gratadour, Damien; Keyes, David E.; Ltaief, Hatem; Sevin, Arnaud; Vidal, Fabrice

    2014-01-01

    Multi-object adaptive optics (MOAO) is a novel adaptive optics (AO) technique dedicated to the special case of wide-field multi-object spectrographs (MOS). It applies dedicated wavefront corrections to numerous independent tiny patches spread over a large field of view (FOV). The control of each deformable mirror (DM) is done individually using a tomographic reconstruction of the phase based on measurements from a number of wavefront sensors (WFS) pointing at natural and artificial guide stars in the field. The output of this study helps the design of a new instrument called MOSAIC, a multi-object spectrograph proposed for the European Extremely Large Telescope (E-ELT). We have developed a novel hybrid pseudo-analytical simulation scheme that allows us to accurately simulate in detail the tomographic problem. The main challenge resides in the computation of the tomographic reconstructor, which involves pseudo-inversion of a large dense symmetric matrix. The pseudo-inverse is computed using an eigenvalue decomposition, based on the divide and conquer algorithm, on multicore systems with multi-GPUs. Thanks to a new symmetric matrix-vector product (SYMV) multi-GPU kernel, our overall implementation scores significant speedups over standard numerical libraries on multicore, like Intel MKL, and up to 60% speedups over the standard MAGMA implementation on 8 Kepler K20c GPUs. At 40,000 unknowns, this appears to be the largest-scale tomographic AO matrix solver submitted to computation, to date, to our knowledge and opens new research directions for extreme scale AO simulations. © 2014 Springer International Publishing Switzerland.
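
    The pseudo-inversion via eigenvalue decomposition mentioned in this abstract follows a standard identity for symmetric matrices, written out below as a brief sketch. The symbols $A$, $Q$, $\Lambda$ and the truncation threshold $\tau$ are notation assumed here for illustration; the abstract does not state how small eigenvalues are filtered.

```latex
% Symmetric eigenvalue decomposition of the dense matrix A
A = Q \Lambda Q^{T}, \qquad Q^{T} Q = I, \qquad
\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n)

% Pseudo-inverse: invert only the eigenvalues above the threshold \tau
A^{+} = Q \Lambda^{+} Q^{T}, \qquad
(\Lambda^{+})_{ii} =
\begin{cases}
  1/\lambda_i, & |\lambda_i| > \tau \\
  0,           & \text{otherwise}
\end{cases}
```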

  1. Conference on High Performance Software for Nonlinear Optimization

    Murli, Almerico; Pardalos, Panos; Toraldo, Gerardo

    1998-01-01

    This book contains a selection of papers presented at the conference on High Performance Software for Nonlinear Optimization (HPSNO97) which was held in Ischia, Italy, in June 1997. The rapid progress of computer technologies, including new parallel architectures, has stimulated a large amount of research devoted to building software environments and defining algorithms able to fully exploit this new computational power. In some sense, numerical analysis has to conform itself to the new tools. The impact of parallel computing in nonlinear optimization, which had a slow start at the beginning, seems now to increase at a fast rate, and it is reasonable to expect an even greater acceleration in the future. As with the first HPSNO conference, the goal of the HPSNO97 conference was to supply a broad overview of the more recent developments and trends in nonlinear optimization, emphasizing the algorithmic and high performance software aspects. Bringing together new computational methodologies with theoretical...

  2. Mitigating the controller performance bottlenecks in Software Defined Networks

    Caba, Cosmin Marius; Soler, José

    2016-01-01

    The centralization of the control plane decision logic in Software Defined Networking (SDN) has raised concerns regarding the performance of the SDN Controller (SDNC) when the network scales up. A number of solutions have been proposed in the literature to address these concerns. This paper...

  3. Component-based software for high-performance scientific computing

    Alexeev, Yuri; Allan, Benjamin A; Armstrong, Robert C; Bernholdt, David E; Dahlgren, Tamara L; Gannon, Dennis; Janssen, Curtis L; Kenny, Joseph P; Krishnan, Manojkumar; Kohl, James A; Kumfert, Gary; McInnes, Lois Curfman; Nieplocha, Jarek; Parker, Steven G; Rasmussen, Craig; Windus, Theresa L

    2005-01-01

    Recent advances in both computational hardware and multidisciplinary science have given rise to an unprecedented level of complexity in scientific simulation software. This paper describes an ongoing grass roots effort aimed at addressing complexity in high-performance computing through the use of Component-Based Software Engineering (CBSE). Highlights of the benefits and accomplishments of the Common Component Architecture (CCA) Forum and SciDAC ISIC are given, followed by an illustrative example of how the CCA has been applied to drive scientific discovery in quantum chemistry. Thrusts for future research are also described briefly.

  4. Component-based software for high-performance scientific computing

    Alexeev, Yuri; Allan, Benjamin A; Armstrong, Robert C; Bernholdt, David E; Dahlgren, Tamara L; Gannon, Dennis; Janssen, Curtis L; Kenny, Joseph P; Krishnan, Manojkumar; Kohl, James A; Kumfert, Gary; McInnes, Lois Curfman; Nieplocha, Jarek; Parker, Steven G; Rasmussen, Craig; Windus, Theresa L

    2005-01-01

    Recent advances in both computational hardware and multidisciplinary science have given rise to an unprecedented level of complexity in scientific simulation software. This paper describes an ongoing grass roots effort aimed at addressing complexity in high-performance computing through the use of Component-Based Software Engineering (CBSE). Highlights of the benefits and accomplishments of the Common Component Architecture (CCA) Forum and SciDAC ISIC are given, followed by an illustrative example of how the CCA has been applied to drive scientific discovery in quantum chemistry. Thrusts for future research are also described briefly

  5. Software life cycle dynamic simulation model: The organizational performance submodel

    Tausworthe, Robert C.

    1985-01-01

    The submodel structure of a software life cycle dynamic simulation model is described. The software process is divided into seven phases, each with product, staff, and funding flows. The model is subdivided into an organizational response submodel, a management submodel, a management influence interface, and a model analyst interface. The concentration here is on the organizational response model, which simulates the performance characteristics of a software development subject to external and internal influences. These influences emanate from two sources: the model analyst interface, which configures the model to simulate the response of an implementing organization subject to its own internal influences, and the management submodel that exerts external dynamic control over the production process. A complete characterization is given of the organizational response submodel in the form of parameterized differential equations governing product, staffing, and funding levels. The parameter values and functions are allocated to the two interfaces.

  6. Multi-core and GPU accelerated simulation of a radial star target imaged with equivalent t-number circular and Gaussian pupils

    Greynolds, Alan W.

    2013-09-01

    Results from the GelOE optical engineering software are presented for the through-focus, monochromatic coherent and polychromatic incoherent imaging of a radial "star" target for equivalent t-number circular and Gaussian pupils. The FFT-based simulations are carried out using OpenMP threading on a multi-core desktop computer, with and without the aid of a many-core NVIDIA GPU accessing its cuFFT library. It is found that a custom FFT optimized for the 12-core host has similar performance to a simply implemented 256-core GPU FFT. A more sophisticated version of the latter but tuned to reduce overhead on a 448-core GPU is 20 to 28 times faster than a basic FFT implementation running on one CPU core.

  7. A Knowledge-based Environment for Software Process Performance Analysis

    Natália Chaves Lessa Schots

    2015-08-01

    Background: Process performance analysis is a key step for implementing continuous improvement in software organizations. However, the knowledge to execute such analysis is not trivial and the person responsible for executing it must be provided with appropriate support. Aim: This paper presents a knowledge-based environment, named SPEAKER, proposed for supporting software organizations during the execution of process performance analysis. SPEAKER comprises a body of knowledge and a set of activities and tasks for software process performance analysis along with supporting tools for executing these activities and tasks. Method: We conducted an informal literature review and a systematic mapping study, which provided basic requirements for the proposed environment. We implemented the SPEAKER environment integrating supporting tools for the execution of activities and tasks of performance analysis and the knowledge necessary to execute them, in order to meet the variability presented by the characteristics of these activities. Results: In this paper, we describe each SPEAKER module and the individual evaluations of these modules, and also present an example of use showing how the environment can guide the user through a specific performance analysis activity. Conclusion: Although we only conducted individual evaluations of SPEAKER’s modules, the example of use indicates the feasibility of the proposed environment. Therefore, the environment as a whole will be further evaluated to verify if it attains its goal of assisting in the execution of process performance analysis by non-specialist people.

  8. Hardware support for software controlled fast reconfiguration of performance counters

    Salapura, Valentina; Wisniewski, Robert W.

    2013-06-18

    Hardware support for software controlled reconfiguration of performance counters may include a plurality of performance counters collecting one or more counts of one or more selected activities. A storage element stores data value representing a time interval, and a timer element reads the data value and detects expiration of the time interval based on the data value and generates a signal. A plurality of configuration registers stores a set of performance counter configurations. A state machine receives the signal and selects a configuration register from the plurality of configuration registers for reconfiguring the one or more performance counters.
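
    The mechanism described in this abstract, in which a timer expiry triggers a state machine that selects the next counter configuration, can be modeled in software as a rough sketch. Everything here is hypothetical: the class name `CounterMultiplexer`, the event names, and in particular the round-robin selection policy, which the abstract does not specify.

```python
class CounterMultiplexer:
    """Toy software model of timer-driven performance counter reconfiguration."""

    def __init__(self, configurations, interval):
        self.configurations = list(configurations)  # stand-in for configuration registers
        self.interval = interval                    # stand-in for the stored time interval
        self.elapsed = 0
        self.active = 0                             # index of the active configuration
        self.counts = {name: 0 for name in self.configurations}

    def tick(self, events):
        """Advance one cycle; `events` maps event name to occurrences this cycle."""
        name = self.configurations[self.active]
        self.counts[name] += events.get(name, 0)    # only the active event is counted
        self.elapsed += 1
        if self.elapsed == self.interval:           # timer expired: reconfigure
            self.elapsed = 0
            # Assumption: the state machine cycles through configurations in order.
            self.active = (self.active + 1) % len(self.configurations)


mux = CounterMultiplexer(["cache_miss", "branch"], interval=2)
for _ in range(8):
    mux.tick({"cache_miss": 1, "branch": 3})
```

    After eight cycles each event has been observed for four of them, so a sampled count can be scaled by the fraction of time its configuration was active to approximate the full-run total.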

  9. Communication Software Performance for Linux Clusters with Mesh Connections

    Jie Chen; William Watson

    2003-09-01

    Recent progress in copper-based commodity Gigabit Ethernet interconnects enables constructing clusters to achieve extremely high I/O bandwidth at low cost with mesh connections. However, the TCP/IP protocol stack cannot match the improved performance of Gigabit Ethernet networks, especially in the case of multiple interconnects on a single host. In this paper, we evaluate and compare the performance characteristics of TCP/IP and M-VIA software, which is an implementation of VIA. In particular, we focus on the performance of the software systems for a mesh communication architecture and demonstrate the feasibility of using multiple Gigabit Ethernet cards on one host to achieve aggregated bandwidth and latency that are not only better than what TCP provides but also compare favorably to some of the special-purpose high-speed networks. In addition, implementation of a new M-VIA driver for one type of Gigabit Ethernet card will be discussed.

  10. Performance testing of 3D point cloud software

    M. Varela-González

    2013-10-01

    LiDAR systems have been widely used in recent years for many applications in the engineering field: civil engineering, cultural heritage, mining, industry and environmental engineering. One of the most important limitations of this technology is the large computational requirements involved in data processing, especially for large mobile LiDAR datasets. Several software solutions for data management are available in the market, including open source suites; however, users often do not know of methodologies to verify their performance properly. In this work a methodology for LiDAR software performance testing is presented and four different suites are studied: QT Modeler, VR Mesh, AutoCAD 3D Civil and the Point Cloud Library running in software developed at the University of Vigo (SITEGI). The software based on the Point Cloud Library shows better results in the loading time of the point clouds and CPU usage. However, it is not as strong as commercial suites in working set and commit size tests.

  11. Large scale and performance tests of the ATLAS online software

    Alexandrov; Kotov, V.; Mineev, M.; Roumiantsev, V.; Wolters, H.; Amorim, A.; Pedro, L.; Ribeiro, A.; Badescu, E.; Caprini, M.; Burckhart-Chromek, D.; Dobson, M.; Jones, R.; Kazarov, A.; Kolos, S.; Liko, D.; Lucio, L.; Mapelli, L.; Nassiakou, M.; Schweiger, D.; Soloviev, I.; Hart, R.; Ryabov, Y.; Moneta, L.

    2001-01-01

    One of the sub-systems of the Trigger/DAQ system of the future ATLAS experiment is the Online Software system. It encompasses the functionality needed to configure, control and monitor the DAQ. Its architecture is based on a component structure described in the ATLAS Trigger/DAQ technical proposal. Regular integration tests ensure its smooth operation in test beam setups during its evolutionary development towards the final ATLAS online system. Feedback is received and returned into the development process. Studies of the system behavior have been performed on a set of up to 111 PCs in a configuration which is getting closer to the final size. Large scale and performance tests of the integrated system were performed on this setup with emphasis on investigating the aspects of the inter-dependence of the components and the performance of the communication software. Of particular interest were the run control state transitions in various configurations of the run control hierarchy. For the purpose of the tests, the software from other Trigger/DAQ sub-systems has been emulated. The author presents a brief overview of the online system structure, its components and the large scale integration tests and their results.

  12. High-Level Synthesis: Productivity, Performance, and Software Constraints

    Yun Liang

    2012-01-01

    FPGAs are an attractive platform for applications with high computation demand and low energy consumption requirements. However, design effort for FPGA implementations remains high, often an order of magnitude larger than design effort using high-level languages. Instead of this time-consuming process, high-level synthesis (HLS) tools generate hardware implementations from algorithm descriptions in languages such as C/C++ and SystemC. Such tools reduce design effort: high-level descriptions are more compact and less error prone. HLS tools promise hardware development abstracted from software designer knowledge of the implementation platform. In this paper, we present an unbiased study of the performance, usability and productivity of HLS using AutoPilot (a state-of-the-art HLS tool). In particular, we first evaluate AutoPilot using the popular embedded benchmark kernels. Then, to evaluate the suitability of HLS on real-world applications, we perform a case study of stereo matching, an active area of computer vision research that uses techniques also common for image denoising, image retrieval, feature matching, and face recognition. Based on our study, we provide insights on current limitations of mapping general-purpose software to hardware using HLS and some future directions for HLS tool development. We also offer several guidelines for hardware-friendly software design. For popular embedded benchmark kernels, the designs produced by HLS achieve 4X to 126X speedup over the software version. The stereo matching algorithms achieve between 3.5X and 67.9X speedup over software (but still less than manual RTL design) with a fivefold reduction in design effort versus manual RTL design.

  13. Software for evaluation of EPR-dosimetry performance

    Shishkina, E.A.; Timofeev, Yu.S.; Ivanov, D.V.

    2014-01-01

    Electron paramagnetic resonance (EPR) with tooth enamel is a method extensively used for retrospective external dosimetry. Different research groups apply different equipment, sample preparation procedures and spectrum processing algorithms for EPR dosimetry. A uniform algorithm for description and comparison of performances was designed and implemented in a new computer code. The aim of the paper is to introduce the new software 'EPR-dosimetry performance'. The computer code is a user-friendly tool for providing a full description of method-specific capabilities of EPR tooth dosimetry, from metrological characteristics to practical limitations in applications. The software designed for scientists and engineers has several applications, including support of method calibration by evaluation of calibration parameters, evaluation of critical value and detection limit for registration of radiation-induced signal amplitude, estimation of critical value and detection limit for dose evaluation, estimation of minimal detectable value for anthropogenic dose assessment and description of method uncertainty. (authors)

  14. Runtime Performance Monitoring Tool for RTEMS System Software

    Cho, B.; Kim, S.; Park, H.; Kim, H.; Choi, J.; Chae, D.; Lee, J.

    2007-08-01

    RTEMS is a commercial-grade real-time operating system that supports multi-processor computers. However, there are not many development tools for RTEMS. In this paper, we report a new RTEMS-based runtime performance monitoring tool. We have implemented a lightweight runtime monitoring task with an extension to the RTEMS APIs. Using our tool, software developers can verify various performance-related parameters during runtime. Our tool can be used during the software development phase and during in-orbit operation as well. Our implemented target agent is lightweight and has small overhead using the SpaceWire interface. Efforts to reduce overhead and to add other monitoring parameters are currently under research.

  15. GPU Linear algebra extensions for GNU/Octave

    Bosi, L B; Mariotti, M; Santocchia, A

    2012-01-01

    Octave is one of the most widely used open source tools for numerical analysis and linear algebra. Our project aims to improve Octave by introducing support for GPU computing in order to speed up some linear algebra operations. The core of our work is a C library that executes some BLAS operations concerning vector-vector, vector-matrix and matrix-matrix functions on the GPU. OpenCL functions are used to program GPU kernels, which are bound within the GNU/Octave framework. We report the project implementation design and some preliminary results about performance.

  16. GPU - Accelerated Monte Carlo electron transport methods: development and application for radiation dose calculations using 6 GPU cards

    Su, L.; Du, X.; Liu, T.; Xu, X. G.

    2013-01-01

    An electron-photon coupled Monte Carlo code ARCHER - Accelerated Radiation-transport Computations in Heterogeneous EnviRonments - is being developed at Rensselaer Polytechnic Institute as a software test-bed for emerging heterogeneous high performance computers that utilize accelerators such as GPUs (Graphics Processing Units). This paper presents the preliminary code development and the testing involving radiation dose related problems. In particular, the paper discusses the electron transport simulations using the class-II condensed history method. The considered electron energy ranges from a few hundred keV to 30 MeV. For the photon part, the photoelectric effect, Compton scattering and pair production were simulated. Voxelized geometry was supported. A serial CPU (Central Processing Unit) code was first written in C++. The code was then transplanted to the GPU using the CUDA C 5.0 standards. The hardware involved a desktop PC with an Intel Xeon X5660 CPU and six NVIDIA Tesla M2090 GPUs. The code was tested for a case of a 20 MeV electron beam incident perpendicularly on a water-aluminum-water phantom. The depth and lateral dose profiles were found to agree with results obtained from well tested MC codes. Using six GPU cards, 6×10^6 electron histories were simulated within 2 seconds. In comparison, the same case running the EGSnrc and MCNPX codes required 1645 seconds and 9213 seconds, respectively. On-going work continues to test the code for different medical applications such as radiotherapy and brachytherapy. (authors)

  17. Gpufit: An open-source toolkit for GPU-accelerated curve fitting.

    Przybylski, Adrian; Thiel, Björn; Keller-Findeisen, Jan; Stock, Bernd; Bates, Mark

    2017-11-16

    We present a general purpose, open-source software library for estimation of non-linear parameters by the Levenberg-Marquardt algorithm. The software, Gpufit, runs on a Graphics Processing Unit (GPU) and executes computations in parallel, resulting in a significant gain in performance. We measured a speed increase of up to 42 times when comparing Gpufit with an identical CPU-based algorithm, with no loss of precision or accuracy. Gpufit is designed such that it is easily incorporated into existing applications or adapted for new ones. Multiple software interfaces, including to C, Python, and Matlab, ensure that Gpufit is accessible from most programming environments. The full source code is published as an open source software repository, making its function transparent to the user and facilitating future improvements and extensions. As a demonstration, we used Gpufit to accelerate an existing scientific image analysis package, yielding significantly improved processing times for super-resolution fluorescence microscopy datasets.
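
    The Levenberg-Marquardt iteration named in the abstract above can be illustrated with a minimal CPU sketch. This is not Gpufit's code or API: the exponential-decay model, the function names and the damping schedule (divide or multiply lambda by 10) are assumptions chosen for illustration, and a GPU library would run many such independent fits in parallel.

```python
import math

def model(x, a, b):
    """Hypothetical fit model: exponential decay a * exp(-b * x)."""
    return a * math.exp(-b * x)

def sum_sq(xs, ys, a, b):
    """Sum of squared residuals for parameters (a, b)."""
    return sum((y - model(x, a, b)) ** 2 for x, y in zip(xs, ys))

def levenberg_marquardt(xs, ys, a, b, lam=1e-3, max_iter=50):
    """Fit (a, b) with damped Gauss-Newton steps (Levenberg-Marquardt)."""
    cost = sum_sq(xs, ys, a, b)
    for _ in range(max_iter):
        # Accumulate J^T J (2x2) and J^T r (2x1) over all data points.
        jaa = jab = jbb = ga = gb = 0.0
        for x, y in zip(xs, ys):
            e = math.exp(-b * x)
            r = y - a * e        # residual
            da = e               # d(model)/da
            db = -a * x * e      # d(model)/db
            jaa += da * da
            jab += da * db
            jbb += db * db
            ga += da * r
            gb += db * r
        # Marquardt damping: inflate the diagonal by (1 + lam).
        haa = jaa * (1.0 + lam)
        hbb = jbb * (1.0 + lam)
        det = haa * hbb - jab * jab
        if det == 0.0:
            break
        # Solve the damped 2x2 normal equations by Cramer's rule.
        step_a = (hbb * ga - jab * gb) / det
        step_b = (haa * gb - jab * ga) / det
        new_cost = sum_sq(xs, ys, a + step_a, b + step_b)
        if new_cost < cost:
            # Accept the step and trust the Gauss-Newton direction more.
            a, b, cost = a + step_a, b + step_b, new_cost
            lam /= 10.0
        else:
            # Reject the step and move toward gradient descent.
            lam *= 10.0
    return a, b
```

    On noiseless synthetic data generated from the same model, a call such as levenberg_marquardt(xs, ys, 1.0, 0.1) should recover the generating parameters closely; real data would also need a convergence test and bounds on the parameters.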

  18. Software Systems for High-performance Quantum Computing

    Humble, Travis S [ORNL]; Britt, Keith A [ORNL]

    2016-01-01

    Quantum computing promises new opportunities for solving hard computational problems, but harnessing this novelty requires breakthrough concepts in the design, operation, and application of computing systems. We define some of the challenges facing the development of quantum computing systems as well as software-based approaches that can be used to overcome these challenges. Following a brief overview of the state of the art, we present quantum programming and execution models, the development of architectures for hybrid high-performance computing systems, and the realization of software stacks for quantum networking. This leads to a discussion of the role that conventional computing plays in the quantum paradigm and how some of the current challenges for exascale computing overlap with those facing quantum computing.

  19. 76 FR 60939 - Metal Fatigue Analysis Performed by Computer Software

    2011-09-30

    ... Software AGENCY: Nuclear Regulatory Commission. ACTION: Regulatory issue summary; request for comment... computer software package, WESTEMS™, to demonstrate compliance with Section III, ``Rules for... Software Addressees All holders of, and applicants for, a power reactor operating license or construction...

  20. High-performance GPU-based rendering for real-time, rigid 2D/3D-image registration and motion prediction in radiation oncology.

    Spoerk, Jakob; Gendrin, Christelle; Weber, Christoph; Figl, Michael; Pawiro, Supriyanto Ardjo; Furtado, Hugo; Fabri, Daniella; Bloch, Christoph; Bergmann, Helmar; Gröller, Eduard; Birkfellner, Wolfgang

    2012-02-01

    A common problem in image-guided radiation therapy (IGRT) of lung cancer as well as other malignant diseases is the compensation of periodic and aperiodic motion during dose delivery. Modern systems for image-guided radiation oncology allow for the acquisition of cone-beam computed tomography data in the treatment room as well as the acquisition of planar radiographs during the treatment. A mid-term research goal is the compensation of tumor target volume motion by 2D/3D Registration. In 2D/3D registration, spatial information on organ location is derived by an iterative comparison of perspective volume renderings, so-called digitally rendered radiographs (DRR) from computed tomography volume data, and planar reference x-rays. Currently, this rendering process is very time consuming, and real-time registration, which should at least provide data on organ position in less than a second, has not come into existence. We present two GPU-based rendering algorithms which generate a DRR of 512×512 pixels size from a CT dataset of 53 MB size at a pace of almost 100 Hz. This rendering rate is feasible by applying a number of algorithmic simplifications which range from alternative volume-driven rendering approaches - namely so-called wobbled splatting - to sub-sampling of the DRR-image by means of specialized raycasting techniques. Furthermore, general purpose graphics processing unit (GPGPU) programming paradigms were consequently utilized. Rendering quality and performance as well as the influence on the quality and performance of the overall registration process were measured and analyzed in detail. The results show that both methods are competitive and pave the way for fast motion compensation by rigid and possibly even non-rigid 2D/3D registration and, beyond that, adaptive filtering of motion models in IGRT. Copyright © 2011. Published by Elsevier GmbH.

  1. High-performance GPU-based rendering for real-time, rigid 2D/3D-image registration and motion prediction in radiation oncology

    Spoerk, Jakob; Gendrin, Christelle; Weber, Christoph [Medical University of Vienna (Austria). Center of Medical Physics and Biomedical Engineering] [and others]

    2012-07-01

    A common problem in image-guided radiation therapy (IGRT) of lung cancer as well as other malignant diseases is the compensation of periodic and aperiodic motion during dose delivery. Modern systems for image-guided radiation oncology allow for the acquisition of cone-beam computed tomography data in the treatment room as well as the acquisition of planar radiographs during the treatment. A mid-term research goal is the compensation of tumor target volume motion by 2D/3D Registration. In 2D/3D registration, spatial information on organ location is derived by an iterative comparison of perspective volume renderings, so-called digitally rendered radiographs (DRR) from computed tomography volume data, and planar reference X-rays. Currently, this rendering process is very time consuming, and real-time registration, which should at least provide data on organ position in less than a second, has not come into existence. We present two GPU-based rendering algorithms which generate a DRR of 512 x 512 pixels size from a CT dataset of 53 MB size at a pace of almost 100 Hz. This rendering rate is feasible by applying a number of algorithmic simplifications which range from alternative volume-driven rendering approaches - namely so-called wobbled splatting - to sub-sampling of the DRR-image by means of specialized raycasting techniques. Furthermore, general purpose graphics processing unit (GPGPU) programming paradigms were consequently utilized. Rendering quality and performance as well as the influence on the quality and performance of the overall registration process were measured and analyzed in detail. The results show that both methods are competitive and pave the way for fast motion compensation by rigid and possibly even non-rigid 2D/3D registration and, beyond that, adaptive filtering of motion models in IGRT. (orig.)

  2. Synergia CUDA: GPU-accelerated accelerator modeling package

    Lu, Q; Amundson, J

    2014-01-01

    Synergia is a parallel, 3-dimensional space-charge particle-in-cell accelerator modeling code. We present our work porting the purely MPI-based version of the code to a hybrid of CPU and GPU computing kernels. The hybrid code uses the CUDA platform in the same framework as the pure MPI solution. We have implemented a lock-free collaborative charge-deposition algorithm for the GPU, as well as other optimizations, including local communication avoidance for GPUs, a customized FFT, and fine-tuned memory access patterns. On a small GPU cluster (up to 4 Tesla C1070 GPUs), our benchmarks exhibit both superior peak performance and better scaling than a CPU cluster with 16 nodes and 128 cores. We also compare the code performance on different GPU architectures, including C1070 Tesla and K20 Kepler.

  3. GPU Computing in Bayesian Inference of Realized Stochastic Volatility Model

    Takaishi, Tetsuya

    2015-01-01

    The realized stochastic volatility (RSV) model that utilizes the realized volatility as additional information has been proposed to infer volatility of financial time series. We consider the Bayesian inference of the RSV model by the Hybrid Monte Carlo (HMC) algorithm. The HMC algorithm can be parallelized and thus performed on the GPU for speedup. The GPU code is developed with CUDA Fortran. We compare the computational time in performing the HMC algorithm on GPU (GTX 760) and CPU (Intel i7-4770 3.4GHz) and find that the GPU can be up to 17 times faster than the CPU. We also code the program with OpenACC and find that appropriate coding can achieve a speedup similar to that of CUDA Fortran.

  4. An efficient spectral crystal plasticity solver for GPU architectures

    Malahe, Michael

    2018-03-01

    We present a spectral crystal plasticity (CP) solver for graphics processing unit (GPU) architectures that achieves a tenfold increase in efficiency over prior GPU solvers. The approach makes use of a database containing a spectral decomposition of CP simulations performed using a conventional iterative solver over a parameter space of crystal orientations and applied velocity gradients. The key improvements in efficiency come from reducing global memory transactions, exposing more instruction-level parallelism, reducing integer instructions and performing fast range reductions on trigonometric arguments. The scheme also makes more efficient use of memory than prior work, allowing for larger problems to be solved on a single GPU. We illustrate these improvements with a simulation of 390 million crystal grains on a consumer-grade GPU, which executes at a rate of 2.72 s per strain step.

  5. Collaborating CPU and GPU for large-scale high-order CFD simulations with complex grids on the TianHe-1A supercomputer

    Xu, Chuanfu, E-mail: xuchuanfu@nudt.edu.cn [College of Computer Science, National University of Defense Technology, Changsha 410073 (China); Deng, Xiaogang; Zhang, Lilun [College of Computer Science, National University of Defense Technology, Changsha 410073 (China); Fang, Jianbin [Parallel and Distributed Systems Group, Delft University of Technology, Delft 2628CD (Netherlands); Wang, Guangxue; Jiang, Yi [State Key Laboratory of Aerodynamics, P.O. Box 211, Mianyang 621000 (China); Cao, Wei; Che, Yonggang; Wang, Yongxian; Wang, Zhenghua; Liu, Wei; Cheng, Xinghua [College of Computer Science, National University of Defense Technology, Changsha 410073 (China)

    2014-12-01

    Programming and optimizing complex, real-world CFD codes on current many-core accelerated HPC systems is very challenging, especially when collaborating CPUs and accelerators to fully tap the potential of heterogeneous systems. In this paper, with a tri-level hybrid and heterogeneous programming model using MPI + OpenMP + CUDA, we port and optimize our high-order multi-block structured CFD software HOSTA on the GPU-accelerated TianHe-1A supercomputer. HOSTA adopts two self-developed high-order compact finite difference schemes, WCNS and HDCS, that can simulate flows with complex geometries. We present a dual-level parallelization scheme for efficient multi-block computation on GPUs and perform particular kernel optimizations for high-order CFD schemes. The GPU-only approach achieves a speedup of about 1.3 when comparing one Tesla M2050 GPU with two Xeon X5670 CPUs. To achieve a greater speedup, we collaborate CPU and GPU for HOSTA instead of using a naive GPU-only approach. We present a novel scheme to balance the loads between the store-poor GPU and the store-rich CPU. Taking CPU and GPU load balance into account, we improve the maximum simulation problem size per TianHe-1A node for HOSTA by 2.3×, while the collaborative approach improves performance by around 45% compared to the GPU-only approach. Further, to scale HOSTA on TianHe-1A, we propose a gather/scatter optimization to minimize PCI-e data transfer times for ghost and singularity data of 3D grid blocks, and overlap the collaborative computation and communication as far as possible using some advanced CUDA and MPI features. Scalability tests show that HOSTA can achieve a parallel efficiency of above 60% on 1024 TianHe-1A nodes. With our method, we have successfully simulated an EET high-lift airfoil configuration containing 800M cells and China's large civil airplane configuration containing 150M cells. To the best of our knowledge, those are the largest-scale CPU–GPU collaborative simulations.

  6. Collaborating CPU and GPU for large-scale high-order CFD simulations with complex grids on the TianHe-1A supercomputer

    Xu, Chuanfu; Deng, Xiaogang; Zhang, Lilun; Fang, Jianbin; Wang, Guangxue; Jiang, Yi; Cao, Wei; Che, Yonggang; Wang, Yongxian; Wang, Zhenghua; Liu, Wei; Cheng, Xinghua

    2014-01-01

    Programming and optimizing complex, real-world CFD codes on current many-core accelerated HPC systems is very challenging, especially when collaborating CPUs and accelerators to fully tap the potential of heterogeneous systems. In this paper, with a tri-level hybrid and heterogeneous programming model using MPI + OpenMP + CUDA, we port and optimize our high-order multi-block structured CFD software HOSTA on the GPU-accelerated TianHe-1A supercomputer. HOSTA adopts two self-developed high-order compact finite difference schemes, WCNS and HDCS, that can simulate flows with complex geometries. We present a dual-level parallelization scheme for efficient multi-block computation on GPUs and perform particular kernel optimizations for high-order CFD schemes. The GPU-only approach achieves a speedup of about 1.3 when comparing one Tesla M2050 GPU with two Xeon X5670 CPUs. To achieve a greater speedup, we collaborate CPU and GPU for HOSTA instead of using a naive GPU-only approach. We present a novel scheme to balance the loads between the store-poor GPU and the store-rich CPU. Taking CPU and GPU load balance into account, we improve the maximum simulation problem size per TianHe-1A node for HOSTA by 2.3×, while the collaborative approach improves performance by around 45% compared to the GPU-only approach. Further, to scale HOSTA on TianHe-1A, we propose a gather/scatter optimization to minimize PCI-e data transfer times for ghost and singularity data of 3D grid blocks, and overlap the collaborative computation and communication as far as possible using some advanced CUDA and MPI features. Scalability tests show that HOSTA can achieve a parallel efficiency of above 60% on 1024 TianHe-1A nodes. With our method, we have successfully simulated an EET high-lift airfoil configuration containing 800M cells and China's large civil airplane configuration containing 150M cells. To the best of our knowledge, those are the largest-scale CPU–GPU collaborative simulations.

  7. Parallelization and checkpointing of GPU applications through program transformation

    Solano-Quinde, Lizandro Damian [Iowa State Univ., Ames, IA (United States)]

    2012-01-01

    GPUs have emerged as a powerful tool for accelerating general-purpose applications. The availability of programming languages that make writing general-purpose applications for GPUs tractable has consolidated GPUs as an alternative for accelerating general-purpose applications. Among the areas that have benefited from GPU acceleration are signal and image processing, computational fluid dynamics, quantum chemistry, and, in general, the High Performance Computing (HPC) industry. In order to continue to exploit higher levels of parallelism with GPUs, multi-GPU systems are gaining popularity. In this context, single-GPU applications are parallelized for running in multi-GPU systems. Furthermore, multi-GPU systems help to solve the GPU memory limitation for applications with a large memory footprint. Parallelizing single-GPU applications has been approached by libraries that distribute the workload at runtime; however, they impose execution overhead and are not portable. On the other hand, on traditional CPU systems, parallelization has been approached through application transformation at pre-compile time, which enhances the application to distribute the workload at application level and does not have the issues of library-based approaches. Hence, a parallelization scheme for GPU systems based on application transformation is needed. As with any computing engine of today, reliability is also a concern in GPUs. GPUs are vulnerable to transient and permanent failures. Current checkpoint/restart techniques are not suitable for systems with GPUs. Checkpointing for GPU systems presents new and interesting challenges, primarily due to the natural differences imposed by the hardware design, the memory subsystem architecture, the massive number of threads, and the limited amount of synchronization among threads. Therefore, a checkpoint/restart technique suitable for GPU systems is needed.
The goal of this work is to exploit higher levels of parallelism and

  8. Real-Time Incompressible Fluid Simulation on the GPU

    Xiao Nie

    2015-01-01

    We present a parallel framework for simulating incompressible fluids with predictive-corrective incompressible smoothed particle hydrodynamics (PCISPH) on the GPU in real time. To this end, we propose an efficient GPU streaming pipeline to map the entire computational task onto the GPU, fully exploiting the massive computational power of state-of-the-art GPUs. In PCISPH-based simulations, neighbor search is the major performance obstacle because this process is performed several times at each time step. To eliminate this bottleneck, an efficient parallel sorting method for this time-consuming step is introduced. Moreover, we discuss several optimization techniques, including using fast on-chip shared memory to avoid global memory bandwidth limitations and thus further improve performance on modern GPU hardware. With our framework, the realism of real-time fluid simulation is significantly improved, since our method enforces the incompressibility constraint, which is typically ignored for efficiency reasons in previous GPU-based SPH methods. The performance results illustrate that our approach can efficiently simulate realistic incompressible fluid in real time and results in a speed-up factor of up to 23 on a high-end NVIDIA GPU in comparison to a single-threaded CPU-based implementation.
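
    Sorting-based neighbor search of the kind this abstract refers to is commonly built on a uniform grid: particles are sorted by cell key so that each cell occupies one contiguous slice, and a query scans only the adjacent cells. The serial 2D sketch below illustrates that general idea only; it is not the paper's method, and the function names, the 2D setting and the requirement that the search radius not exceed the cell size are assumptions made here.

```python
import math

def cell_of(p, cell_size):
    """Integer grid-cell coordinates of a 2D point."""
    return (math.floor(p[0] / cell_size), math.floor(p[1] / cell_size))

def build_sorted_grid(positions, cell_size):
    """Sort particle indices by cell key and record each cell's slice.

    Returns (order, cell_range), where order lists particle indices sorted
    by cell and cell_range maps a cell key to its (start, end) slice.
    """
    order = sorted(range(len(positions)),
                   key=lambda i: cell_of(positions[i], cell_size))
    cell_range = {}
    for rank, i in enumerate(order):
        key = cell_of(positions[i], cell_size)
        start, _ = cell_range.get(key, (rank, rank))
        cell_range[key] = (start, rank + 1)
    return order, cell_range

def neighbors(positions, order, cell_range, cell_size, i, radius):
    """Indices j != i within radius of particle i.

    Assumes radius <= cell_size, so only the 3x3 block of cells around
    particle i needs to be scanned.
    """
    cx, cy = cell_of(positions[i], cell_size)
    xi, yi = positions[i]
    found = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            start, end = cell_range.get((cx + dx, cy + dy), (0, 0))
            for j in order[start:end]:
                xj, yj = positions[j]
                if j != i and (xj - xi) ** 2 + (yj - yi) ** 2 <= radius * radius:
                    found.append(j)
    return sorted(found)
```

    On a GPU the sort would typically be a parallel radix sort over integer cell keys and the slice boundaries would be found by comparing adjacent keys; the sorted order also places particles of the same cell next to each other in memory.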

  9. Software

    Macedo, R.; Budd, G.; Ross, E.; Wells, P.

    2010-07-15

    The software section of this journal presented new software programs that have been developed to help in the exploration and development of hydrocarbon resources. Software provider IHS Inc. has made additions to its geological and engineering analysis software tool, IHS PETRA, a product used by geoscientists and engineers to visualize, analyze and manage well production, well log, drilling, reservoir, seismic and other related information. IHS PETRA also includes a directional well module and a decline curve analysis module to improve analysis capabilities in unconventional reservoirs. Petris Technology Inc. has developed a software to help manage the large volumes of data. PetrisWinds Enterprise (PWE) helps users find and manage wellbore data, including conventional wireline and MWD core data; analysis core photos and images; waveforms and NMR; and external files documentation. Ottawa-based Ambercore Software Inc. has been collaborating with Nexen on the Petroleum iQ software for steam assisted gravity drainage (SAGD) producers. Petroleum iQ integrates geology and geophysics data with engineering data in 3D and 4D. Calgary-based Envirosoft Corporation has developed a software that reduces the costly and time-consuming effort required to comply with Directive 39 of the Alberta Energy Resources Conservation Board. The product includes an emissions modelling software. Houston-based Seismic Micro-Technology (SMT) has developed the Kingdom software that features the latest in seismic interpretation. Holland-based Joa Oil and Gas and Calgary-based Computer Modelling Group have both supplied the petroleum industry with advanced reservoir simulation software that enables reservoir interpretation. The 2010 software survey included a guide to new software applications designed to facilitate petroleum exploration, drilling and production activities. Oil and gas producers can use the products for a range of functions, including reservoir characterization and accounting. In

  10. Quality Assurance in Software Development: An Exploratory Investigation in Software Project Failures and Business Performance

    Ichu, Emmanuel A.

    2010-01-01

    Software quality is perhaps one of the most sought-after attributes in product development; however, this goal remains unattained. Problem factors in software development and how these have affected the maintainability of the delivered software systems require a thorough investigation. It was, therefore, very important to understand software…

  11. Graphics processing unit (GPU) real-time infrared scene generation

    Christie, Chad L.; Gouthas, Efthimios (Themie); Williams, Owen M.

    2007-04-01

    VIRSuite, the GPU-based suite of software tools developed at DSTO for real-time infrared scene generation, is described. The tools include the painting of scene objects with radiometrically-associated colours, translucent object generation, polar plot validation and versatile scene generation. Special features include radiometric scaling within the GPU and the presence of zoom anti-aliasing at the core of VIRSuite. Extension of the zoom anti-aliasing construct to cover target embedding and the treatment of translucent objects is described.

  12. Gfargo: Fargo for Gpu

    Masset, Frédéric

    2015-09-01

    GFARGO is a GPU version of FARGO. It is written in C and C for CUDA and runs only on NVIDIA’s graphics cards. Though it corresponds to the standard, isothermal version of FARGO, not all functionalities of the CPU version have been translated to CUDA. The code is available in single and double precision versions, the latter compatible with FERMI architectures. GFARGO can run on a graphics card connected to the display, allowing the user to see in real time how the fields evolve.

  13. Survey of using GPU CUDA programming model in medical image analysis

    T. Kalaiselvi

    2017-01-01

    With technological development in the medical industry, the volume of data to be processed is expanding rapidly, and computation time is also increasing due to factors such as 3D and 4D treatment planning, the increasing sophistication of MRI pulse sequences and the growing complexity of algorithms. The graphics processing unit (GPU) addresses these problems through features such as high computation throughput, high memory bandwidth, support for floating-point arithmetic and low cost. Compute unified device architecture (CUDA) is a popular GPU programming model introduced by NVIDIA for parallel computing. This review paper briefly discusses the need for GPU CUDA computing in medical image analysis. The GPU performance of existing algorithms is analyzed and the computational gain is discussed. A few open issues, hardware configurations and optimization principles of existing methods are discussed. The survey concludes with a few optimization techniques for medical imaging algorithms on the GPU. Finally, the limitations and future scope of GPU programming are discussed.

  14. Accelerating three-dimensional FDTD calculations on GPU clusters for electromagnetic field simulation.

    Nagaoka, Tomoaki; Watanabe, Soichi

    2012-01-01

    Electromagnetic simulation with anatomically realistic computational human models using the finite-difference time domain (FDTD) method has recently been performed in a number of fields in biomedical engineering. To improve the method's calculation speed and realize large-scale computing with the computational human model, we adapt three-dimensional FDTD code to a multi-GPU cluster environment with Compute Unified Device Architecture and Message Passing Interface. Our multi-GPU cluster system consists of three nodes. Seven GPU boards (NVIDIA Tesla C2070) are mounted on each node. We examined the performance of the FDTD calculation in the multi-GPU cluster environment. We confirmed that the FDTD calculation on the multi-GPU cluster is faster than that on a multi-GPU system (a single workstation), and we also found that the GPU cluster system calculates faster than a vector supercomputer. In addition, our GPU cluster system allowed us to perform the large-scale FDTD calculation because we were able to use GPU memory of over 100 GB.

  15. Multi-GPU hybrid programming accelerated three-dimensional phase-field model in binary alloy

    Changsheng Zhu

    2018-03-01

    In dendritic growth simulation, computational efficiency and problem scale have an extremely important influence on the simulation efficiency of the three-dimensional phase-field model. Thus, seeking a high-performance calculation method that improves computational efficiency and expands the problem scale is of great significance to research on material microstructure. A high-performance calculation method based on the MPI+CUDA hybrid programming model is introduced. Multiple GPUs are used to implement quantitative numerical simulations of the three-dimensional phase-field model in a binary alloy under the condition of multi-physical process coupling. The acceleration effect of different GPU nodes at different calculation scales is explored. Building on the multi-GPU calculation model, two optimization schemes are proposed: non-blocking communication optimization, and overlap of MPI and GPU computing optimization. The results of the two optimization schemes and the basic multi-GPU model are compared. The calculation results show that the multi-GPU calculation model clearly improves the computational efficiency of the three-dimensional phase-field model, reaching 13 times that of a single GPU, and the problem scale has been expanded to 8193. The feasibility of the two optimization schemes is shown; the overlap of MPI and GPU computing optimization performs better, at 1.7 times the basic multi-GPU model, when 21 GPUs are used.

  16. gPGA: GPU Accelerated Population Genetics Analyses.

    Chunbao Zhou

    The isolation with migration (IM) model is important for studies in population genetics and phylogeography. The IM program applies the IM model to genetic data drawn from a pair of closely related populations or species based on Markov chain Monte Carlo (MCMC) simulations of gene genealogies. But the computational burden of the IM program has placed limits on its application. With strong computational power, the Graphics Processing Unit (GPU) has been widely used in many fields. In this article, we present an effective implementation of the IM program on one GPU based on Compute Unified Device Architecture (CUDA), which we call gPGA. Compared with the IM program, gPGA can achieve up to 52.30X speedup on one GPU. The evaluation results demonstrate that it allows datasets to be analyzed effectively and rapidly for research on divergence population genetics. The software is freely available with source code at https://github.com/chunbaozhou/gPGA.

  17. The Future of Software Engineering for High Performance Computing

    Pope, G [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)

    2015-07-16

    DOE ASCR requested that from May through mid-July 2015 a study group identify issues and recommend solutions from a software engineering perspective transitioning into the next generation of High Performance Computing. The approach used was to ask some of the DOE complex experts who will be responsible for doing this work to contribute to the study group. The technique used was to solicit elevator speeches: a short and concise write up done as if the author was a speaker with only a few minutes to convince a decision maker of their top issues. Pages 2-18 contain the original texts of the contributed elevator speeches and end notes identifying the 20 contributors. The study group also ranked the importance of each topic, and those scores are displayed with each topic heading. A perfect score (and highest priority) is three, two is medium priority, and one is lowest priority. The highest scoring topic areas were software engineering and testing resources; the lowest scoring area was compliance to DOE standards. The following two paragraphs are an elevator speech summarizing the contributed elevator speeches. Each sentence or phrase in the summary is hyperlinked to its source via a numeral embedded in the text. A risk one liner has also been added to each topic to allow future risk tracking and mitigation.

  18. Finite difference numerical method for the superlattice Boltzmann transport equation and case comparison of CPU(C) and GPU(CUDA) implementations

    Priimak, Dmitri

    2014-01-01

    We present a finite difference numerical algorithm for solving two dimensional spatially homogeneous Boltzmann transport equation which describes electron transport in a semiconductor superlattice subject to crossed time dependent electric and constant magnetic fields. The algorithm is implemented both in C language targeted to CPU and in CUDA C language targeted to commodity NVidia GPU. We compare performances and merits of one implementation versus another and discuss various software optimisation techniques

  19. Finite difference numerical method for the superlattice Boltzmann transport equation and case comparison of CPU(C) and GPU(CUDA) implementations

    Priimak, Dmitri

    2014-12-01

    We present a finite difference numerical algorithm for solving two dimensional spatially homogeneous Boltzmann transport equation which describes electron transport in a semiconductor superlattice subject to crossed time dependent electric and constant magnetic fields. The algorithm is implemented both in C language targeted to CPU and in CUDA C language targeted to commodity NVidia GPU. We compare performances and merits of one implementation versus another and discuss various software optimisation techniques.

  20. SU-E-T-29: A Web Application for GPU-Based Monte Carlo IMRT/VMAT QA with Delivered Dose Verification

    Folkerts, M; Graves, Y; Tian, Z; Gu, X; Jia, X; Jiang, S

    2014-01-01

    Purpose: To enable an existing web application for GPU-based Monte Carlo (MC) 3D dosimetry quality assurance (QA) to compute “delivered dose” from linac logfile data. Methods: We added significant features to an IMRT/VMAT QA web application which is based on existing technologies (HTML5, Python, and Django). This tool interfaces with Python, C code libraries, and command-line GPU applications to perform MC-based IMRT/VMAT QA. The web app automates many complicated aspects of interfacing clinical DICOM and logfile data with cutting-edge GPU software to run an MC dose calculation. The resultant web app is powerful, easy to use, and able to re-compute both plan dose (from DICOM data) and delivered dose (from logfile data). Both dynalog and trajectorylog file formats are supported. Users upload zipped DICOM RP, CT, and RD data and set the expected statistical uncertainty for the MC dose calculation. A 3D gamma index map, 3D dose distribution, gamma histogram, dosimetric statistics, and DVH curves are displayed to the user. Additionally, the user may upload the delivery logfile data from the linac to compute a 'delivered dose' calculation and corresponding gamma tests. A comprehensive PDF QA report summarizing the results can also be downloaded. Results: We successfully improved a web app for a GPU-based QA tool that consists of logfile parsing, fluence map generation, CT image processing, GPU-based MC dose calculation, gamma index calculation, and DVH calculation. The result is an IMRT and VMAT QA tool that conducts an independent dose calculation for a given treatment plan and delivery log file. The system takes both DICOM data and logfile data to compute plan dose and delivered dose respectively. Conclusion: We successfully improved a GPU-based MC QA tool to allow for logfile dose calculation. The high efficiency and accessibility will greatly facilitate IMRT and VMAT QA.

  1. SU-E-T-29: A Web Application for GPU-Based Monte Carlo IMRT/VMAT QA with Delivered Dose Verification

    Folkerts, M [The University of Texas Southwestern Medical Ctr, Dallas, TX (United States); University of California, San Diego, La Jolla, CA (United States); Graves, Y [University of California, San Diego, La Jolla, CA (United States); Tian, Z; Gu, X; Jia, X; Jiang, S [The University of Texas Southwestern Medical Ctr, Dallas, TX (United States)

    2014-06-01

    Purpose: To enable an existing web application for GPU-based Monte Carlo (MC) 3D dosimetry quality assurance (QA) to compute “delivered dose” from linac logfile data. Methods: We added significant features to an IMRT/VMAT QA web application which is based on existing technologies (HTML5, Python, and Django). This tool interfaces with Python, C code libraries, and command-line GPU applications to perform MC-based IMRT/VMAT QA. The web app automates many complicated aspects of interfacing clinical DICOM and logfile data with cutting-edge GPU software to run an MC dose calculation. The resultant web app is powerful, easy to use, and able to re-compute both plan dose (from DICOM data) and delivered dose (from logfile data). Both dynalog and trajectorylog file formats are supported. Users upload zipped DICOM RP, CT, and RD data and set the expected statistical uncertainty for the MC dose calculation. A 3D gamma index map, 3D dose distribution, gamma histogram, dosimetric statistics, and DVH curves are displayed to the user. Additionally, the user may upload the delivery logfile data from the linac to compute a 'delivered dose' calculation and corresponding gamma tests. A comprehensive PDF QA report summarizing the results can also be downloaded. Results: We successfully improved a web app for a GPU-based QA tool that consists of logfile parsing, fluence map generation, CT image processing, GPU-based MC dose calculation, gamma index calculation, and DVH calculation. The result is an IMRT and VMAT QA tool that conducts an independent dose calculation for a given treatment plan and delivery log file. The system takes both DICOM data and logfile data to compute plan dose and delivered dose respectively. Conclusion: We successfully improved a GPU-based MC QA tool to allow for logfile dose calculation. The high efficiency and accessibility will greatly facilitate IMRT and VMAT QA.

  2. Software on the Peregrine System | High-Performance Computing | NREL

    NREL maintains a variety of applications and environment modules for use on Peregrine. Applications: view the list of software applications by name and research area/discipline. Libraries: view the list of software libraries available for linking and loading

  3. Performing Verification and Validation in Reuse-Based Software Engineering

    Addy, Edward A.

    1999-01-01

    The implementation of reuse-based software engineering not only introduces new activities to the software development process, such as domain analysis and domain modeling, it also impacts other aspects of software engineering. Other areas of software engineering that are affected include Configuration Management, Testing, Quality Control, and Verification and Validation (V&V). Activities in each of these areas must be adapted to address the entire domain or product line rather than a specific application system. This paper discusses changes and enhancements to the V&V process, in order to adapt V&V to reuse-based software engineering.

  4. Accelerated finite element elastodynamic simulations using the GPU

    Huthwaite, Peter, E-mail: p.huthwaite@imperial.ac.uk

    2014-01-15

    An approach is developed to perform explicit time domain finite element simulations of elastodynamic problems on the graphical processing unit, using Nvidia's CUDA. Of critical importance for this problem is the arrangement of nodes in memory, allowing data to be loaded efficiently and minimising communication between the independently executed blocks of threads. The initial stage of memory arrangement is partitioning the mesh; both a well established ‘greedy’ partitioner and a new, more efficient ‘aligned’ partitioner are investigated. A method is then developed to efficiently arrange the memory within each partition. The software is applied to three models from the fields of non-destructive testing, vibrations and geophysics, demonstrating a memory bandwidth of very close to the card's maximum, reflecting the bandwidth-limited nature of the algorithm. Comparison with Abaqus, a widely used commercial CPU equivalent, validated the accuracy of the results and demonstrated a speed improvement of around two orders of magnitude. A software package, Pogo, incorporating these developments, is released open source, downloadable from (http://www.pogo-fea.com/) to benefit the community. -- Highlights: •A novel memory arrangement approach is discussed for finite elements on the GPU. •The mesh is partitioned then nodes are arranged efficiently within each partition. •Models from ultrasonics, vibrations and geophysics are run. •The code is significantly faster than an equivalent commercial CPU package. •Pogo, the new software package, is released open source.

  5. Accelerated finite element elastodynamic simulations using the GPU

    Huthwaite, Peter

    2014-01-01

    An approach is developed to perform explicit time domain finite element simulations of elastodynamic problems on the graphical processing unit, using Nvidia's CUDA. Of critical importance for this problem is the arrangement of nodes in memory, allowing data to be loaded efficiently and minimising communication between the independently executed blocks of threads. The initial stage of memory arrangement is partitioning the mesh; both a well established ‘greedy’ partitioner and a new, more efficient ‘aligned’ partitioner are investigated. A method is then developed to efficiently arrange the memory within each partition. The software is applied to three models from the fields of non-destructive testing, vibrations and geophysics, demonstrating a memory bandwidth of very close to the card's maximum, reflecting the bandwidth-limited nature of the algorithm. Comparison with Abaqus, a widely used commercial CPU equivalent, validated the accuracy of the results and demonstrated a speed improvement of around two orders of magnitude. A software package, Pogo, incorporating these developments, is released open source, downloadable from (http://www.pogo-fea.com/) to benefit the community. -- Highlights: •A novel memory arrangement approach is discussed for finite elements on the GPU. •The mesh is partitioned then nodes are arranged efficiently within each partition. •Models from ultrasonics, vibrations and geophysics are run. •The code is significantly faster than an equivalent commercial CPU package. •Pogo, the new software package, is released open source

  6. Vulnerable GPU Memory Management: Towards Recovering Raw Data from GPU

    Zhou Zhe

    2017-04-01

    Full Text Available According to previous reports, information could be leaked from GPU memory; however, the security implications of such a threat were mostly over-looked, because only limited information could be indirectly extracted through side-channel attacks. In this paper, we propose a novel algorithm for recovering raw data directly from the GPU memory residues of many popular applications such as Google Chrome and Adobe PDF reader. Our algorithm enables harvesting highly sensitive information including credit card numbers and email contents from GPU memory residues. Evaluation results also indicate that nearly all GPU-accelerated applications are vulnerable to such attacks, and adversaries can launch attacks without requiring any special privileges both on traditional multi-user operating systems, and emerging cloud computing scenarios.

  7. GPU Implementation of High Rayleigh Number Three-Dimensional Mantle Convection

    Sanchez, D. A.; Yuen, D. A.; Wright, G. B.; Barnett, G. A.

    2010-12-01

    Although we have entered the age of petascale computing, many factors are still prohibiting high-performance computing (HPC) from infiltrating all suitable scientific disciplines. For this reason and others, application of GPU to HPC is gaining traction in the scientific world. With its low price point, high performance potential, and competitive scalability, GPU has been an option well worth considering for the last few years. Moreover with the advent of NVIDIA's Fermi architecture, which brings ECC memory, better double-precision performance, and more RAM to GPU, there is a strong message of corporate support for GPU in HPC. However many doubts linger concerning the practicality of using GPU for scientific computing. In particular, GPU has a reputation for being difficult to program and suitable for only a small subset of problems. Although inroads have been made in addressing these concerns, for many scientists GPU still has hurdles to clear before becoming an acceptable choice. We explore the applicability of GPU to geophysics by implementing a three-dimensional, second-order finite-difference model of Rayleigh-Benard thermal convection on an NVIDIA GPU using C for CUDA. Our code reaches sufficient resolution, on the order of 500x500x250 evenly-spaced finite-difference gridpoints, on a single GPU. We make extensive use of highly optimized CUBLAS routines, allowing us to achieve performance on the order of O( 0.1 ) µs per timestep*gridpoint at this resolution. This performance has allowed us to study high Rayleigh number simulations, on the order of 2x10^7, on a single GPU.
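
    The quoted cost figure can be turned into a rough wall-clock estimate. The sketch below only multiplies the numbers given in the abstract (a 500x500x250 grid and roughly 0.1 µs per timestep*gridpoint); because the abstract gives that cost as an order of magnitude, the result is indicative only. The function and constant names are illustrative and do not come from the authors' code.

```python
# Rough cost estimate from the figures quoted in the abstract.
NX, NY, NZ = 500, 500, 250       # grid resolution quoted in the abstract
COST_US_PER_POINT_STEP = 0.1     # microseconds per timestep*gridpoint (order of magnitude)


def seconds_per_timestep(nx, ny, nz, cost_us):
    """Wall-clock seconds for one timestep over the whole grid."""
    gridpoints = nx * ny * nz
    return gridpoints * cost_us * 1e-6   # convert microseconds to seconds


gridpoints = NX * NY * NZ
step_seconds = seconds_per_timestep(NX, NY, NZ, COST_US_PER_POINT_STEP)
print(f"{gridpoints:,} gridpoints, about {step_seconds:.2f} s per timestep")
```

    Under these assumptions one timestep over the full grid takes on the order of a few seconds.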

  8. cellGPU: Massively parallel simulations of dynamic vertex models

    Sussman, Daniel M.

    2017-10-01

    Vertex models represent confluent tissue by polygonal or polyhedral tilings of space, with the individual cells interacting via force laws that depend on both the geometry of the cells and the topology of the tessellation. This dependence on the connectivity of the cellular network introduces several complications to performing molecular-dynamics-like simulations of vertex models, and in particular makes parallelizing the simulations difficult. cellGPU addresses this difficulty and lays the foundation for massively parallelized, GPU-based simulations of these models. This article discusses its implementation for a pair of two-dimensional models, and compares the typical performance that can be expected between running cellGPU entirely on the CPU versus its performance when running on a range of commercial and server-grade graphics cards. By implementing the calculation of topological changes and forces on cells in a highly parallelizable fashion, cellGPU enables researchers to simulate time- and length-scales previously inaccessible via existing single-threaded CPU implementations. Program Files doi:http://dx.doi.org/10.17632/6j2cj29t3r.1 Licensing provisions: MIT Programming language: CUDA/C++ Nature of problem: Simulations of off-lattice "vertex models" of cells, in which the interaction forces depend on both the geometry and the topology of the cellular aggregate. Solution method: Highly parallelized GPU-accelerated dynamical simulations in which the force calculations and the topological features can be handled on either the CPU or GPU. Additional comments: The code is hosted at https://gitlab.com/dmsussman/cellGPU, with documentation additionally maintained at http://dmsussman.gitlab.io/cellGPUdocumentation

  9. Automated Improvement of Software Architecture Models for Performance and Other Quality Attributes

    Koziolek, Anne

    2013-01-01

    Quality attributes, such as performance or reliability, are crucial for the success of a software system and largely influenced by the software architecture. Their quantitative prediction supports systematic, goal-oriented software design and forms a base of an engineering approach to software design. This thesis proposes a method and tool to automatically improve component-based software architecture (CBA) models based on such quantitative quality prediction techniques.

  10. Development of parallel GPU based algorithms for problems in nuclear area; Desenvolvimento de algoritmos paralelos baseados em GPU para solucao de problemas na area nuclear

    Almeida, Adino Americo Heimlich

    2009-07-01

    Graphics Processing Units (GPUs) are high-performance co-processors originally intended to improve the use and quality of computer graphics applications. Since researchers and practitioners realized the potential of using GPUs for general-purpose computing, their application has been extended to fields beyond computer graphics. The main objective of this work is to evaluate the impact of using GPUs on two typical problems in the nuclear area: neutron transport simulation using the Monte Carlo method, and solution of the heat equation in a two-dimensional domain by the finite difference method. To achieve this, we developed parallel algorithms for the GPU and the CPU for both problems. The comparison showed that the GPU-based approach is faster than the CPU-based one on a computer with two quad-core processors, without loss of precision. (author)
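
    As an illustration of the second problem, here is a minimal sketch of one explicit finite-difference update of the two-dimensional heat equation. The abstract does not say which scheme the author used; the forward-time, centred-space (FTCS) update, the fixed boundary values and the name `heat_step` are assumptions made for this example. Each interior point depends only on values from the previous step, which is why a computation of this shape can be assigned one grid point per GPU thread.

```python
def heat_step(u, r):
    """One explicit (FTCS) update of the 2D heat equation.

    u is a list of rows; r = alpha * dt / dx**2 with equal spacing in x and y.
    Boundary values are held fixed. The scheme is stable for r <= 0.25.
    """
    rows, cols = len(u), len(u[0])
    new = [row[:] for row in u]          # copy so every update reads old values
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            laplacian = (u[i - 1][j] + u[i + 1][j]
                         + u[i][j - 1] + u[i][j + 1]
                         - 4.0 * u[i][j])
            new[i][j] = u[i][j] + r * laplacian
    return new


# A single hot cell in the middle of a cold 5x5 plate.
grid = [[0.0] * 5 for _ in range(5)]
grid[2][2] = 1.0
after = heat_step(grid, 0.25)
```

    With r = 0.25 the hot cell spreads evenly to its four neighbours in one step.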

  11. ALICE HLT high speed tracking on GPU

    Gorbunov, Sergey; Aamodt, Kenneth; Alt, Torsten; Appelshauser, Harald; Arend, Andreas; Bach, Matthias; Becker, Bruce; Bottger, Stefan; Breitner, Timo; Busching, Henner; Chattopadhyay, Sukalyan; Cleymans, Jean; Cicalo, Corrado; Das, Indranil; Djuvsland, Oystein; Engel, Heiko; Erdal, Hege Austrheim; Fearick, Roger; Haaland, Oystein Senneset; Hille, Per Thomas; Kalcher, Sebastian; Kanaki, Kalliopi; Kebschull, Udo Wolfgang; Kisel, Ivan; Kretz, Matthias; Lara, Camillo; Lindal, Sven; Lindenstruth, Volker; Masoodi, Arshad Ahmad; Ovrebekk, Gaute; Panse, Ralf; Peschek, Jorg; Ploskon, Mateusz; Pocheptsov, Timur; Ram, Dinesh; Rascanu, Theodor; Richter, Matthias; Rohrich, Dieter; Ronchetti, Federico; Skaali, Bernhard; Smorholm, Olav; Stokkevag, Camilla; Steinbeck, Timm Morten; Szostak, Artur; Thader, Jochen; Tveter, Trine; Ullaland, Kjetil; Vilakazi, Zeblon; Weis, Robert; Yin, Zhong-Bao; Zelnicek, Pierre

    2011-01-01

    The on-line event reconstruction in ALICE is performed by the High Level Trigger, which should process up to 2000 events per second in proton-proton collisions and up to 300 central events per second in heavy-ion collisions, corresponding to an input data stream of 30 GB/s. In order to fulfill the time requirements, a fast on-line tracker has been developed. The algorithm combines a Cellular Automaton method being used for a fast pattern recognition and the Kalman Filter method for fitting of found trajectories and for the final track selection. The tracker was adapted to run on Graphics Processing Units (GPU) using the NVIDIA Compute Unified Device Architecture (CUDA) framework. The implementation of the algorithm had to be adjusted at many points to allow for an efficient usage of the graphics cards. In particular, achieving a good overall workload for many processor cores, efficient transfer to and from the GPU, as well as optimized utilization of the different memories the GPU offers turned out to be cri...

  12. Advantages of GPU technology in DFT calculations of intercalated graphene

    Pešić, J.; Gajić, R.

    2014-09-01

    Over the past few years, the expansion of general-purpose graphic-processing unit (GPGPU) technology has had a great impact on computational science. GPGPU is the utilization of a graphics-processing unit (GPU) to perform calculations in applications usually handled by the central processing unit (CPU). Use of GPGPUs as a way to increase computational power in the material sciences has significantly decreased computational costs in already highly demanding calculations. The level of acceleration and parallelization depends on the problem itself. Some problems can benefit from GPU acceleration and parallelization, such as the finite-difference time-domain algorithm (FDTD) and density-functional theory (DFT), while others cannot take advantage of these modern technologies. A number of GPU-supported applications have emerged in the past several years (www.nvidia.com/object/gpu-applications.html). Quantum Espresso (QE) is reported as an integrated suite of open source computer codes for electronic-structure calculations and materials modeling at the nano-scale. It is based on DFT, the use of a plane-waves basis and a pseudopotential approach. Since the QE 5.0 version, it has been implemented as a plug-in component for standard QE packages that allows exploiting the capabilities of Nvidia GPU graphic cards (www.qe-forge.org/gf/proj). In this study, we have examined the impact of the usage of GPU acceleration and parallelization on the numerical performance of DFT calculations. Graphene has been attracting attention worldwide and has already shown some remarkable properties. We have studied intercalated graphene, using the QE package PHonon, which employs GPU. The term ‘intercalation’ refers to a process whereby foreign adatoms are inserted onto a graphene lattice. In addition, by intercalating different atoms between graphene layers, it is possible to tune their physical properties. Our experiments have shown there are benefits from using GPUs, and we reached an

  13. Advantages of GPU technology in DFT calculations of intercalated graphene

    Pešić, J; Gajić, R

    2014-01-01

    Over the past few years, the expansion of general-purpose graphic-processing unit (GPGPU) technology has had a great impact on computational science. GPGPU is the utilization of a graphics-processing unit (GPU) to perform calculations in applications usually handled by the central processing unit (CPU). Use of GPGPUs as a way to increase computational power in the material sciences has significantly decreased computational costs in already highly demanding calculations. The level of acceleration and parallelization depends on the problem itself. Some problems can benefit from GPU acceleration and parallelization, such as the finite-difference time-domain algorithm (FDTD) and density-functional theory (DFT), while others cannot take advantage of these modern technologies. A number of GPU-supported applications have emerged in the past several years (www.nvidia.com/object/gpu-applications.html). Quantum Espresso (QE) is reported as an integrated suite of open source computer codes for electronic-structure calculations and materials modeling at the nano-scale. It is based on DFT, the use of a plane-waves basis and a pseudopotential approach. Since the QE 5.0 version, it has been implemented as a plug-in component for standard QE packages that allows exploiting the capabilities of Nvidia GPU graphic cards (www.qe-forge.org/gf/proj). In this study, we have examined the impact of the usage of GPU acceleration and parallelization on the numerical performance of DFT calculations. Graphene has been attracting attention worldwide and has already shown some remarkable properties. We have studied intercalated graphene, using the QE package PHonon, which employs GPU. The term ‘intercalation’ refers to a process whereby foreign adatoms are inserted onto a graphene lattice. In addition, by intercalating different atoms between graphene layers, it is possible to tune their physical properties. Our experiments have shown there are benefits from using GPUs, and we reached an

  14. LDPC Decoding on GPU for Mobile Device

    Yiqin Lu

    2016-01-01

    Full Text Available A flexible software LDPC decoder that exploits data parallelism to decode multiple codewords simultaneously on a mobile device is proposed in this paper, supported by multithreading on OpenCL-based graphics processing units. By dividing the check matrix into several parts to make full use of both the local memory and private memory on the GPU, and by properly modifying the code capacity each time, our implementation on a mobile phone achieves throughputs above 100 Mbps with a decoding delay of less than 1.6 milliseconds, which makes high-speed communication such as video calling possible. To realize efficient software LDPC decoding on the mobile device, the LDPC decoding feature on the communication baseband chip should be replaced, to save cost and make it easier to upgrade the decoder to be compatible with a variety of channel access schemes.

  15. Using the CPU and GPU for real-time video enhancement on a mobile computer

    Bachoo, AK

    2010-09-01

    Full Text Available In this paper, the current advances in mobile CPU and GPU hardware are used to implement video enhancement algorithms in a new way on a mobile computer. Both the CPU and GPU are used effectively to achieve real-time performance for complex image enhancement...

  16. A multi-GPU implementation of a D2Q37 lattice Boltzmann code

    Biferale, L.; Mantovani, F.; Pivanti, M.; Pozzati, F.; Sbragaglia, M.; Scagliarini, Andrea; Schifano, S.F.; Toschi, F.; Tripiccione, R.; Wyrzykowski, R.; Dongarra, J.; Karczewski, K.; Wasniewski, J.

    2012-01-01

    We describe a parallel implementation of a compressible Lattice Boltzmann code on a multi-GPU cluster based on Nvidia Fermi processors. We analyze how to optimize the algorithm for GP-GPU architectures, describe the implementation choices that we have adopted and compare our performance results with

  17. Engineering bioinformatics: building reliability, performance and productivity into bioinformatics software.

    Lawlor, Brendan; Walsh, Paul

    2015-01-01

    There is a lack of software engineering skills in bioinformatic contexts. We discuss the consequences of this lack, examine existing explanations and remedies to the problem, point out their shortcomings, and propose alternatives. Previous analyses of the problem have tended to treat the use of software in scientific contexts as categorically different from the general application of software engineering in commercial settings. In contrast, we describe bioinformatic software engineering as a specialization of general software engineering, and examine how it should be practiced. Specifically, we highlight the difference between programming and software engineering, list elements of the latter and present the results of a survey of bioinformatic practitioners which quantifies the extent to which those elements are employed in bioinformatics. We propose that the ideal way to bring engineering values into research projects is to bring engineers themselves. We identify the role of Bioinformatic Engineer and describe how such a role would work within bioinformatic research teams. We conclude by recommending an educational emphasis on cross-training software engineers into life sciences, and propose research on Domain Specific Languages to facilitate collaboration between engineers and bioinformaticians.

  18. Engineering bioinformatics: building reliability, performance and productivity into bioinformatics software

    Lawlor, Brendan; Walsh, Paul

    2015-01-01

    There is a lack of software engineering skills in bioinformatic contexts. We discuss the consequences of this lack, examine existing explanations and remedies to the problem, point out their shortcomings, and propose alternatives. Previous analyses of the problem have tended to treat the use of software in scientific contexts as categorically different from the general application of software engineering in commercial settings. In contrast, we describe bioinformatic software engineering as a specialization of general software engineering, and examine how it should be practiced. Specifically, we highlight the difference between programming and software engineering, list elements of the latter and present the results of a survey of bioinformatic practitioners which quantifies the extent to which those elements are employed in bioinformatics. We propose that the ideal way to bring engineering values into research projects is to bring engineers themselves. We identify the role of Bioinformatic Engineer and describe how such a role would work within bioinformatic research teams. We conclude by recommending an educational emphasis on cross-training software engineers into life sciences, and propose research on Domain Specific Languages to facilitate collaboration between engineers and bioinformaticians. PMID:25996054

  19. A GPU-based calculation using the three-dimensional FDTD method for electromagnetic field analysis.

    Nagaoka, Tomoaki; Watanabe, Soichi

    2010-01-01

    Numerical simulations with the numerical human model using the finite-difference time domain (FDTD) method have recently been performed frequently in a number of fields in biomedical engineering. However, the FDTD calculation runs too slowly. We focus, therefore, on general purpose programming on the graphics processing unit (GPGPU). The three-dimensional FDTD method was implemented on the GPU using Compute Unified Device Architecture (CUDA). In this study, we used the NVIDIA Tesla C1060 as a GPGPU board. The performance of the GPU is evaluated in comparison with the performance of a conventional CPU and a vector supercomputer. The results indicate that three-dimensional FDTD calculations using a GPU can significantly reduce run time in comparison with that using a conventional CPU, even a native GPU implementation of the three-dimensional FDTD method, while the GPU/CPU speed ratio varies with the calculation domain and thread block size.
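
    For readers unfamiliar with the method, the core of FDTD is a pair of leapfrog updates for the electric and magnetic fields on staggered grids. The sketch below is a one-dimensional version in normalised units, not the three-dimensional CUDA implementation described in the abstract. The Courant factor of 0.5, the Gaussian hard source and the name `fdtd_1d` are assumptions chosen for illustration.

```python
import math


def fdtd_1d(nz=201, steps=100, t0=40.0, spread=12.0):
    """1D FDTD in normalised units with a Gaussian hard source at the centre.

    ex and hy live on staggered grids; the factor 0.5 is the Courant number,
    so a pulse travels half a cell per step. ex[0] and hy[nz - 1] are never
    updated and stay at zero, acting as simple reflecting walls.
    """
    ex = [0.0] * nz
    hy = [0.0] * nz
    kc = nz // 2
    for n in range(1, steps + 1):
        # Update the electric field from the spatial difference of hy.
        for k in range(1, nz):
            ex[k] += 0.5 * (hy[k - 1] - hy[k])
        # Hard source: overwrite the centre cell with a Gaussian pulse.
        ex[kc] = math.exp(-0.5 * ((t0 - n) / spread) ** 2)
        # Update the magnetic field from the spatial difference of ex.
        for k in range(nz - 1):
            hy[k] += 0.5 * (ex[k] - ex[k + 1])
    return ex, hy


ex, hy = fdtd_1d()
peak = max(abs(v) for v in ex)
print(f"peak |Ex| after 100 steps: {peak:.3f}")
```

    Within each of the two inner loops every cell is updated independently of the others, which is the property a GPU implementation exploits by giving each cell its own thread.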

  20. Incompressible SPH (ISPH) with fast Poisson solver on a GPU

    Chow, Alex D.; Rogers, Benedict D.; Lind, Steven J.; Stansby, Peter K.

    2018-05-01

    This paper presents a fast incompressible SPH (ISPH) solver implemented to run entirely on a graphics processing unit (GPU) capable of simulating several millions of particles in three dimensions on a single GPU. The ISPH algorithm is implemented by converting the highly optimised open-source weakly-compressible SPH (WCSPH) code DualSPHysics to run ISPH on the GPU, combining it with the open-source linear algebra library ViennaCL for fast solutions of the pressure Poisson equation (PPE). Several challenges are addressed with this research: constructing a PPE matrix every timestep on the GPU for moving particles, optimising the limited GPU memory, and exploiting fast matrix solvers. The ISPH pressure projection algorithm is implemented as 4 separate stages, each with a particle sweep, including an algorithm for the population of the PPE matrix suitable for the GPU, and mixed precision storage methods. An accurate and robust ISPH boundary condition ideal for parallel processing is also established by adapting an existing WCSPH boundary condition for ISPH. A variety of validation cases are presented: an impulsively started plate, incompressible flow around a moving square in a box, and dambreaks (2-D and 3-D) which demonstrate the accuracy, flexibility, and speed of the methodology. Fragmentation of the free surface is shown to influence the performance of matrix preconditioners and therefore the PPE matrix solution time. The Jacobi preconditioner demonstrates robustness and reliability in the presence of fragmented flows. For a dambreak simulation, the GPU achieves speed-ups of 10-18 times and 1.1-4.5 times compared to single-threaded and 16-threaded CPU run times, respectively.

  1. A method of gravity and seismic sequential inversion and its GPU implementation

    Liu, G.; Meng, X.

    2011-12-01

    In this abstract, we introduce a sequential gravity and seismic inversion method that inverts for density and velocity together. For the gravity inversion, we use an iterative method based on a correlation imaging algorithm; for the seismic inversion, we use full waveform inversion. The link between density and velocity is an empirical formula, the Gardner equation. For large volumes of data, we use the GPU to accelerate the computation. The gravity inversion is iterative. First, we calculate the correlation imaging of the observed gravity anomaly, which takes values between -1 and +1, and multiply it by a small density to obtain the initial density model. We then compute the forward response of this initial model, calculate the correlation imaging of the misfit between the observed and forward data, multiply that result by a small density, and add it to the model. Repeating this procedure yields the inverted density model. For the seismic inversion, we use a method based on the linearity of the acoustic wave equation written in the frequency domain; given an initial velocity model, we can obtain a good velocity result. In the sequential inversion of gravity and seismic data, a link formula is needed to convert between density and velocity; in our method, we use the Gardner equation. Driven by the insatiable market demand for real-time, high-definition 3D images, the programmable NVIDIA Graphics Processing Unit (GPU) has been developed as a co-processor of the CPU for high-performance computing. Compute Unified Device Architecture (CUDA) is a parallel programming model and software environment provided by NVIDIA, designed to overcome the challenges of traditional general-purpose GPU programming while maintaining a low learning curve for programmers familiar with standard programming languages such as C. In our inversion processing
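
    The density-velocity link mentioned in the abstract can be sketched as follows. The abstract does not give the coefficients it used; the values below (0.31 and 0.25, with velocity in m/s and density in g/cm^3) are the commonly quoted form of the Gardner relation and are an assumption here, as are the function names.

```python
# Commonly quoted Gardner coefficients (assumed, not taken from the abstract):
# density in g/cm^3, P-wave velocity in m/s.
GARDNER_A = 0.31
GARDNER_B = 0.25


def density_from_velocity(v_mps, a=GARDNER_A, b=GARDNER_B):
    """Gardner relation: rho = a * V**b."""
    return a * v_mps ** b


def velocity_from_density(rho_gcc, a=GARDNER_A, b=GARDNER_B):
    """Inverse Gardner relation: V = (rho / a)**(1 / b)."""
    return (rho_gcc / a) ** (1.0 / b)


rho = density_from_velocity(2000.0)      # density implied by a 2000 m/s layer
v_back = velocity_from_density(rho)      # converting back recovers the velocity
print(f"rho = {rho:.3f} g/cm^3, round-trip velocity = {v_back:.1f} m/s")
```

    In a sequential scheme of the kind described, a conversion like this lets the result of one inversion serve as the starting model for the other.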

  2. Strategies for regular segmented reductions on GPU

    Larsen, Rasmus Wriedt; Henriksen, Troels

    2017-01-01

    We present and evaluate an implementation technique for regular segmented reductions on GPUs. Existing techniques tend to be either consistent in performance but relatively inefficient in absolute terms, or optimised for specific workloads and thereby exhibiting bad performance for certain input...... is in the context of the Futhark compiler, the implementation technique is applicable to any library or language that has a need for segmented reductions. We evaluate the technique on four microbenchmarks, two of which we also compare to implementations in the CUB library for GPU programming, as well as on two...
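
    For context, a regular segmented reduction takes a flat array divided into segments of equal length and reduces each segment independently with an associative operator. The sketch below is a sequential reference version that only shows the semantics; it is not the GPU technique from the paper, and the name `segmented_reduce` is illustrative.

```python
from functools import reduce
import operator


def segmented_reduce(values, segment_size, op, neutral):
    """Reduce each fixed-size segment of a flat list with an associative op.

    "Regular" means that all segments have the same length, so segment i
    covers values[i * segment_size : (i + 1) * segment_size].
    """
    if segment_size <= 0 or len(values) % segment_size != 0:
        raise ValueError("length must be a positive multiple of segment_size")
    return [
        reduce(op, values[start:start + segment_size], neutral)
        for start in range(0, len(values), segment_size)
    ]


sums = segmented_reduce([1, 2, 3, 4, 5, 6], 3, operator.add, 0)
maxima = segmented_reduce([1, 7, 3, 4, 5, 2], 2, max, float("-inf"))
print(sums, maxima)
```

    The same flat array can hold many short segments or a few long ones, which is why performance that is consistent across segment sizes is hard to achieve on a GPU.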

  3. [Software for performing a global phenotypic and genotypic nutritional assessment].

    García de Diego, L; Cuervo, M; Martínez, J A

    2013-01-01

    The nutritional assessment of a patient requires simultaneously managing extensive information and a great number of databases, as both the nutritional process and the clinical situation of the patient are analyzed. The introduction of computers in the nutritional area constitutes an extraordinary advance in the administration of nutrition information, providing a complete assessment of nutritional aspects in a quick and easy way. The objective was to develop a computer program that can be used as a tool for assessing the nutritional status of the patient, for the education of clinical staff, for epidemiological studies and for educational purposes. The method is based on a computer program that assists the health specialist in performing a full nutritional evaluation of the patient through the registration and assessment of phenotypic and genotypic features. The application provides a nutritional prognosis based on anthropometric and biochemical parameters, images of states of malnutrition, questionnaires to characterize diseases, diagnostic criteria, identification of alleles associated with the development of specific metabolic illnesses and quality-of-life questionnaires, for personalized action. The program includes, as part of the nutritional assessment of the patient, food intake analysis, design of diets and promotion of physical activity, introducing food frequency questionnaires, dietary recalls, healthy eating indexes, model diets, fitness tests, and recommendations, recalls and questionnaires of physical activity. The program was developed in Java Swing, using an SQLite database and some external libraries such as JFreeChart for plotting graphs. This newly designed software is composed of five blocks categorized into ten modules named: Patients, Anthropometry, Clinical History, Biochemistry, Dietary History, Diagnostic (with genetic make up), Quality of life, Physical activity, Energy expenditure and Diets. Each module has a specific function which evaluates a

  4. Evaluation of speedup of Monte Carlo calculations of two simple reactor physics problems coded for the GPU/CUDA environment

    Ding, Aiping; Liu, Tianyu; Liang, Chao; Ji, Wei; Shephard, Mark S.; Xu, X George; Brown, Forrest B.

    2011-01-01

    Monte Carlo simulation is ideally suited for solving Boltzmann neutron transport equation in inhomogeneous media. However, routine applications require the computation time to be reduced to hours and even minutes in a desktop system. The interest in adopting GPUs for Monte Carlo acceleration is rapidly mounting, fueled partially by the parallelism afforded by the latest GPU technologies and the challenge to perform full-size reactor core analysis on a routine basis. In this study, Monte Carlo codes for a fixed-source neutron transport problem and an eigenvalue/criticality problem were developed for CPU and GPU environments, respectively, to evaluate issues associated with computational speedup afforded by the use of GPUs. The results suggest that a speedup factor of 30 in Monte Carlo radiation transport of neutrons is within reach using the state-of-the-art GPU technologies. However, for the eigenvalue/criticality problem, the speedup was 8.5. In comparison, for a task of voxelizing unstructured mesh geometry that is more parallel in nature, the speedup of 45 was obtained. It was observed that, to date, most attempts to adopt GPUs for Monte Carlo acceleration were based on naïve implementations and have not yielded the level of anticipated gains. Successful implementation of Monte Carlo schemes for GPUs will likely require the development of an entirely new code. Given the prediction that future-generation GPU products will likely bring exponentially improved computing power and performances, innovative hardware and software solutions may make it possible to achieve full-core Monte Carlo calculation within one hour using a desktop computer system in a few years. (author)

  5. Accelerating large-scale phase-field simulations with GPU

    Xiaoming Shi

    2017-10-01

    A new GPU-based package for accelerating large-scale phase-field simulations was developed using the semi-implicit Fourier method. The package can solve a variety of equilibrium equations with different inhomogeneity, including long-range elastic, magnetostatic, and electrostatic interactions. Using a specific algorithm in Compute Unified Device Architecture (CUDA), the Fourier spectral iterative perturbation method was integrated into the GPU package. The Allen-Cahn equation, the Cahn-Hilliard equation, and a phase-field model with long-range interaction were each solved with the algorithm running on the GPU to test the performance of the package. Comparing the results of the solver executed on a single CPU with those of the solver on the GPU showed that the GPU version is about 50 times faster. The present study therefore contributes to the acceleration of large-scale phase-field simulations and provides guidance for experiments to design large-scale functional devices.

  6. GPU-Accelerated Text Mining

    Cui, X.; Mueller, F.; Zhang, Y.; Potok, Thomas E.

    2009-01-01

    Accelerating hardware devices represent a novel promise for improving performance in many problem domains, but it is not clear which accelerators are suitable for which domains. Because there is no room in general-purpose processor design to significantly increase the processor frequency, developers are instead resorting to multi-core chips that duplicate conventional computing capabilities on a single die. Accelerators, by contrast, offer more radical designs with a much higher level of parallelism and novel programming environments. The present work assesses the viability of text mining on CUDA. Text mining is one of the key concepts that has become prominent as an effective means to index the Internet, but its applications range beyond this scope and extend to providing document similarity metrics, the subject of this work. We have developed and optimized text search algorithms for GPUs to exploit their potential for massive data processing. We discuss the algorithmic challenges of parallelization for text search problems on GPUs and demonstrate the potential of these devices in experiments by reporting significant speedups. Our study may be one of the first to assess more complex text search problems for suitability for GPU devices, and it may also be one of the first to exploit and report on the usage of atomic instructions that have recently become available in NVIDIA devices.

  7. Technical Performance Assessment: Mission Success in Software Acquisition Management

    2010-04-27

    Examples Design constraints make software acquisition and development extremely critical Application domain – Operational Flight Program, Air...environment – used to produce the software Risk management – established and maintained risk management systems Milestone reviews...

  8. Improving Performance of Software Implemented Floating Point Addition

    Hindborg, Andreas Erik; Karlsson, Sven

    2011-01-01

    We outline and evaluate hardware extensions to an integer processor pipeline which allow IEEE 754 floating point (FP) addition to be efficiently implemented in software. With a very moderate increase in hardware resources, our performance evaluation shows that, for a benchmark that executes 12.5% FP...... addition instructions, our approach exhibits a relative slowdown of 3.38 to 15.15 as compared to dedicated hardware. This is a significant improvement over pure software emulation, which leads to relative slowdowns of up to 45.33....

  9. Implementation and Optimization of GPU-Based Static State Security Analysis in Power Systems

    Yong Chen

    2017-01-01

    Static state security analysis (SSSA) is one of the most important computations for checking whether a power system is in a normal and secure operating state. Satisfying real-time requirements with CPU-based concurrent methods is a challenge because of the intensive computations involved. A sensitivity analysis-based method using a graphics processing unit (GPU) is proposed for power systems, which can reduce calculation time by 40% compared to execution on a 4-core CPU. The proposed method involves load flow analysis and sensitivity analysis. In load flow analysis, a multifrontal method for sparse LU factorization is explored on the GPU through dynamic frontal task scheduling between CPU and GPU. The varying matrix operations during sensitivity analysis on the GPU are highly optimized in this study. The results of performance evaluations show that the proposed GPU-based SSSA with optimized matrix operations can achieve a significant reduction in computation time.

  10. Development of parallel GPU based algorithms for problems in nuclear area

    Almeida, Adino Americo Heimlich

    2009-01-01

    Graphics Processing Units (GPUs) are high performance co-processors originally intended to improve the use and quality of computer graphics applications. Since researchers and practitioners realized the potential of using GPUs for general-purpose computing, their application has been extended to fields beyond computer graphics. The main objective of this work is to evaluate the impact of using GPUs in two typical problems of the nuclear area: neutron transport simulation using the Monte Carlo method, and the solution of the heat equation in a two-dimensional domain by the finite difference method. To achieve this, we develop parallel algorithms for GPU and CPU for the two problems described above. The comparison showed that the GPU-based approach is faster than the CPU in a computer with two quad-core processors, without loss of precision. (author)
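
    The second problem, the heat equation on a two-dimensional grid solved by finite differences, can be sketched as follows. This is an illustrative explicit (forward-time, centred-space) step with fixed boundary values; the scheme, boundary treatment and names are assumptions, since the abstract does not give them.

```python
def heat_step(u, alpha, dt, h):
    """One explicit step of u_t = alpha * (u_xx + u_yy) on a uniform grid.

    u is a list of rows; boundary values are held fixed (Dirichlet).
    ASSUMPTION: explicit FTCS scheme, stable only for alpha*dt/h**2 <= 1/4.
    """
    r = alpha * dt / (h * h)
    if r > 0.25:
        raise ValueError("time step too large for the explicit scheme")
    rows, cols = len(u), len(u[0])
    new = [row[:] for row in u]  # copy so every update reads the old state
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = u[i][j] + r * (
                u[i + 1][j] + u[i - 1][j] + u[i][j + 1] + u[i][j - 1]
                - 4.0 * u[i][j]
            )
    return new


if __name__ == "__main__":
    grid = [[0.0] * 5 for _ in range(5)]
    grid[2][2] = 100.0  # hot spot in the centre
    for row in heat_step(grid, alpha=1.0, dt=0.1, h=1.0):
        print(row)
```

    Each interior point is updated from the previous state only, so the two inner loops have no dependencies between iterations. That independence is what makes a one-thread-per-grid-point GPU mapping natural for this kind of stencil.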

  11. Research on GPU acceleration for Monte Carlo criticality calculation

    Xu, Q.; Yu, G.; Wang, K.

    2013-01-01

    The Monte Carlo (MC) neutron transport method can be naturally parallelized on multi-core architectures due to the independence between particles during the simulation. The GPU+CPU heterogeneous parallel mode has become an increasingly popular form of parallelism in the field of scientific supercomputing. Thus, this work focuses on the GPU acceleration method for the Monte Carlo criticality simulation, as well as the computational efficiency that GPUs can bring. The 'neutron transport step' is introduced to increase the GPU thread occupancy. In order to test the sensitivity to the MC code's complexity, a 1D one-group code and a 3D multi-group general purpose code are each ported to GPUs, and the acceleration effects are compared. The numerical experiments show a considerable acceleration effect from the 'neutron transport step' strategy. However, the performance comparison between the 1D code and the 3D code indicates the poor scalability of MC codes on GPUs. (authors)

  12. Ultra-Fast Image Reconstruction of Tomosynthesis Mammography Using GPU.

    Arefan, D; Talebpour, A; Ahmadinejhad, N; Kamali Asl, A

    2015-06-01

    Digital Breast Tomosynthesis (DBT) is a technology that creates three-dimensional (3D) images of breast tissue. Tomosynthesis mammography detects lesions that are not detectable with other imaging systems. If the image reconstruction time is on the order of seconds, tomosynthesis systems can be used to perform tomosynthesis-guided interventional procedures. This research was designed to study an ultra-fast image reconstruction technique for tomosynthesis mammography systems using a Graphics Processing Unit (GPU). First, tomosynthesis mammography projections were simulated. To produce the projections, a 3D breast phantom was designed from empirical data, based on MRI data in its natural form. Projections were then created from the 3D breast phantom. The image reconstruction algorithm, based on FBP, was programmed in C++ in two ways: using the central processing unit (CPU) and using the Graphics Processing Unit (GPU). The image reconstruction time was measured for both implementations (CPU and GPU).

  13. Application of GPU to Multi-interfaces Advection and Reconstruction Solver (MARS)

    Nagatake, Taku; Takase, Kazuyuki; Kunugi, Tomoaki

    2010-01-01

    In nuclear engineering, a high performance computer system is necessary to perform large-scale computations. The Graphics Processing Unit (GPU) was developed as a rendering system in order to reduce the load on the Central Processing Unit (CPU). In graphics processing, high performance computing is needed to render high-quality 3D objects, for example in video games. The GPU therefore consists of many processing units and has a wide memory bandwidth. In this study, the Multi-interfaces Advection and Reconstruction Solver (MARS), which is one of the interface volume tracking methods for multi-phase flows, has been implemented on the GPU. Multi-phase flow computation is very important for nuclear reactors and other engineering fields. MARS consists of two computing parts: the interface tracking part and the fluid motion computing part. In the interface tracking part, the GPU (GTX280) was 6 times faster than the CPU (Dual-Xeon 5040), and in the fluid motion computing part the Poisson solver on the GPU (GTX285) was 22 times faster than on the CPU (Core i7). For the dam breaking problem, the result of GPU-MARS differed slightly from the experimental result. Because GPU-MARS was developed using a single-precision GPU, round-off error may have accumulated. (author)

  14. Design, Implementation, and Performance of CREAM Data Acquisition Software

    Zinn, S Y; Bagliesi, M G; Beatty, J J; Childers, J T; Coutu, S; Duvernois, M A; Ganel, O; Kim, H J; Lee, M H; Lutz, L; Malinine, A; Maestro, P; Marrocchesi, P S; Park, I H; Seo, E S; Song, C; Swordy, S; Wu, J

    2005-01-01

    Cosmic Ray Energetics and Mass (CREAM) is a balloon-borne experiment scheduled for launching from Antarctica in late 2004. Its aim is to measure the energy spectrum and composition of cosmic rays from proton to iron nuclei at ultra high energies from 1 to 1,000 TeV. Ultra long duration balloons are expected to fly about 100 days. One special feature of the CREAM data acquisition software (CDAQ) is the telemetric operation of the instrument using satellites. During a flight the science event and housekeeping data are sent from the instrument to a ground facility. Likewise, commands for controlling both the hardware and the software are uploaded from the ground facility. This requires a robust, reliable, and fast software system. CDAQ has been developed and tested during three beam tests at CERN in July, September, and November 2003. Recently the interfaces to the transition radiation detector (TRD) and to the timing-based charge detector (TCD) have been added. These new additions to CDAQ will be checked at a t...

  15. A Data Specification for Software Project Performance Measures: Results of a Collaboration on Performance Measurement

    Kasunic, Mark

    2008-01-01

    ... between completed projects. These terms and definitions were developed using a collaborative, consensus-based approach involving the Software Engineering Institute's Software Engineering Process Management program and service...

  16. The new CERN tape software - getting ready for total performance

    Cano, E; Kruse, D F; Kotlyar, V; Côme, D

    2015-01-01

    CASTOR (the CERN Advanced STORage system) is used to store the custodial copy of all of the physics data collected from the CERN experiments, both past and present. CASTOR is a hierarchical storage management system that has a disk-based front-end and a tape-based back-end. The software responsible for controlling the tape back-end has been redesigned and redeveloped over the last year and was put in production at the beginning of 2015. This paper summarises the motives behind the redesign, describes in detail the redevelopment work and concludes with the short and long-term benefits.

  17. Frequency Estimator Performance for a Software-Based Beacon Receiver

    Zemba, Michael J.; Morse, Jacquelynne Rose; Nessel, James A.; Miranda, Felix

    2014-01-01

    As propagation terminals have evolved, their design has trended more toward a software-based approach that facilitates convenient adjustment and customization of the receiver algorithms. One potential improvement is the implementation of a frequency estimation algorithm, through which the primary frequency component of the received signal can be estimated with a much greater resolution than with a simple peak search of the FFT spectrum. To select an estimator for usage in a QV-band beacon receiver, analysis of six frequency estimators was conducted to characterize their effectiveness as they relate to beacon receiver design.
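
    The idea of estimating the primary frequency more finely than the FFT bin spacing can be sketched as below. The abstract does not name the six estimators that were analysed, so this example assumes one common choice: quadratic interpolation of the log-magnitude around the peak of a Hann-windowed spectrum. A direct DFT is used to keep the sketch dependency-free, and all names are hypothetical.

```python
import cmath
import math


def estimate_frequency(samples, sample_rate):
    """Return (coarse, refined) frequency estimates in Hz.

    coarse  : frequency of the largest spectral bin (simple peak search)
    refined : peak location after parabolic interpolation of log-magnitudes
    ASSUMPTION: Hann window + quadratic interpolation, chosen for illustration.
    """
    n = len(samples)
    windowed = [
        s * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / n))
        for i, s in enumerate(samples)
    ]
    half = n // 2
    mags = []
    for k in range(half + 1):  # direct DFT, O(n^2), fine for a small demo
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(windowed)
        )
        mags.append(abs(acc))
    # Exclude the DC and Nyquist bins so both neighbours of the peak exist.
    peak = max(range(1, half), key=lambda k: mags[k])
    coarse = peak * sample_rate / n
    a, b, c = (math.log(max(mags[peak + d], 1e-300)) for d in (-1, 0, 1))
    delta = 0.5 * (a - c) / (a - 2.0 * b + c)  # vertex of the fitted parabola
    refined = (peak + delta) * sample_rate / n
    return coarse, refined


if __name__ == "__main__":
    fs, n, tone = 64.0, 64, 10.3
    signal = [math.cos(2.0 * math.pi * tone * i / fs) for i in range(n)]
    print(estimate_frequency(signal, fs))
```

    With 64 samples at 64 Hz the bin spacing is 1 Hz, so the simple peak search can only report 10 Hz for a 10.3 Hz tone, while the interpolated estimate lands much closer to the true value.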

  18. Assessing students' performance in software requirements engineering education using scoring rubrics

    Mkpojiogu, Emmanuel O. C.; Hussain, Azham

    2017-10-01

    The study investigates how helpful scoring rubrics are in assessing the performance of software requirements engineering students, and whether their use can lead to improved student performance in the development of software requirements artifacts and models. Scoring rubrics were used by two instructors to assess the cognitive performance of a student in the design and development of software requirements artifacts. The study results indicate that the use of scoring rubrics is very helpful in objectively assessing the performance of software requirements or software engineering students. Furthermore, the results revealed that scoring rubrics can also indicate the direction of achievement, showing whether or not a student is improving across repeated or iterative assessments. In a nutshell, their use leads to improved student performance. The results provide some insights for further investigation and will be beneficial to researchers, requirements engineers, system designers, developers and project managers.

  19. Software development for simplified performance tests and weekly performance check in Younggwang NPP Unit 3 and 4

    Hur, K. Y.; Jang, S. H.; Lee, J. W.; Kim, J. T.; Park, J. C.

    2002-01-01

    This paper covers the current status of turbine cycle performance test in nuclear power plants and the software development which can solve some shortcomings related to performance tests. The software developed is for simplified performance tests and weekly performance checks in Yonggwang nuclear power plant unit 3 and 4. This software includes the requirements from the efficiency division for the consistency with actual performance analysis work and the usability of the collected performance test data. From the working survey, we identify the difference between the embedded performance analysis modules and the actual performance analysis work. This software helps operation or maintenance personnel to reduce work load, to support the trend analysis of essential parameters in a turbine cycle, and to utilize the correction curves for the decision-making in their work

  20. GPU-based cone beam computed tomography.

    Noël, Peter B; Walczak, Alan M; Xu, Jinhui; Corso, Jason J; Hoffmann, Kenneth R; Schafer, Sebastian

    2010-06-01

    The use of cone beam computed tomography (CBCT) is growing in the clinical arena due to its ability to provide 3D information during interventions, its high diagnostic quality (sub-millimeter resolution), and its short scanning times (60 s). In many situations, the short scanning time of CBCT is followed by a time-consuming 3D reconstruction. The standard reconstruction algorithm for CBCT data is the filtered backprojection, which for a volume of size 256³ takes up to 25 min on a standard system. Recent developments in the area of Graphic Processing Units (GPUs) make it possible to have access to high-performance computing solutions at a low cost, allowing their use in many scientific problems. We have implemented an algorithm for 3D reconstruction of CBCT data using the Compute Unified Device Architecture (CUDA) provided by NVIDIA (NVIDIA Corporation, Santa Clara, California), which was executed on a NVIDIA GeForce GTX 280. Our implementation results in improved reconstruction times from minutes, and perhaps hours, to a matter of seconds, while also giving the clinician the ability to view 3D volumetric data at higher resolutions. We evaluated our implementation on ten clinical data sets and one phantom data set to observe if differences occur between CPU and GPU-based reconstructions. By using our approach, the computation time for 256³ is reduced from 25 min on the CPU to 3.2 s on the GPU. The GPU reconstruction time for 512³ volumes is 8.5 s. Copyright 2009 Elsevier Ireland Ltd. All rights reserved.
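
    The timings quoted in the abstract imply a speedup factor and a voxel throughput that are easy to work out. The short calculation below uses only the reported figures (25 min on the CPU and 3.2 s on the GPU for a volume with 256 voxels per edge, 8.5 s on the GPU for 512 voxels per edge); the helper names are hypothetical.

```python
def speedup(cpu_seconds, gpu_seconds):
    """Ratio of CPU time to GPU time for the same task."""
    return cpu_seconds / gpu_seconds


def voxel_throughput(edge, seconds):
    """Reconstructed voxels per second for a cubic volume with `edge` voxels per side."""
    return edge ** 3 / seconds


if __name__ == "__main__":
    print(f"speedup at 256 voxels/edge: {speedup(25 * 60, 3.2):.0f}x")
    print(f"GPU throughput, 256 voxels/edge: {voxel_throughput(256, 3.2):.3g} voxels/s")
    print(f"GPU throughput, 512 voxels/edge: {voxel_throughput(512, 8.5):.3g} voxels/s")
```

    The larger volume holds eight times as many voxels but takes well under three times as long, which suggests the GPU is better utilised at the higher resolution.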

  1. GPU-accelerated denoising of 3D magnetic resonance images

    Howison, Mark; Wes Bethel, E.

    2014-05-29

    The raw computational power of GPU accelerators enables fast denoising of 3D MR images using bilateral filtering, anisotropic diffusion, and non-local means. In practice, applying these filtering operations requires setting multiple parameters. This study was designed to provide better guidance to practitioners for choosing the most appropriate parameters by answering two questions: what parameters yield the best denoising results in practice? And what tuning is necessary to achieve optimal performance on a modern GPU? To answer the first question, we use two different metrics, mean squared error (MSE) and mean structural similarity (MSSIM), to compare denoising quality against a reference image. Surprisingly, the best improvement in structural similarity with the bilateral filter is achieved with a small stencil size that lies within the range of real-time execution on an NVIDIA Tesla M2050 GPU. Moreover, inappropriate choices for parameters, especially scaling parameters, can yield very poor denoising performance. To answer the second question, we perform an autotuning study to empirically determine optimal memory tiling on the GPU. The variation in these results suggests that such tuning is an essential step in achieving real-time performance. These results have important implications for the real-time application of denoising to MR images in clinical settings that require fast turn-around times.
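
    To make the parameters discussed in the abstract concrete, here is a one-dimensional sketch of a bilateral filter together with the mean squared error metric. The stencil radius and the two scaling parameters (spatial and intensity) are the kind of settings the study tunes; the 1D simplification and all names are assumptions made for illustration, and the study itself works on 3D images.

```python
import math


def bilateral_filter_1d(signal, radius, sigma_s, sigma_r):
    """Edge-preserving smoothing of a 1D signal.

    radius  : half-width of the stencil (neighbours considered on each side)
    sigma_s : spatial scale; how quickly weight falls off with distance
    sigma_r : range scale; how quickly weight falls off with intensity difference
    """
    n = len(signal)
    out = []
    for i in range(n):
        weight_sum = 0.0
        acc = 0.0
        for j in range(max(0, i - radius), min(n, i + radius + 1)):
            w = math.exp(
                -((i - j) ** 2) / (2.0 * sigma_s ** 2)
                - ((signal[i] - signal[j]) ** 2) / (2.0 * sigma_r ** 2)
            )
            weight_sum += w
            acc += w * signal[j]
        out.append(acc / weight_sum)  # weight_sum >= 1 because j == i has weight 1
    return out


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    reference = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
    noisy = [0.05, -0.04, 0.03, 0.96, 1.05, 0.98]
    denoised = bilateral_filter_1d(noisy, radius=2, sigma_s=1.0, sigma_r=0.2)
    print(mse(noisy, reference), mse(denoised, reference))
```

    A poorly chosen range scale illustrates the abstract's warning: if sigma_r is far larger than the step height the filter blurs the edge like a plain Gaussian, and if it is far smaller than the noise level almost no smoothing happens.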

  2. GPU accelerated likelihoods for stereo-based articulated tracking

    Friborg, Rune Møllegaard; Hauberg, Søren; Erleben, Kenny

    2010-01-01

    than a traditional CPU implementation. We explain the non-intuitive steps required to attain an optimized GPU implementation, where the dominant part is to hide the memory latency effectively. Benchmarks show that computations which previously required several minutes, are now performed in few seconds....

  3. Optimizing the Performance of Radionuclide Identification Software in the Hunt for Nuclear Security Threats

    Fotion, Katherine A.

    2016-01-01

    The Radionuclide Analysis Kit (RNAK), my team's most recent nuclide identification software, is entering the testing phase. A question arises: will removing rare nuclides from the software's library improve its overall performance? An affirmative response indicates fundamental errors in the software's framework, while a negative response confirms the effectiveness of the software's key machine learning algorithms. After thorough testing, I found that the performance of RNAK cannot be improved with the library choice effect, thus verifying the effectiveness of RNAK's algorithms - multiple linear regression, Bayesian network using the Viterbi algorithm, and branch and bound search.

  4. High-performance computing on GPUs for resistivity logging of oil and gas wells

    Glinskikh, V.; Dudaev, A.; Nechaev, O.; Surodina, I.

    2017-10-01

    We developed and implemented in software an algorithm for high-performance simulation of electrical logs from oil and gas wells using high-performance heterogeneous computing. The numerical solution of the 2D forward problem is based on the finite-element method and the Cholesky decomposition for solving a system of linear algebraic equations (SLAE). Software implementations of the algorithm were built with NVIDIA CUDA technology and computing libraries, allowing us to decompose the SLAE and find its solution on the central processing unit (CPU) and the graphics processing unit (GPU). The calculation time is analyzed as a function of the matrix size and the number of its non-zero elements. We estimated the computing speed on CPU and GPU, including high-performance heterogeneous CPU-GPU computing. Using the developed algorithm, we simulated resistivity data in realistic models.
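
    The solution step mentioned in the abstract, Cholesky decomposition followed by triangular solves, can be sketched on a small dense system. This is a plain reference version for a symmetric positive-definite matrix; the paper's implementation relies on CUDA computing libraries and most likely on sparse storage, neither of which is reproduced here, and the function names are hypothetical.

```python
import math


def cholesky(a):
    """Return lower-triangular L with L * L^T == a (a symmetric positive definite)."""
    n = len(a)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0.0:
                    raise ValueError("matrix is not positive definite")
                lower[i][j] = math.sqrt(d)
            else:
                lower[i][j] = (a[i][j] - s) / lower[j][j]
    return lower


def solve_spd(a, b):
    """Solve a x = b via Cholesky: forward solve L y = b, then back solve L^T x = y."""
    lower = cholesky(a)
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(lower[i][k] * y[k] for k in range(i))) / lower[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(lower[k][i] * x[k] for k in range(i + 1, n))) / lower[i][i]
    return x


if __name__ == "__main__":
    print(solve_spd([[4.0, 2.0], [2.0, 3.0]], [10.0, 8.0]))
```

    Once the factor is computed it can be reused for many right-hand sides, which is one reason a direct decomposition is attractive when the same system matrix serves several source configurations.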

  5. Balancing technical and regulatory concerns related to testing and control of performance assessment software

    Seitz, R.R.; Matthews, S.D.; Kostelnik, K.M.

    1990-01-01

    What activities are required to assure that a performance assessment (PA) computer code operates as it is intended? Answers to this question will vary depending on the individual's area of expertise. Different perspectives on testing and control of PA software are discussed based on interpretations of the testing and control process associated with the different involved parties. This discussion leads into the presentation of a general approach to software testing and control that address regulatory requirements. Finally, the need for balance between regulatory and scientific concerns is illustrated through lessons learned in previous implementations of software testing and control programs. Configuration control and software testing are required to provide assurance that a computer code performs as intended. Configuration control provides traceability and reproducibility of results produced with PA software and provides a system to assure that users have access to the current version of the software. Software testing is conducted to assure that the computer code has been written properly, solution techniques have been properly implemented, and the software is capable of representing the behavior of the specific system to be modeled. Comprehensive software testing includes: software analysis, verification testing, benchmark testing, and site-specific calibration/validation testing

  6. Accelerating Spaceborne SAR Imaging Using Multiple CPU/GPU Deep Collaborative Computing

    Fan Zhang

    2016-04-01

    With the development of synthetic aperture radar (SAR) technologies in recent years, the huge amount of remote sensing data brings challenges for real-time imaging processing. Therefore, high performance computing (HPC) methods have been presented to accelerate SAR imaging, especially the GPU based methods. In the classical GPU based imaging algorithm, GPU is employed to accelerate image processing by massive parallel computing, and CPU is only used to perform the auxiliary work such as data input/output (IO). However, the computing capability of CPU is ignored and underestimated. In this work, a new deep collaborative SAR imaging method based on multiple CPU/GPU is proposed to achieve real-time SAR imaging. Through the proposed tasks partitioning and scheduling strategy, the whole image can be generated with deep collaborative multiple CPU/GPU computing. In the part of CPU parallel imaging, the advanced vector extension (AVX) method is firstly introduced into the multi-core CPU parallel method for higher efficiency. As for the GPU parallel imaging, not only the bottlenecks of memory limitation and frequent data transferring are broken, but also kinds of optimized strategies are applied, such as streaming, parallel pipeline and so on. Experimental results demonstrate that the deep CPU/GPU collaborative imaging method enhances the efficiency of SAR imaging on single-core CPU by 270 times and realizes the real-time imaging in that the imaging rate outperforms the raw data generation rate.

  7. Accelerating Spaceborne SAR Imaging Using Multiple CPU/GPU Deep Collaborative Computing.

    Zhang, Fan; Li, Guojun; Li, Wei; Hu, Wei; Hu, Yuxin

    2016-04-07

    With the development of synthetic aperture radar (SAR) technologies in recent years, the huge amount of remote sensing data brings challenges for real-time imaging processing. Therefore, high performance computing (HPC) methods have been presented to accelerate SAR imaging, especially the GPU based methods. In the classical GPU based imaging algorithm, GPU is employed to accelerate image processing by massive parallel computing, and CPU is only used to perform the auxiliary work such as data input/output (IO). However, the computing capability of CPU is ignored and underestimated. In this work, a new deep collaborative SAR imaging method based on multiple CPU/GPU is proposed to achieve real-time SAR imaging. Through the proposed tasks partitioning and scheduling strategy, the whole image can be generated with deep collaborative multiple CPU/GPU computing. In the part of CPU parallel imaging, the advanced vector extension (AVX) method is firstly introduced into the multi-core CPU parallel method for higher efficiency. As for the GPU parallel imaging, not only the bottlenecks of memory limitation and frequent data transferring are broken, but also kinds of optimized strategies are applied, such as streaming, parallel pipeline and so on. Experimental results demonstrate that the deep CPU/GPU collaborative imaging method enhances the efficiency of SAR imaging on single-core CPU by 270 times and realizes the real-time imaging in that the imaging rate outperforms the raw data generation rate.

  8. Acceleration for 2D time-domain elastic full waveform inversion using a single GPU card

    Jiang, Jinpeng; Zhu, Peimin

    2018-05-01

    Full waveform inversion (FWI) is a challenging procedure due to the high computational cost related to the modeling, especially for the elastic case. The graphics processing unit (GPU) has become a popular device for the high-performance computing (HPC). To reduce the long computation time, we design and implement the GPU-based 2D elastic FWI (EFWI) in time domain using a single GPU card. We parallelize the forward modeling and gradient calculations using the CUDA programming language. To overcome the limitation of relatively small global memory on GPU, the boundary saving strategy is exploited to reconstruct the forward wavefield. Moreover, the L-BFGS optimization method used in the inversion increases the convergence of the misfit function. A multiscale inversion strategy is performed in the workflow to obtain the accurate inversion results. In our tests, the GPU-based implementations using a single GPU device achieve >15 times speedup in forward modeling, and about 12 times speedup in gradient calculation, compared with the eight-core CPU implementations optimized by OpenMP. The test results from the GPU implementations are verified to have enough accuracy by comparing the results obtained from the CPU implementations.

  9. ODYSSEY: A PUBLIC GPU-BASED CODE FOR GENERAL RELATIVISTIC RADIATIVE TRANSFER IN KERR SPACETIME

    Pu, Hung-Yi [Institute of Astronomy and Astrophysics, Academia Sinica, 11F of Astronomy-Mathematics Building, AS/NTU No. 1, Taipei 10617, Taiwan (China)]; Yun, Kiyun; Yoon, Suk-Jin [Department of Astronomy and Center for Galaxy Evolution Research, Yonsei University, Seoul 120-749 (Korea, Republic of)]; Younsi, Ziri [Institut für Theoretische Physik, Max-von-Laue-Straße 1, D-60438 Frankfurt am Main (Germany)]

    2016-04-01

    General relativistic radiative transfer calculations coupled with the calculation of geodesics in the Kerr spacetime are an essential tool for determining the images, spectra, and light curves from matter in the vicinity of black holes. Such studies are especially important for ongoing and upcoming millimeter/submillimeter very long baseline interferometry observations of the supermassive black holes at the centers of Sgr A* and M87. To this end we introduce Odyssey, a graphics processing unit (GPU) based code for ray tracing and radiative transfer in the Kerr spacetime. On a single GPU, the performance of Odyssey can exceed 1 ns per photon, per Runge–Kutta integration step. Odyssey is publicly available, fast, accurate, and flexible enough to be modified to suit the specific needs of new users. Along with a Graphical User Interface powered by a video-accelerated display architecture, we also present an educational software tool, Odyssey-Edu, for showing in real time how null geodesics around a Kerr black hole vary as a function of black hole spin and angle of incidence onto the black hole.

  10. Measuring CMS Software Performance in the first years of LHC collisions

    Benelli, Gabriele; Pfeiffer, Andreas; Piparo, Danilo; Zemleris, Vidmantas

    2011-01-01

    The CMSSW software framework is a complex project enabling the CMS collaboration to investigate the fast growing LHC collision data sample. A software performance suite of tools has been developed and integrated in CMSSW to keep track of cpu time, memory footprint and event size on disk. These three metrics are key constraints in software development in order to meet the computing requirements used in the planning and management of the CMS computing infrastructure. The performance suite allows the measurement and tracking of the performance across the framework, publishing the results in a dedicated database. A web application makes the results easily accessible to software release managers allowing for automatic integration in CMSSW release cycle quality assurance. The performance suite is also available to individual developers for dedicated code optimization and the web application allows historic regression and comparisons across releases. The performance suite tools and the performance of the CMSSW frame...

  11. Using applicative software and software tools for the performance of leaching and bio-leaching

    Krstev, Boris; Krstev, Aleksandar; Gocev, Zivko; Zdravev, Zoran; Krstev, Dejan; Zivanovic, Jordan

    2013-01-01

    The refractory or low grade lead/zinc domestic ores in the Republic of Macedonia are investigated by conventional separation technology or flotation separation. In the meantime, investigations are directed to the new possibilities of leaching by microorganisms – bioleaching. The paper is the result of these technologies and investigations carried out for recovery of in the mentioned ores. Using Simplex EVOP and computer program Multisimplex performances are appropriate and most acceptable and exce...

  12. Optimizing a mobile robot control system using GPU acceleration

    Tuck, Nat; McGuinness, Michael; Martin, Fred

    2012-01-01

    This paper describes our attempt to optimize a robot control program for the Intelligent Ground Vehicle Competition (IGVC) by running computationally intensive portions of the system on a commodity graphics processing unit (GPU). The IGVC Autonomous Challenge requires a control program that performs a number of different computationally intensive tasks ranging from computer vision to path planning. For the 2011 competition our Robot Operating System (ROS) based control system would not run comfortably on the multicore CPU on our custom robot platform. The process of profiling the ROS control program and selecting appropriate modules for porting to run on a GPU is described. A GPU-targeting compiler, Bacon, is used to speed up development and help optimize the ported modules. The impact of the ported modules on overall performance is discussed. We conclude that GPU optimization can free a significant amount of CPU resources with minimal effort for expensive user-written code, but that replacing heavily-optimized library functions is more difficult, and a much less efficient use of time.

  13. STEM image simulation with hybrid CPU/GPU programming

    Yao, Y.; Ge, B.H.; Shen, X.; Wang, Y.G.; Yu, R.C.

    2016-01-01

    STEM image simulation is achieved via hybrid CPU/GPU programming under parallel algorithm architecture to speed up calculation on a personal computer (PC). To utilize the calculation power of a PC fully, the simulation is performed using the GPU core and multi-CPU cores at the same time to significantly improve efficiency. GaSb and an artificial GaSb/InAs interface with atom diffusion have been used to verify the computation. - Highlights: • STEM image simulation is achieved by hybrid CPU/GPU programming under parallel algorithm architecture to speed up the calculation in the personal computer (PC). • In order to fully utilize the calculation power of the PC, the simulation is performed by GPU core and multi-CPU cores at the same time so efficiency is improved significantly. • GaSb and artificial GaSb/InAs interface with atom diffusion have been used to verify the computation. The results reveal some unintuitive phenomena about the contrast variation with the atom numbers.

  14. STEM image simulation with hybrid CPU/GPU programming

    Yao, Y., E-mail: yaoyuan@iphy.ac.cn; Ge, B.H.; Shen, X.; Wang, Y.G.; Yu, R.C.

    2016-07-15

    STEM image simulation is achieved via hybrid CPU/GPU programming under parallel algorithm architecture to speed up calculation on a personal computer (PC). To utilize the calculation power of a PC fully, the simulation is performed using the GPU core and multi-CPU cores at the same time to significantly improve efficiency. GaSb and an artificial GaSb/InAs interface with atom diffusion have been used to verify the computation. - Highlights: • STEM image simulation is achieved by hybrid CPU/GPU programming under parallel algorithm architecture to speed up the calculation in the personal computer (PC). • In order to fully utilize the calculation power of the PC, the simulation is performed by GPU core and multi-CPU cores at the same time so efficiency is improved significantly. • GaSb and artificial GaSb/InAs interface with atom diffusion have been used to verify the computation. The results reveal some unintuitive phenomena about the contrast variation with the atom numbers.

  15. The Ettention software package.

    Dahmen, Tim; Marsalek, Lukas; Marniok, Nico; Turoňová, Beata; Bogachev, Sviatoslav; Trampert, Patrick; Nickels, Stefan; Slusallek, Philipp

    2016-02-01

    We present a novel software package for the problem "reconstruction from projections" in electron microscopy. The Ettention framework consists of a set of modular building-blocks for tomographic reconstruction algorithms. The well-known block iterative reconstruction method based on Kaczmarz algorithm is implemented using these building-blocks, including adaptations specific to electron tomography. Ettention simultaneously features (1) a modular, object-oriented software design, (2) optimized access to high-performance computing (HPC) platforms such as graphic processing units (GPU) or many-core architectures like Xeon Phi, and (3) accessibility to microscopy end-users via integration in the IMOD package and eTomo user interface. We also provide developers with a clean and well-structured application programming interface (API) that allows for extending the software easily and thus makes it an ideal platform for algorithmic research while hiding most of the technical details of high-performance computing. Copyright © 2015 Elsevier B.V. All rights reserved.

  16. Performance Evaluation of a Software Engineering Tool for Automated Design of Cooling Systems in Injection Moulding

    Jauregui-Becker, Juan M.; Tosello, Guido; van Houten, Fred J.A.M.

    2013-01-01

    This paper presents a software tool for automating the design of cooling systems for injection moulding and a validation of its performance. Cooling system designs were automatically generated by the proposed software tool and by applying a best practice tool engineering design approach. The two...

  17. Comparative Performance Analysis of Machine Learning Techniques for Software Bug Detection

    Saiqa Aleem; Luiz Fernando Capretz; Faheem Ahmed

    2015-01-01

    Machine learning techniques can be used to analyse data from different perspectives and enable developers to retrieve useful information. Machine learning techniques have proven useful for software bug prediction. In this paper, a comparative performance analysis of different machine learning techniques is explored for software bug prediction on publicly available data sets. Results showed most of the mac ...

  18. The Effect of Firm Strategy and Corporate Performance on Software Market Growth in Emerging Regions

    Mertz, Sharon A.

    2013-01-01

    The purpose of this research is to evaluate the impact of firm strategies and corporate performance on enterprise software market growth in emerging regions. The emerging regions of Asia Pacific, Eastern Europe, the Middle East and Africa, and Latin America, currently represent smaller overall markets for software vendors, but exhibit high growth…

  19. Hardware support for software controlled fast multiplexing of performance counters

    Salapura, Valentina; Wisniewski, Robert W.

    2013-01-01

    Performance counters may be operable to collect one or more counts of one or more selected activities, and registers may be operable to store a set of performance counter configurations. A state machine may be operable to automatically select a register from the registers for reconfiguring the one or more performance counters in response to receiving a first signal. The state machine may be further operable to reconfigure the one or more performance counters based on a configuration specified in the selected register. The state machine yet further may be operable to copy data in selected one or more of the performance counters to a memory location, or to copy data from the memory location to the counters, in response to receiving a second signal. The state machine may be operable to store or restore the counter values and state machine configuration in response to a context switch event.
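
    The mechanism described above is hardware, but its behaviour can be mimicked in software to make the idea concrete. The sketch below is an assumption-laden toy model: the class `CounterMultiplexer`, its method names and the event names are hypothetical, and the "second signal" (copy to or from memory) is folded into `save` and `select_next`.

```python
class CounterMultiplexer:
    """Toy model of time-multiplexing a few counters over several configurations."""

    def __init__(self, configurations):
        # One list of watched event names per configuration register.
        self.configurations = [list(c) for c in configurations]
        self.selected = 0
        # Live hardware counters, one slot per watched event.
        self.counters = [0] * len(self.configurations[0])
        # Memory area holding the saved counts of every configuration.
        self.memory = [[0] * len(c) for c in self.configurations]

    def record(self, event):
        """Count an event if the active configuration watches it."""
        for slot, watched in enumerate(self.configurations[self.selected]):
            if watched == event:
                self.counters[slot] += 1

    def save(self):
        """Copy the live counters to the memory slot of the active configuration."""
        self.memory[self.selected] = list(self.counters)

    def select_next(self):
        """Switch to the next configuration register and restore its saved counts."""
        self.selected = (self.selected + 1) % len(self.configurations)
        self.counters = list(self.memory[self.selected])


if __name__ == "__main__":
    mux = CounterMultiplexer([["cycles", "loads"], ["stores", "branches"]])
    mux.record("loads")
    mux.save()
    mux.select_next()
    mux.record("branches")
    print(mux.selected, mux.counters)
```

    Calling `save` before every `select_next` keeps the accumulated counts of each configuration intact across switches, which is the property a context switch also needs.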

  20. Accelerating the XGBoost algorithm using GPU computing

    Rory Mitchell

    2017-07-01

    We present a CUDA-based implementation of a decision tree construction algorithm within the gradient boosting library XGBoost. The tree construction algorithm is executed entirely on the graphics processing unit (GPU) and shows high performance with a variety of datasets and settings, including sparse input matrices. Individual boosting iterations are parallelised, combining two approaches. An interleaved approach is used for shallow trees, switching to a more conventional radix sort-based approach for larger depths. We show speedups of between 3× and 6× using a Titan X compared to a 4 core i7 CPU, and 1.2× using a Titan X compared to 2× Xeon CPUs (24 cores). We show that it is possible to process the Higgs dataset (10 million instances, 28 features) entirely within GPU memory. The algorithm is made available as a plug-in within the XGBoost library and fully supports all XGBoost features including classification, regression and ranking tasks.
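
    To illustrate what a sort-based tree construction step computes, here is a serial sketch of split finding on one feature: instances are sorted by feature value and a running sum of gradients and Hessians is scanned to evaluate the standard second-order gain used in gradient boosting. It does not reproduce the CUDA implementation; the function name `best_split` and the omission of the per-leaf complexity penalty are simplifying assumptions.

```python
def best_split(values, grads, hess, reg_lambda=1.0):
    """Return (gain, threshold) of the best split on a single feature.

    gain = 0.5 * (GL^2/(HL+lambda) + GR^2/(HR+lambda) - G^2/(H+lambda)),
    where G and H are sums of first and second order gradients.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    g_total, h_total = sum(grads), sum(hess)

    def score(g, h):
        return g * g / (h + reg_lambda)

    best_gain, best_threshold = 0.0, None
    g_left = h_left = 0.0
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        # Running (prefix) sums over the sorted instances.
        g_left += grads[i]
        h_left += hess[i]
        if values[i] == values[nxt]:
            continue  # no threshold can separate equal feature values
        gain = 0.5 * (score(g_left, h_left)
                      + score(g_total - g_left, h_total - h_left)
                      - score(g_total, h_total))
        if gain > best_gain:
            best_gain = gain
            best_threshold = 0.5 * (values[i] + values[nxt])
    return best_gain, best_threshold


if __name__ == "__main__":
    print(best_split([1.0, 2.0, 3.0, 4.0], [-1.0, -1.0, 1.0, 1.0], [1.0] * 4))
```

    The sort and the prefix sums are the two steps that map naturally onto GPU primitives (radix sort and parallel scan), which is why this formulation parallelises well.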

  1. Employing multi-GPU power for molecular dynamics simulation: an extension of GALAMOST

    Zhu, You-Liang; Pan, Deng; Li, Zhan-Wei; Liu, Hong; Qian, Hu-Jun; Zhao, Yang; Lu, Zhong-Yuan; Sun, Zhao-Yan

    2018-04-01

    We describe the algorithm of employing multi-GPU power on the basis of Message Passing Interface (MPI) domain decomposition in a molecular dynamics code, GALAMOST, which is designed for the coarse-grained simulation of soft matters. The code of multi-GPU version is developed based on our previous single-GPU version. In multi-GPU runs, one GPU takes charge of one domain and runs single-GPU code path. The communication between neighbouring domains takes a similar algorithm of CPU-based code of LAMMPS, but is optimised specifically for GPUs. We employ a memory-saving design which can enlarge maximum system size at the same device condition. An optimisation algorithm is employed to prolong the update period of neighbour list. We demonstrate good performance of multi-GPU runs on the simulation of Lennard-Jones liquid, dissipative particle dynamics liquid, polymer and nanoparticle composite, and two-patch particles on workstation. A good scaling of many nodes on cluster for two-patch particles is presented.

  2. GPU-accelerated Gibbs ensemble Monte Carlo simulations of Lennard-Jonesium

    Mick, Jason; Hailat, Eyad; Russo, Vincent; Rushaidat, Kamel; Schwiebert, Loren; Potoff, Jeffrey

    2013-12-01

    This work describes an implementation of canonical and Gibbs ensemble Monte Carlo simulations on graphics processing units (GPUs). The pair-wise energy calculations, which consume the majority of the computational effort, are parallelized using the energetic decomposition algorithm. While energetic decomposition is relatively inefficient for traditional CPU-bound codes, the algorithm is ideally suited to the architecture of the GPU. The performance of the CPU and GPU codes is assessed for a variety of CPU and GPU combinations for systems containing between 512 and 131,072 particles. For a system of 131,072 particles, the GPU-enabled canonical and Gibbs ensemble codes were 10.3 and 29.1 times faster (GTX 480 GPU vs. i5-2500K CPU), respectively, than an optimized serial CPU-bound code. Due to overhead from memory transfers from system RAM to the GPU, the CPU code was slightly faster than the GPU code for simulations containing fewer than 600 particles. The critical temperature Tc∗=1.312(2) and density ρc∗=0.316(3) were determined for the tail corrected Lennard-Jones potential from simulations of 10,000 particle systems, and found to be in exact agreement with prior mixed field finite-size scaling calculations [J.J. Potoff, A.Z. Panagiotopoulos, J. Chem. Phys. 109 (1998) 10914].
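
    The pair-wise energy sum that dominates the cost can be sketched serially as follows, in reduced Lennard-Jones units with a cubic periodic box. This is an illustration of why the work decomposes by pair, not the authors' GPU code; the function names, the plain truncation at the cutoff and the absence of tail corrections are assumptions.

```python
def lj_pair_energy(r2):
    """Lennard-Jones energy (epsilon = sigma = 1) from a squared distance."""
    inv_r6 = 1.0 / (r2 * r2 * r2)
    return 4.0 * (inv_r6 * inv_r6 - inv_r6)


def particle_energy(index, positions, box, cutoff):
    """Sum the pair energies between one particle and all the others.

    Every term of the sum is independent, so a GPU can give one pair to
    one thread and combine the partial results with a reduction.
    """
    xi, yi, zi = positions[index]
    cutoff2 = cutoff * cutoff
    total = 0.0
    for j, (xj, yj, zj) in enumerate(positions):
        if j == index:
            continue
        dx, dy, dz = xi - xj, yi - yj, zi - zj
        # Minimum-image convention for a cubic periodic box.
        dx -= box * round(dx / box)
        dy -= box * round(dy / box)
        dz -= box * round(dz / box)
        r2 = dx * dx + dy * dy + dz * dz
        if r2 < cutoff2:
            total += lj_pair_energy(r2)
    return total


if __name__ == "__main__":
    pts = [(0.5, 0.0, 0.0), (9.5, 0.0, 0.0), (5.0, 5.0, 5.0)]
    print(particle_energy(0, pts, box=10.0, cutoff=3.0))
```

    In a Monte Carlo move only the energy of the displaced particle changes, so evaluating `particle_energy` before and after the trial move gives the energy difference needed for the acceptance test.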

  3. Ultrafast convolution/superposition using tabulated and exponential kernels on GPU

    Chen Quan; Chen Mingli; Lu Weiguo [TomoTherapy Inc., 1240 Deming Way, Madison, Wisconsin 53717 (United States)

    2011-03-15

    Purpose: Collapsed-cone convolution/superposition (CCCS) dose calculation is the workhorse for IMRT dose calculation. The authors present a novel algorithm for computing CCCS dose on the modern graphic processing unit (GPU). Methods: The GPU algorithm includes a novel TERMA calculation that has no write-conflicts and has linear computation complexity. The CCCS algorithm uses either tabulated or exponential cumulative-cumulative kernels (CCKs) as reported in literature. The authors have demonstrated that the use of exponential kernels can reduce the computation complexity by order of a dimension and achieve excellent accuracy. Special attention is paid to the unique architecture of the GPU, especially the memory access pattern, which increases performance by more than tenfold. Results: The tabulated kernel implementation on GPU is two to three times faster than other GPU implementations reported in literature. The implementation of CCCS showed significant speedup on GPU over single core CPU. On tabulated CCK, speedups as high as 70 are observed; on exponential CCK, speedups as high as 90 are observed. Conclusions: Overall, the GPU algorithm using exponential CCK is 1000-3000 times faster than a highly optimized single-threaded CPU implementation using tabulated CCK, while the dose differences are within 0.5% and 0.5 mm. This ultrafast CCCS algorithm will allow many time-sensitive applications to use accurate dose calculation.

  4. Design Patterns for Sparse-Matrix Computations on Hybrid CPU/GPU Platforms

    Valeria Cardellini

    2014-01-01

    We apply object-oriented software design patterns to develop code for scientific software involving sparse matrices. Design patterns arise when multiple independent developments produce similar designs which converge onto a generic solution. We demonstrate how to use design patterns to implement an interface for sparse matrix computations on NVIDIA GPUs starting from PSBLAS, an existing sparse matrix library, and from existing sets of GPU kernels for sparse matrices. We also compare the throughput of the PSBLAS sparse matrix–vector multiplication on two platforms exploiting the GPU with that obtained by a CPU-only PSBLAS implementation. Our experiments exhibit encouraging results regarding the comparison between CPU and GPU executions in double precision, obtaining a speedup of up to 35.35 on NVIDIA GTX 285 with respect to AMD Athlon 7750, and up to 10.15 on NVIDIA Tesla C2050 with respect to Intel Xeon X5650.

  5. GPU Parallel Bundle Block Adjustment

    ZHENG Maoteng

    2017-09-01

    To deal with massive data in photogrammetry, we introduce GPU parallel computing technology. The preconditioned conjugate gradient and inexact Newton methods are also applied to reduce the number of iterations needed to solve the normal equation. A brand new workflow of bundle adjustment is developed to utilize GPU parallel computing technology. Our method can avoid the storage and inversion of the big normal matrix, and compute the normal matrix in real time. The proposed method can not only largely decrease the memory requirement of the normal matrix, but also largely improve the efficiency of bundle adjustment. It also achieves the same accuracy as the conventional method. Preliminary experimental results show that the bundle adjustment of a dataset with about 4500 images and 9 million image points can be done in only 1.5 minutes while achieving sub-pixel accuracy.
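
    One way to solve the normal equation without ever storing or inverting the normal matrix is a matrix-free preconditioned conjugate gradient, sketched below. It is a generic illustration rather than the paper's workflow: the Jacobi (diagonal) preconditioner, the dense list-of-lists Jacobian and the function names are assumptions, and the Jacobian is assumed to have full column rank.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def jacobian_times(J, v):
    """Compute J v."""
    return [dot(row, v) for row in J]


def jacobian_transpose_times(J, u):
    """Compute J^T u."""
    return [sum(J[i][k] * u[i] for i in range(len(J))) for k in range(len(J[0]))]


def pcg_normal_equations(J, r, tol=1e-10, max_iter=100):
    """Solve (J^T J) x = J^T r using only products with J and J^T."""
    n = len(J[0])
    b = jacobian_transpose_times(J, r)
    # Jacobi preconditioner: diagonal of J^T J, i.e. squared column norms.
    diag = [sum(row[k] ** 2 for row in J) for k in range(n)]
    x = [0.0] * n
    res = list(b)
    z = [ri / di for ri, di in zip(res, diag)]
    p = list(z)
    rz = dot(res, z)
    for _ in range(max_iter):
        if dot(res, res) ** 0.5 < tol:
            break
        # Normal-matrix product formed on the fly as J^T (J p).
        Ap = jacobian_transpose_times(J, jacobian_times(J, p))
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        res = [ri - alpha * ai for ri, ai in zip(res, Ap)]
        z = [ri / di for ri, di in zip(res, diag)]
        rz_new = dot(res, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x


if __name__ == "__main__":
    J = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
    print(pcg_normal_equations(J, [1.0, 2.0, 3.0]))
```

    Because only products with J and its transpose are needed, memory grows with the number of observations rather than with the square of the number of unknowns.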

  6. The GPU implementation of micro - Doppler period estimation

    Yang, Liyuan; Wang, Junling; Bi, Ran

    2018-03-01

    To address the high computational complexity and poor real-time performance of wideband radar echo signal processing, this paper designs a program based on the CPU-GPU heterogeneous parallel structure to improve the real-time extraction of micro-motion features. Firstly, we discuss the principle of the micro-Doppler effect generated by the rolling of scattering points on an orbiting satellite, and analyse how to use a Kalman filter to compensate for the translational motion of the tumbling satellite and how to use joint time-frequency analysis and the inverse Radon transform to extract the micro-motion features from the compensated echo. Secondly, the advantages of the GPU for real-time processing and the working principle of CPU-GPU heterogeneous parallelism are analysed, and a GPU-based program flow for extracting the micro-motion features from the radar echo signal of a rolling satellite is designed. Finally, the extraction results are given to verify the correctness of the program and algorithm.

  7. GPU PRO 3 Advanced rendering techniques

    Engel, Wolfgang

    2012-01-01

    GPU Pro3, the third volume in the GPU Pro book series, offers practical tips and techniques for creating real-time graphics that are useful to beginners and seasoned game and graphics programmers alike. Section editors Wolfgang Engel, Christopher Oat, Carsten Dachsbacher, Wessam Bahnassi, and Sebastien St-Laurent have once again brought together a high-quality collection of cutting-edge techniques for advanced GPU programming. With contributions by more than 50 experts, GPU Pro3: Advanced Rendering Techniques covers battle-tested tips and tricks for creating interesting geometry, realistic sha

  8. Fast Simulation of Dynamic Ultrasound Images Using the GPU.

    Storve, Sigurd; Torp, Hans

    2017-10-01

    Simulated ultrasound data is a valuable tool for development and validation of quantitative image analysis methods in echocardiography. Unfortunately, simulation time can become prohibitive for phantoms consisting of a large number of point scatterers. The COLE algorithm by Gao et al. is a fast convolution-based simulator that trades simulation accuracy for improved speed. We present highly efficient parallelized CPU and GPU implementations of the COLE algorithm with an emphasis on dynamic simulations involving moving point scatterers. We argue that it is crucial to minimize the amount of data transfers from the CPU to achieve good performance on the GPU. We achieve this by storing the complete trajectories of the dynamic point scatterers as spline curves in the GPU memory. This leads to good efficiency when simulating sequences consisting of a large number of frames, such as B-mode and tissue Doppler data for a full cardiac cycle. In addition, we propose a phase-based subsample delay technique that efficiently eliminates flickering artifacts seen in B-mode sequences when COLE is used without enough temporal oversampling. To assess the performance, we used a laptop computer and a desktop computer, each equipped with a multicore Intel CPU and an NVIDIA GPU. Running the simulator on a high-end TITAN X GPU, we observed two orders of magnitude speedup compared to the parallel CPU version, three orders of magnitude speedup compared to simulation times reported by Gao et al. in their paper on COLE, and a speedup of 27000 times compared to the multithreaded version of Field II, using numbers reported in a paper by Jensen. We hope that by releasing the simulator as an open-source project we will encourage its use and further development.

  9. Interactive Light Stimulus Generation with High Performance Real-Time Image Processing and Simple Scripting

    László Szécsi

    2017-12-01

    Light stimulation with precise and complex spatial and temporal modulation is demanded by a series of research fields like visual neuroscience, optogenetics, ophthalmology, and visual psychophysics. We developed a user-friendly and flexible stimulus generating framework (GEARS GPU-based Eye And Retina Stimulation Software), which offers access to GPU computing power, and allows interactive modification of stimulus parameters during experiments. Furthermore, it has built-in support for driving external equipment, as well as for synchronization tasks, via USB ports. The use of GEARS does not require elaborate programming skills. The necessary scripting is visually aided by an intuitive interface, while the details of the underlying software and hardware components remain hidden. Internally, the software is a C++/Python hybrid using OpenGL graphics. Computations are performed on the GPU, and are defined in the GLSL shading language. However, all GPU settings, including the GPU shader programs, are automatically generated by GEARS. This is configured through a method encountered in game programming, which allows high flexibility: stimuli are straightforwardly composed using a broad library of basic components. Stimulus rendering is implemented solely in C++, therefore intermediary libraries for interfacing could be omitted. This enables the program to perform computationally demanding tasks like en-masse random number generation or real-time image processing by local and global operations.

  10. Interactive Light Stimulus Generation with High Performance Real-Time Image Processing and Simple Scripting.

    Szécsi, László; Kacsó, Ágota; Zeck, Günther; Hantz, Péter

    2017-01-01

    Light stimulation with precise and complex spatial and temporal modulation is demanded by a series of research fields like visual neuroscience, optogenetics, ophthalmology, and visual psychophysics. We developed a user-friendly and flexible stimulus generating framework (GEARS GPU-based Eye And Retina Stimulation Software), which offers access to GPU computing power, and allows interactive modification of stimulus parameters during experiments. Furthermore, it has built-in support for driving external equipment, as well as for synchronization tasks, via USB ports. The use of GEARS does not require elaborate programming skills. The necessary scripting is visually aided by an intuitive interface, while the details of the underlying software and hardware components remain hidden. Internally, the software is a C++/Python hybrid using OpenGL graphics. Computations are performed on the GPU, and are defined in the GLSL shading language. However, all GPU settings, including the GPU shader programs, are automatically generated by GEARS. This is configured through a method encountered in game programming, which allows high flexibility: stimuli are straightforwardly composed using a broad library of basic components. Stimulus rendering is implemented solely in C++, therefore intermediary libraries for interfacing could be omitted. This enables the program to perform computationally demanding tasks like en-masse random number generation or real-time image processing by local and global operations.

  11. Analysis for Parallel Execution without Performing Hardware/Software Co-simulation

    Muhammad Rashid

    2014-01-01

    Hardware/software co-simulation improves the performance of embedded applications by executing the applications on a virtual platform before the actual hardware is available in silicon. However, the virtual platform of the target architecture is often not available during early stages of the embedded design flow. Consequently, analysis for parallel execution without performing hardware/software co-simulation is required. This article presents an analysis methodology for parallel execution of ...

  12. A performance improvement plan to increase nurse adherence to use of medication safety software.

    Gavriloff, Carrie

    2012-08-01

    Nurses can protect patients receiving intravenous (IV) medication by using medication safety software to program "smart" pumps to administer IV medications. After a patient safety event identified inconsistent use of medication safety software by nurses, a performance improvement team implemented the Deming Cycle performance improvement methodology. The combined use of improved direct care nurse communication, programming strategies, staff education, medication safety champions, adherence monitoring, and technology acquisition resulted in a statistically significant (p < .001) increase in nurse adherence to using medication safety software from 28% to above 85%, exceeding national benchmark adherence rates (Cohen, Cooke, Husch & Woodley, 2007; Carefusion, 2011). Copyright © 2012 Elsevier Inc. All rights reserved.

  13. FPGAs for software programmers

    Hannig, Frank; Ziener, Daniel

    2016-01-01

    This book makes powerful Field Programmable Gate Array (FPGA) and reconfigurable technology accessible to software engineers by covering different state-of-the-art high-level synthesis approaches (e.g., OpenCL and several C-to-gates compilers). It introduces FPGA technology, its programming model, and how various applications can be implemented on FPGAs without going through low-level hardware design phases. Readers will get a realistic sense for problems that are suited for FPGAs and how to implement them from a software designer’s point of view. The authors demonstrate that FPGAs and their programming model reflect the needs of stream processing problems much better than traditional CPU or GPU architectures, making them well-suited for a wide variety of systems, from embedded systems performing sensor processing to large setups for Big Data number crunching. This book serves as an invaluable tool for software designers and FPGA design engineers who are interested in high design productivity through behavi...

  14. FULL GPU Implementation of Lattice-Boltzmann Methods with Immersed Boundary Conditions for Fast Fluid Simulations

    G Boroni

    2017-03-01

    Lattice Boltzmann Method (LBM) has shown great potential in fluid simulations, but performance issues and difficulties in managing complex boundary conditions have hindered wider application. The advent of Graphics Processing Unit (GPU) computing offered a possible solution to the performance issue, and methods like the Immersed Boundary (IB) algorithm proved to be a flexible solution for boundaries. Unfortunately, the implicit IB algorithm makes the LBM implementation on GPU a non-trivial task. This work presents a fully parallel GPU implementation of LBM in combination with IB. The fluid-boundary interaction is implemented via GPU kernels, using execution configurations and data structures specifically designed to accelerate each code execution. Simulations were validated against experimental and analytical data, showing good agreement and improved computational time. Substantial reductions in calculation time were achieved, lowering the time required to execute the same model on a CPU by about two orders of magnitude.

  15. Multi-GPU accelerated three-dimensional FDTD method for electromagnetic simulation.

    Nagaoka, Tomoaki; Watanabe, Soichi

    2011-01-01

    Numerical simulation with a numerical human model using the finite-difference time domain (FDTD) method has recently been performed in a number of fields in biomedical engineering. To improve the method's calculation speed and realize large-scale computing with the numerical human model, we adapt three-dimensional FDTD code to a multi-GPU environment using Compute Unified Device Architecture (CUDA). In this study, we used NVIDIA Tesla C2070 as GPGPU boards. The performance of multi-GPU is evaluated in comparison with that of a single GPU and vector supercomputer. The calculation speed with four GPUs was approximately 3.5 times faster than with a single GPU, and was slightly (approx. 1.3 times) slower than with the supercomputer. Calculation speed of the three-dimensional FDTD method using GPUs can significantly improve with an expanding number of GPUs.
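
    The reported multi-GPU figure can be read through two standard scaling metrics. The sketch below computes parallel efficiency and the Karp-Flatt serial fraction for the 3.5 times speedup on four GPUs quoted above; this is a back-of-the-envelope reading under an Amdahl-style model, not an analysis taken from the paper, and the function names are hypothetical.

```python
def parallel_efficiency(speedup, num_devices):
    """Fraction of ideal linear scaling that was achieved."""
    return speedup / num_devices


def karp_flatt_serial_fraction(speedup, num_devices):
    """Experimentally determined serial fraction (Karp-Flatt metric)."""
    return (1.0 / speedup - 1.0 / num_devices) / (1.0 - 1.0 / num_devices)


def amdahl_speedup(serial_fraction, num_devices):
    """Speedup predicted by Amdahl's law for a given serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / num_devices)


if __name__ == "__main__":
    eff = parallel_efficiency(3.5, 4)
    frac = karp_flatt_serial_fraction(3.5, 4)
    print(f"efficiency = {eff:.3f}, serial fraction = {frac:.4f}")
    # Extrapolation under the same model, purely illustrative.
    print(f"predicted speedup on 8 devices = {amdahl_speedup(frac, 8):.2f}")
```

    Under these assumptions the efficiency is 87.5% and the implied non-parallel share, which would include inter-GPU communication, is just under 5%.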

  16. Planning and Analysis of the Company’s Financial Performances by Using a Software Simulation

    Meri BOSHKOSKA

    2017-06-01

    Information technology includes a wide range of software solutions that help managers in decision-making processes in order to increase the company's business performance. Software for financial analysis is a valuable tool for managers in the financial decision-making process. The objective of the study was accomplished by developing software that easily determines the financial performance of the company by integrating the analysis of financial indicators with the DuPont profitability analysis model. Through this software, managers will be able to calculate the current financial state and visually analyze how their actions will affect the financial performance of the company. This will enable them to identify the best ways to improve the financial performance of the company. The software can perform a financial analysis and give a clear, useful overview of the current business performance, and can also help in planning the growth of the company. The software can also be used for educational purposes by students and managers in the field of financial management.
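
    For readers unfamiliar with the DuPont model mentioned above, the three-factor decomposition of return on equity can be written in a few lines. This is a generic sketch of the textbook identity, not the software described in the paper; the function name `dupont` and the input figures are hypothetical.

```python
def dupont(net_income, revenue, total_assets, equity):
    """Three-factor DuPont decomposition of return on equity (ROE).

    ROE = net profit margin * asset turnover * equity multiplier
    """
    margin = net_income / revenue          # profitability
    turnover = revenue / total_assets      # asset efficiency
    multiplier = total_assets / equity     # financial leverage
    return {
        "net_profit_margin": margin,
        "asset_turnover": turnover,
        "equity_multiplier": multiplier,
        "return_on_equity": margin * turnover * multiplier,
    }


if __name__ == "__main__":
    # Illustrative figures only, in arbitrary currency units.
    for name, value in dupont(10.0, 100.0, 200.0, 50.0).items():
        print(f"{name}: {value:.3f}")
```

    Splitting ROE this way shows whether a change in profitability comes from margins, from how intensively assets are used, or from leverage.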

  17. Evolution of GPU nuclear's training program

    Long, R.L.; Coe, R.P.

    1987-01-01

    GPU Nuclear Corporation (GPUN) manages the operators of Three Mile Island Unit 1 and Oyster Creek Nuclear Generating Stations and the recovery activities at the Three Mile Island Unit 2 plant. From the time it was formed in January 1980 GPUN emphasized the use of behavioral learning objectives as the basis for all its training programs. This paper describes the evolution to a formalized performance based Training System Development (TSD) Process. The Training and Education Department staff increased from 10 in 1979 to the current 120 dedicated professionals, with a corresponding increase in facilities and acquisition of sophisticated Basic Principles Training Simulators and a Three Mile Island Unit 1 control Room Replica Simulator. The impact of these developments and achievement of full INPO accreditation are discussed and related to plant performance improvements

  18. Remote software upload techniques in future vehicles and their performance analysis

    Hossain, Irina

    could benefit from it. However, as with unicast RSU, the security requirements of multicast communication, i.e., authenticity, confidentiality and integrity of the software transmitted and access control of the group members, are challenging. In this thesis, an infrastructure-based mobile multicasting scheme for RSU in vehicle ECUs is proposed, where an ECU receives the software from a remote software distribution center using the roadside BSs as gateways. The Vehicular Software Distribution Network (VSDN) is divided into small regions administered by a Regional Group Manager (RGM). Two multicast Group Key Management (GKM) techniques are proposed based on the degree of trust placed in the BSs, named the Fully-trusted (FT) and Semi-trusted (ST) systems. Analytical models are developed to find the multicast session establishment latency and handover latency for these two protocols. The average latency to perform mutual authentication of the software vendor and a vehicle and to send the multicast session key by the software provider during multicast session initialization, and the handoff latency during a multicast session, are calculated. Analytical and simulation results show that the link establishment latency per vehicle of our proposed schemes is in the range of a few seconds and that the ST system requires a few ms more than the FT system. The handoff latency is also in the range of a few seconds, and in some cases the ST system requires less handoff time than the FT system. Thus, it is possible to build an efficient GKM protocol without putting too much trust in the BSs.

  19. Parallel hyperbolic PDE simulation on clusters: Cell versus GPU

    Rostrup, Scott; De Sterck, Hans

    2010-12-01

    Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications. Program summary. Program title: SWsolver. Catalogue identifier: AEGY_v1_0. Program summary URL
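
    Explicit time integration on a structured grid suits data-parallel devices because every cell's new value depends only on old values of its neighbours. The sketch below illustrates this with a first-order upwind step for one-dimensional linear advection on a periodic grid. This is an assumed, simplified stand-in chosen for illustration: the abstract does not state which numerical scheme SWsolver uses, and the function name is hypothetical.

```python
def upwind_step(u, c):
    """Advance u by one explicit time step of u_t + a*u_x = 0 (a > 0).

    u: list of cell values on a periodic 1D structured grid.
    c: Courant number a*dt/dx; the scheme is stable for 0 <= c <= 1.
    Each output cell reads only old values, so all cells can be
    updated independently (the data-parallel property).
    """
    # u[i - 1] with i == 0 indexes the last cell, giving periodic wrap-around.
    return [u[i] - c * (u[i] - u[i - 1]) for i in range(len(u))]


# With c = 1 the profile moves exactly one cell to the right per step.
print(upwind_step([0, 1, 0, 0], 1.0))
```

    On a GPU, the list comprehension would correspond to one thread per cell; on a cluster, MPI would exchange the boundary cells between neighbouring subdomains before each step.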

  20. GPU-accelerated atmospheric chemical kinetics in the ECHAM/MESSy (EMAC) Earth system model (version 2.52)

    Alvanos, Michail; Christoudias, Theodoros

    2017-10-01

    This paper presents an application of GPU accelerators in Earth system modeling. We focus on atmospheric chemical kinetics, one of the most computationally intensive tasks in climate-chemistry model simulations. We developed a software package that automatically generates CUDA kernels to numerically integrate atmospheric chemical kinetics in the global climate model ECHAM/MESSy Atmospheric Chemistry (EMAC), used to study climate change and air quality scenarios. A source-to-source compiler outputs a CUDA-compatible kernel by parsing the FORTRAN code generated by the Kinetic PreProcessor (KPP) general analysis tool. All Rosenbrock methods that are available in the KPP numerical library are supported. Performance evaluation, using Fermi and Pascal CUDA-enabled GPU accelerators, shows achieved speed-ups of 4.5× and 20.4×, respectively, of the kernel execution time. A node-to-node real-world production performance comparison shows a 1.75× speed-up over the non-accelerated application using the KPP three-stage Rosenbrock solver. We provide a detailed description of the code optimizations used to improve the performance including memory optimizations, control code simplification, and reduction of idle time. The accuracy and correctness of the accelerated implementation are evaluated by comparing to the CPU-only code of the application. The median relative difference is found to be less than 0.000000001% when comparing the output of the accelerated kernel with the CPU-only code. The approach followed, including the computational workload division, and the developed GPU solver code can potentially be used as the basis for hardware acceleration of numerous geoscientific models that rely on KPP for atmospheric chemical kinetics applications.
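
    The accuracy check described in this abstract compares accelerated output against CPU-only output through a median relative difference. A minimal sketch of such a metric follows. The exact definition used by the authors is not given in the abstract, so the formula (absolute difference divided by the absolute reference value, in percent, skipping zero references) and the function name are assumptions.

```python
from statistics import median


def median_relative_difference_percent(reference, candidate):
    """Median of |candidate - reference| / |reference|, in percent.

    Pairs whose reference value is zero are skipped, since the
    relative difference is undefined there (an assumed convention).
    """
    diffs = [
        abs(c - r) / abs(r) * 100.0
        for r, c in zip(reference, candidate)
        if r != 0.0
    ]
    if not diffs:
        raise ValueError("no non-zero reference values to compare")
    return median(diffs)


# Hypothetical outputs from a CPU run and an accelerated run.
cpu = [1.0, 2.0, 4.0]
gpu = [1.01, 2.0, 4.08]
print(median_relative_difference_percent(cpu, gpu))
```

    The median is less sensitive than the mean to a few cells with very small reference values, which can otherwise dominate a relative-error statistic.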

  1. GPU-accelerated atmospheric chemical kinetics in the ECHAM/MESSy (EMAC) Earth system model (version 2.52)

    M. Alvanos

    2017-10-01

    This paper presents an application of GPU accelerators in Earth system modeling. We focus on atmospheric chemical kinetics, one of the most computationally intensive tasks in climate–chemistry model simulations. We developed a software package that automatically generates CUDA kernels to numerically integrate atmospheric chemical kinetics in the global climate model ECHAM/MESSy Atmospheric Chemistry (EMAC), used to study climate change and air quality scenarios. A source-to-source compiler outputs a CUDA-compatible kernel by parsing the FORTRAN code generated by the Kinetic PreProcessor (KPP) general analysis tool. All Rosenbrock methods that are available in the KPP numerical library are supported. Performance evaluation, using Fermi and Pascal CUDA-enabled GPU accelerators, shows achieved speed-ups of 4.5× and 20.4×, respectively, of the kernel execution time. A node-to-node real-world production performance comparison shows a 1.75× speed-up over the non-accelerated application using the KPP three-stage Rosenbrock solver. We provide a detailed description of the code optimizations used to improve the performance including memory optimizations, control code simplification, and reduction of idle time. The accuracy and correctness of the accelerated implementation are evaluated by comparing to the CPU-only code of the application. The median relative difference is found to be less than 0.000000001% when comparing the output of the accelerated kernel with the CPU-only code. The approach followed, including the computational workload division, and the developed GPU solver code can potentially be used as the basis for hardware acceleration of numerous geoscientific models that rely on KPP for atmospheric chemical kinetics applications.

  2. Software Design Document for the AMP Nuclear Fuel Performance Code

    Philip, Bobby; Clarno, Kevin T.; Cochran, Bill

    2010-01-01

    The purpose of this document is to describe the design of the AMP nuclear fuel performance code. It provides an overview of the decomposition into separable components, an overview of what those components will do, and the strategic basis for the design. The primary components of a computational physics code include a user interface, physics packages, material properties, mathematics solvers, and computational infrastructure. Some capability from established off-the-shelf (OTS) packages will be leveraged in the development of AMP, but the primary physics components will be entirely new. The material properties required by these physics operators include many highly non-linear properties, which will be replicated from FRAPCON and LIFE where applicable, as well as some computationally-intensive operations, such as gap conductance, which depends upon the plenum pressure. Because there is extensive capability in off-the-shelf leadership class computational solvers, AMP will leverage the Trilinos, PETSc, and SUNDIALS packages. The computational infrastructure includes a build system, mesh database, and other building blocks of a computational physics package. The user interface will be developed through a collaborative effort with the Nuclear Energy Advanced Modeling and Simulation (NEAMS) Capability Transfer program element as much as possible and will be discussed in detail in a future document.

  3. GPU TECHNOLOGIES EMBODIED IN PARALLEL SOLVERS OF LINEAR ALGEBRAIC EQUATION SYSTEMS

    Sidorov Alexander Vladimirovich

    2012-10-01

    The author reviews existing shareware solvers that run on graphics processing hardware. The purpose of this review is to explore the opportunities and limitations of these parallel solvers for the linear algebraic problems that arise at the Research and Educational Centre of Computer Modeling at MSUCE and the Research and Engineering Centre STADYO. The author has explored new applications of the GPU in the PETSc suite and compared them with the results obtained without the GPU. The research is performed using the CUSP library, developed to solve linear algebra problems on the GPU. The author has also reviewed the new MAGMA project, which is analogous to LAPACK for the GPU.

  4. Performance of student software development teams: the influence of personality and identifying as team members

    Monaghan, Conal; Bizumic, Boris; Reynolds, Katherine; Smithson, Michael; Johns-Boast, Lynette; van Rooy, Dirk

    2015-01-01

    One prominent approach in the exploration of the variations in project team performance has been to study two components of the aggregate personalities of the team members: conscientiousness and agreeableness. A second line of research, known as self-categorisation theory, argues that identifying as team members and the team's performance norms should substantially influence the team's performance. This paper explores the influence of both these perspectives in university software engineering project teams. Eighty students worked to complete a piece of software in small project teams during 2007 or 2008. To reduce limitations in statistical analysis, Monte Carlo simulation techniques were employed to extrapolate from the results of the original sample to a larger simulated sample (2043 cases, within 319 teams). The results emphasise the importance of taking into account personality (particularly conscientiousness), and both team identification and the team's norm of performance, in order to cultivate higher levels of performance in student software engineering project teams.

  5. The design, validation, and performance of Grace

    Ru Zhu

    2016-05-01

    The design, validation and performance of Grace, a GPU-accelerated micromagnetic simulation software package, are presented. The software adopts C++ Accelerated Massive Parallelism (C++ AMP) so that it runs on GPUs from various hardware vendors, including NVidia, AMD and Intel. At large simulation scales, a speedup of up to two orders of magnitude is observed compared to the CPU-based micromagnetic simulation software OOMMF. The software can run on high-end professional GPUs as well as budget personal laptops, and is free to download.

  6. Interior Point Methods on GPU with application to Model Predictive Control

    Gade-Nielsen, Nicolai Fog

    The goal of this thesis is to investigate the application of interior point methods to solve dynamical optimization problems, using a graphical processing unit (GPU) with a focus on problems arising in Model Predictive Control (MPC). Multi-core processors have been available for over ten years now...... software package called GPUOPT, available under the non-restrictive MIT license. GPUOPT includes a primal-dual interior-point method, which supports both the CPU and the GPU. It is implemented as multiple components, where the matrix operations and solver for the Newton directions is separated...

  7. Development of High-speed Visualization System of Hypocenter Data Using CUDA-based GPU computing

    Kumagai, T.; Okubo, K.; Uchida, N.; Matsuzawa, T.; Kawada, N.; Takeuchi, N.

    2014-12-01

    Since the Great East Japan Earthquake on March 11, 2011, intelligent visualization of seismic information has become important for understanding earthquake phenomena. At the same time, the quantity of seismic data has become enormous as high-accuracy observation networks have advanced; many parameters (e.g., positional information, origin time, magnitude, etc.) must be handled to display the seismic information efficiently. Therefore, high-speed processing of data and image information is necessary to handle enormous amounts of seismic data. Recently, the GPU (Graphics Processing Unit) has been used as an acceleration tool for data processing and calculation in various fields of study. This approach is called GPGPU (General-Purpose computing on GPUs). In the last few years, GPU performance has kept improving rapidly. GPU computing provides a high-performance computing environment at a lower cost than before. Moreover, using a GPU is advantageous for visualizing processed data, because the GPU was originally an architecture for graphics processing. In GPU computing, the processed data is always stored in the video memory. Therefore, we can write drawing information directly to the VRAM on the video card by combining CUDA and a graphics API. In this study, we employ CUDA and OpenGL and/or DirectX to realize a full-GPU implementation. This method makes it possible to write drawing information to the VRAM on the video card without PCIe bus data transfer, which enables high-speed processing of seismic data. The present study examines GPU computing-based high-speed visualization and the feasibility of a high-speed visualization system for hypocenter data.

  8. GPU-based prompt gamma ray imaging from boron neutron capture therapy

    Yoon, Do-Kun; Jung, Joo-Young; Suk Suh, Tae; Jo Hong, Key; Sil Lee, Keum

    2015-01-01

    Purpose: The purpose of this research is to perform the fast reconstruction of a prompt gamma ray image using a graphics processing unit (GPU) computation from boron neutron capture therapy (BNCT) simulations. Methods: To evaluate the accuracy of the reconstructed image, a phantom including four boron uptake regions (BURs) was used in the simulation. After the Monte Carlo simulation of the BNCT, the modified ordered subset expectation maximization reconstruction algorithm using the GPU computation was used to reconstruct the images with fewer projections. The computation times for image reconstruction were compared between the GPU and the central processing unit (CPU). Also, the accuracy of the reconstructed image was evaluated by a receiver operating characteristic (ROC) curve analysis. Results: The image reconstruction time using the GPU was 196 times faster than the conventional reconstruction time using the CPU. For the four BURs, the area under curve values from the ROC curve were 0.6726 (A-region), 0.6890 (B-region), 0.7384 (C-region), and 0.8009 (D-region). Conclusions: The tomographic image using the prompt gamma ray event from the BNCT simulation was acquired using the GPU computation in order to perform a fast reconstruction during treatment. The authors verified the feasibility of the prompt gamma ray image reconstruction using the GPU computation for BNCT simulations
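
    The area-under-curve values quoted in this abstract summarize a ROC analysis. As an illustration of what that number measures, the sketch below computes AUC through its rank interpretation: the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half. How the authors labelled and scored the reconstructed images is not stated in the abstract, so the inputs and the function name here are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 1 (positive) or 0 (negative).
    scores: iterable of numeric scores, higher meaning more positive.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores: 3 of the 4 positive/negative pairs are ordered correctly.
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

    An AUC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation, which gives a scale for reading the four per-region values reported above.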

  9. The CUBLAS and CULA based GPU acceleration of adaptive finite element framework for bioluminescence tomography.

    Zhang, Bo; Yang, Xiang; Yang, Fei; Yang, Xin; Qin, Chenghu; Han, Dong; Ma, Xibo; Liu, Kai; Tian, Jie

    2010-09-13

    In molecular imaging (MI), especially optical molecular imaging, bioluminescence tomography (BLT) has emerged as an effective imaging modality for small animal imaging. Finite element methods (FEMs), especially the adaptive finite element (AFE) framework, play an important role in BLT. The processing speed of the FEMs and the AFE framework still needs to be improved, although multi-thread CPU technology and multi-CPU technology have already been applied. In this paper, we introduce, for the first time, a new kind of acceleration technology for the AFE framework for BLT, using the graphics processing unit (GPU). Besides improving processing speed, GPU technology can strike a balance between cost and performance. CUBLAS and CULA are two important and powerful libraries for programming on NVIDIA GPUs. With the help of CUBLAS and CULA, it is easy to code on NVIDIA GPUs, and there is no need to worry about the details of the hardware environment of a specific GPU. The numerical experiments are designed to show the necessity, effect and application of the proposed CUBLAS and CULA based GPU acceleration. From the results of the experiments, we conclude that the proposed CUBLAS and CULA based GPU acceleration method can greatly improve the processing speed of the AFE framework while striking a balance between cost and performance.

  10. GPU-based parallel algorithm for blind image restoration using midfrequency-based methods

    Xie, Lang; Luo, Yi-han; Bao, Qi-liang

    2013-08-01

    GPU-based general-purpose computing is a new branch of modern parallel computing, so the study of parallel algorithms specially designed for GPU hardware architecture is of great significance. In order to solve the problem of high computational complexity and poor real-time performance in blind image restoration, the midfrequency-based algorithm for blind image restoration was analyzed and improved in this paper. Furthermore, a midfrequency-based filtering method is used to restore the image with hardly any recursion or iteration. Combining the algorithm with data intensiveness, data-parallel computing and the GPU execution model of single instruction, multiple threads, a new parallel midfrequency-based algorithm for blind image restoration is proposed in this paper, which is suitable for stream computing on the GPU. In this algorithm, the GPU is utilized to accelerate the estimation of class-G point spread functions and midfrequency-based filtering. Aiming at better management of the GPU threads, the threads in a grid are scheduled according to the decomposition of the filtering data in the frequency domain after the optimization of data access and the communication between the host and the device. The kernel parallelism structure is determined by the decomposition of the filtering data to ensure the transmission rate to get around the memory bandwidth limitation. The results show that, with the new algorithm, the operational speed is significantly increased and the real-time performance of image restoration is effectively improved, especially for high-resolution images.

  11. JMorph: Software for performing rapid morphometric measurements on digital images of fossil assemblages

    Lelièvre, Peter G.; Grey, Melissa

    2017-08-01

    Quantitative morphometric analyses of form are widely used in palaeontology, especially for taxonomic and evolutionary research. These analyses can involve several measurements performed on hundreds or even thousands of samples. Performing measurements of size and shape on large assemblages of macro- or microfossil samples is generally infeasible or impossible with traditional instruments such as vernier calipers. Instead, digital image processing software is required to perform measurements via suitable digital images of samples. Many software packages exist for morphometric analyses but there is not much available for the integral stage of data collection, particularly for the measurement of the outlines of samples. Some software exists to automatically detect the outline of a fossil sample from a digital image. However, automatic outline detection methods may perform inadequately when samples have incomplete outlines or images contain poor contrast between the sample and staging background. Hence, a manual digitization approach may be the only option. We are not aware of any software packages that are designed specifically for efficient digital measurement of fossil assemblages with numerous samples, especially for the purposes of manual outline analysis. Throughout several previous studies, we have developed a new software tool, JMorph, that is custom-built for that task. JMorph provides the means to perform many different types of measurements, which we describe in this manuscript. We focus on JMorph's ability to rapidly and accurately digitize the outlines of fossils. JMorph is freely available from the authors.

  12. Enhancing professionalism at GPU Nuclear

    Coe, R.P.

    1992-01-01

    Late in 1988, GPU Nuclear embarked on a major program aimed at enhancing professionalism at its Oyster Creek and Three Mile Island nuclear generating stations. The program was also to include its corporate headquarters in Parsippany, New Jersey. The overall program was to take several directions, including on-site degree programs, a sabbatical leave-type program for personnel to finish college degrees, advanced technical training for licensed staff, career progression for senior reactor operators, and expanded teamwork and leadership training for control room crew. The largest portion of this initiative was the development and delivery of professionalism training to the nearly 2,000 people at both nuclear generating sites

  13. NAVIER-STOKES ON GPU

    ALEX LAIER BORDIGNON

    2006-01-01

    In this work, we show how to simulate a fluid in two dimensions in a domain with arbitrary boundaries. Our work is based on the stable fluids scheme developed by Jos Stam. The implementation is done on the GPU (Graphics Processing Unit), allowing interaction with the fluid at interactive speeds. We use the Cg (C for Graphics) language, developed by NVidia. Our main contributions are the treatment of multiple boundaries, the...

  14. Supporting Performance Isolation in Software as a Service Systems with Rich Clients

    Oral, A.; Tekinerdogan, B.

    2015-01-01

    In a non-isolated Software as a Service (SaaS) system, different clients can freely use the resources of the SaaS. Hereby, disruptive tenants who exceed their limits can easily cause degradation of performance of the provided services for other tenants. To ensure performance demands of the multiple

  15. GPU-accelerated simulations of isolated black holes

    Lewis, Adam G. M.; Pfeiffer, Harald P.

    2018-05-01

    We present a port of the numerical relativity code SpEC which is capable of running on NVIDIA GPUs. Since this code must be maintained in parallel with SpEC itself, a primary design consideration is to perform as few explicit code changes as possible. We therefore rely on a hierarchy of automated porting strategies. At the highest level we use TLoops, a C++ library of our design, to automatically emit CUDA code equivalent to tensorial expressions written into C++ source using a syntax similar to analytic calculation. Next, we trace out and cache explicit matrix representations of the numerous linear transformations in the SpEC code, which allows these to be performed on the GPU using pre-existing matrix-multiplication libraries. We port the few remaining important modules by hand. In this paper we detail the specifics of our port, and present benchmarks of it simulating isolated black hole spacetimes on several generations of NVIDIA GPU.

  16. A Kepler Workflow Tool for Reproducible AMBER GPU Molecular Dynamics.

    Purawat, Shweta; Ieong, Pek U; Malmstrom, Robert D; Chan, Garrett J; Yeung, Alan K; Walker, Ross C; Altintas, Ilkay; Amaro, Rommie E

    2017-06-20

    With the drive toward high throughput molecular dynamics (MD) simulations involving ever-greater numbers of simulation replicates run for longer, biologically relevant timescales (microseconds), the need for improved computational methods that facilitate fully automated MD workflows gains more importance. Here we report the development of an automated workflow tool to perform AMBER GPU MD simulations. Our workflow tool capitalizes on the capabilities of the Kepler platform to deliver a flexible, intuitive, and user-friendly environment and the AMBER GPU code for a robust and high-performance simulation engine. Additionally, the workflow tool reduces user input time by automating repetitive processes and facilitates access to GPU clusters, whose high-performance processing power makes simulations of large numerical scale possible. The presented workflow tool facilitates the management and deployment of large sets of MD simulations on heterogeneous computing resources. The workflow tool also performs systematic analysis on the simulation outputs and enhances simulation reproducibility, execution scalability, and MD method development including benchmarking and validation. Copyright © 2017 Biophysical Society. Published by Elsevier Inc. All rights reserved.

  17. Software development tools using GPGPU potentialities

    Dudnik, V.A.; Kudryavtsev, V.I.; Sereda, T.M.; Us, S.A.; Shestakov, M.V.

    2011-01-01

    The paper deals with the potential of various up-to-date software development tools for making use of graphics processor (GPU) parallel computing resources. Examples are given to illustrate the use of present-day software tools for the development of applications and the implementation of algorithms for scientific-technical calculations performed by GPGPU. The paper presents some classes of hard mathematical problems of scientific-technical calculations for which the GPGPU can be efficiently used. To reduce the development time of calculation programs that use GPGPU capabilities, various dedicated programming systems and problem-oriented subroutine libraries are recommended. Performance parameters when solving the problems with and without the use of GPGPU capabilities are compared.

  18. Integrative multicellular biological modeling: a case study of 3D epidermal development using GPU algorithms

    Christley Scott

    2010-08-01

    Background: Simulation of sophisticated biological models requires considerable computational power. These models typically integrate together numerous biological phenomena such as spatially-explicit heterogeneous cells, cell-cell interactions, cell-environment interactions and intracellular gene networks. The recent advent of programming for graphical processing units (GPU) opens up the possibility of developing more integrative, detailed and predictive biological models while at the same time decreasing the computational cost to simulate those models. Results: We construct a 3D model of epidermal development and provide a set of GPU algorithms that executes significantly faster than sequential central processing unit (CPU) code. We provide a parallel implementation of the subcellular element method for individual cells residing in a lattice-free spatial environment. Each cell in our epidermal model includes an internal gene network, which integrates cellular interaction of Notch signaling together with environmental interaction of basement membrane adhesion, to specify cellular state and behaviors such as growth and division. We take a pedagogical approach to describing how modeling methods are efficiently implemented on the GPU including memory layout of data structures and functional decomposition. We discuss various programmatic issues and provide a set of design guidelines for GPU programming that are instructive to avoid common pitfalls as well as to extract performance from the GPU architecture. Conclusions: We demonstrate that GPU algorithms represent a significant technological advance for the simulation of complex biological models. We further demonstrate with our epidermal model that the integration of multiple complex modeling methods for heterogeneous multicellular biological processes is both feasible and computationally tractable using this new technology. We hope that the provided algorithms and source code will be a

  19. Parallel-hierarchical processing and classification of laser beam profile images based on the GPU-oriented architecture

    Yarovyi, Andrii A.; Timchenko, Leonid I.; Kozhemiako, Volodymyr P.; Kokriatskaia, Nataliya I.; Hamdi, Rami R.; Savchuk, Tamara O.; Kulyk, Oleksandr O.; Surtel, Wojciech; Amirgaliyev, Yedilkhan; Kashaganova, Gulzhan

    2017-08-01

    The paper deals with the problem of insufficient performance of existing computing tools for large image processing, which do not meet the modern requirements posed by the resource-intensive computing tasks of laser beam profiling. The research concentrated on one of the profiling problems, namely, real-time processing of spot images of the laser beam profile. Development of a theory of parallel-hierarchic transformation made it possible to produce models for high-performance parallel-hierarchical processes, as well as algorithms and software for their implementation based on the GPU-oriented architecture using GPGPU technologies. The analyzed performance of the suggested computerized tools for processing and classification of laser beam profile images makes it possible to perform real-time processing of dynamic images of various sizes.

  20. Frameworks for Performing on Cloud Automated Software Testing Using Swarm Intelligence Algorithm: Brief Survey

    Mohammad Hossain

    2018-04-01

    This paper surveys cloud-based automated testing software that is able to perform black-box testing, white-box testing, and unit and integration testing as a whole. We discuss a few of the available automated software testing frameworks on the cloud. These frameworks are found to be more efficient and cost-effective because they execute test suites over a distributed cloud infrastructure. The effectiveness of one framework was attributed to a module that accepts manual test cases from users and prioritizes them accordingly. Software testing, in general, accounts for as much as 50% of the total effort of a software development project. To lessen this effort, one of the frameworks discussed in this paper uses swarm intelligence algorithms: the Ant Colony Algorithm for complete path coverage to minimize time, and Bee Colony Optimization (BCO) for regression testing to ensure backward compatibility.

  1. GPU-based large-scale visualization

    Hadwiger, Markus

    2013-11-19

    Recent advances in image and volume acquisition as well as computational advances in simulation have led to an explosion of the amount of data that must be visualized and analyzed. Modern techniques combine the parallel processing power of GPUs with out-of-core methods and data streaming to enable the interactive visualization of giga- and terabytes of image and volume data. A major enabler for interactivity is making both the computational and the visualization effort proportional to the amount of data that is actually visible on screen, decoupling it from the full data size. This leads to powerful display-aware multi-resolution techniques that enable the visualization of data of almost arbitrary size. The course consists of two major parts: An introductory part that progresses from fundamentals to modern techniques, and a more advanced part that discusses details of ray-guided volume rendering, novel data structures for display-aware visualization and processing, and the remote visualization of large online data collections. You will learn how to develop efficient GPU data structures and large-scale visualizations, implement out-of-core strategies and concepts such as virtual texturing that have only been employed recently, as well as how to use modern multi-resolution representations. These approaches reduce the GPU memory requirements of extremely large data to a working set size that fits into current GPUs. You will learn how to perform ray-casting of volume data of almost arbitrary size and how to render and process gigapixel images using scalable, display-aware techniques. We will describe custom virtual texturing architectures as well as recent hardware developments in this area. We will also describe client/server systems for distributed visualization, on-demand data processing and streaming, and remote visualization. We will describe implementations using OpenGL as well as CUDA, exploiting parallelism on GPUs combined with additional asynchronous
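
    The display-aware idea in this abstract, making work proportional to what is visible on screen rather than to the full data size, is often realized by picking a resolution level from a multi-resolution pyramid. The sketch below shows one simple, assumed selection rule (halve the resolution while more than one data sample would still map to each screen pixel). It is not taken from the course material, and the function and parameter names are hypothetical.

```python
import math


def select_mip_level(samples_across, pixels_across, num_levels):
    """Pick a pyramid level so roughly one data sample maps to one pixel.

    samples_across: data samples spanning the visible region at level 0.
    pixels_across:  screen pixels spanned by that same region.
    num_levels:     number of levels available; level k has 1/2**k of
                    the level-0 resolution along each axis.
    """
    if samples_across <= 0 or pixels_across <= 0 or num_levels <= 0:
        raise ValueError("all arguments must be positive")
    ratio = samples_across / pixels_across
    # Below one sample per pixel, the full-resolution level is needed.
    level = math.floor(math.log2(ratio)) if ratio > 1.0 else 0
    # Clamp to the coarsest level that actually exists.
    return min(level, num_levels - 1)


# Hypothetical case: 65536 samples shown across 1024 pixels.
print(select_mip_level(65536, 1024, 10))
```

    With a rule of this kind, the number of samples fetched per frame is bounded by the screen resolution, which is what keeps the GPU working set small for very large data.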

  2. GPU-Accelerated Real-Time Surveillance De-Weathering

    Pettersson, Niklas

    2013-01-01

    A fully automatic de-weathering system to increase the visibility/stability in surveillance applications during bad weather has been developed. Rain, snow and haze during daylight are handled in real-time performance with acceleration from CUDA implemented algorithms. Video from fixed cameras is processed on a PC with no need of special hardware except an NVidia GPU. The system does not use any background model and does not require any precalibration. Increase in contrast is obtained in all h...

  3. PIPER: Performance Insight for Programmers and Exascale Runtimes: Guiding the Development of the Exascale Software Stack

    Mellor-Crummey, John [Rice Univ., Houston, TX (United States)]

    2017-10-20

    The PIPER project set out to develop methodologies and software for measurement, analysis, attribution, and presentation of performance data for extreme-scale systems. Goals of the project were to support analysis of massive multi-scale parallelism, heterogeneous architectures, multi-faceted performance concerns, and to support both post-mortem performance analysis to identify program features that contribute to problematic performance and on-line performance analysis to drive adaptation. This final report summarizes the research and development activity at Rice University as part of the PIPER project. Producing a complete suite of performance tools for exascale platforms during the course of this project was impossible since both hardware and software for exascale systems are still a moving target. For that reason, the project focused broadly on the development of new techniques for measurement and analysis of performance on modern parallel architectures, enhancements to HPCToolkit’s software infrastructure to support our research goals or use on sophisticated applications, engaging developers of multithreaded runtimes to explore how support for tools should be integrated into their designs, engaging operating system developers with feature requests for enhanced monitoring support, engaging vendors with requests that they add hardware measurement capabilities and software interfaces needed by tools as they design new components of HPC platforms including processors, accelerators and networks, and finally collaborations with partners interested in using HPCToolkit to analyze and tune scalable parallel applications.

  4. A Hybrid CPU/GPU Pattern-Matching Algorithm for Deep Packet Inspection.

    Chun-Liang Lee

    Full Text Available The large quantities of data now being transferred via high-speed networks have made deep packet inspection indispensable for security purposes. Scalable and low-cost signature-based network intrusion detection systems have been developed for deep packet inspection for various software platforms. Traditional approaches that only involve central processing units (CPUs) are now considered inadequate in terms of inspection speed. Graphics processing units (GPUs) have superior parallel processing power, but transmission bottlenecks can reduce optimal GPU efficiency. In this paper we describe our proposal for a hybrid CPU/GPU pattern-matching algorithm (HPMA) that divides and distributes the packet-inspecting workload between a CPU and GPU. All packets are initially inspected by the CPU and filtered using a simple pre-filtering algorithm, and packets that might contain malicious content are sent to the GPU for further inspection. Test results indicate that in terms of random payload traffic, the matching speed of our proposed algorithm was 3.4 times and 2.7 times faster than those of the AC-CPU and AC-GPU algorithms, respectively. Further, HPMA achieved higher energy efficiency than the other tested algorithms.
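
    The two-stage division of labor described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the pre-filter design (looking for the first two bytes of any signature) and all function names are hypothetical, and the second stage, which the paper runs on a GPU, is ordinary Python here.

```python
# Hypothetical two-stage inspection: a cheap CPU pre-filter, then full matching.
# The abstract does not specify the pre-filter; the k-byte prefix test is assumed.

def build_prefilter(signatures, k=2):
    """Collect the first k bytes of every signature."""
    return {sig[:k] for sig in signatures}

def prefilter(payload, prefixes, k=2):
    """Return True if any k-byte window of the payload could start a signature."""
    return any(payload[i:i + k] in prefixes for i in range(len(payload) - k + 1))

def full_match(payload, signatures):
    """Stand-in for the GPU stage: exact search for every signature."""
    return [sig for sig in signatures if sig in payload]

def inspect(packets, signatures, k=2):
    prefixes = build_prefilter(signatures, k)
    # Stage 1 (CPU): keep only packets that might contain a signature.
    suspicious = [p for p in packets if prefilter(p, prefixes, k)]
    # Stage 2 (GPU in the paper): full pattern matching on the survivors.
    return {p: full_match(p, signatures) for p in suspicious}

signatures = [b"evil", b"worm"]
packets = [b"plain text", b"an evil payload", b"every day"]
result = inspect(packets, signatures)
```

    In this toy run the first packet never reaches the second stage, while the third passes the pre-filter as a false positive and is cleared by the full match. The pre-filter may over-report but must not drop a packet that really contains a signature.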

  5. A real-time spike sorting method based on the embedded GPU.

    Zelan Yang; Kedi Xu; Xiang Tian; Shaomin Zhang; Xiaoxiang Zheng

    2017-07-01

    Microelectrode arrays with hundreds of channels have been widely used to acquire neuron population signals in neuroscience studies. Online spike sorting is becoming one of the most important challenges for high-throughput neural signal acquisition systems. A graphics processing unit (GPU) with high parallel computing capability might provide an alternative solution for the increasing real-time computational demands of spike sorting. This study reports a method of real-time spike sorting using the Compute Unified Device Architecture (CUDA), implemented on an embedded GPU (NVIDIA JETSON Tegra K1, TK1). The sorting approach is based on principal component analysis (PCA) and K-means. By analyzing the parallelism of each process, the method was further optimized for the thread and memory model of the GPU. Our results showed that the GPU-based classifier on the TK1 is 37.92 times faster than the MATLAB-based classifier on a PC, while their accuracies were the same. The high-performance computing features of the embedded GPU demonstrated in our studies suggest that embedded GPUs provide a promising platform for real-time neural signal processing.
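
    A minimal sketch of the classification step implied by a PCA plus K-means sorter is given below. It assumes the principal components and cluster centroids have already been computed; the names and toy numbers are hypothetical and are not taken from the paper. Each spike is processed independently, which is the property that makes this step easy to parallelize.

```python
# Classify spike waveforms by projecting onto precomputed principal
# components and picking the nearest K-means centroid (all values are toy data).

def project(waveform, mean, components):
    """Project a mean-centered waveform onto each principal component."""
    centered = [w - m for w, m in zip(waveform, mean)]
    return [sum(c * x for c, x in zip(comp, centered)) for comp in components]

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to the point (squared distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda k: sq_dist(point, centroids[k]))

mean = [0.0, 0.0, 0.0, 0.0]
components = [[0.5, 0.5, 0.5, 0.5], [0.5, -0.5, 0.5, -0.5]]  # orthonormal rows
centroids = [[2.0, 0.0], [0.0, 2.0]]
spikes = [[1.0, 1.0, 1.0, 1.0], [1.0, -1.0, 1.0, -1.0]]

# Every spike is independent of the others, so this loop maps to one thread per spike.
labels = [nearest_centroid(project(s, mean, components), centroids) for s in spikes]
```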

  6. An SDR-Based Real-Time Testbed for GNSS Adaptive Array Anti-Jamming Algorithms Accelerated by GPU

    Hailong Xu

    2016-03-01

    Full Text Available Nowadays, software-defined radio (SDR) has become a common approach to evaluate new algorithms. However, in the field of Global Navigation Satellite System (GNSS) adaptive array anti-jamming, previous work has been limited due to the high computational power demanded by adaptive algorithms, and often lacks flexibility and configurability. In this paper, the design and implementation of an SDR-based real-time testbed for GNSS adaptive array anti-jamming accelerated by a Graphics Processing Unit (GPU) are documented. This testbed highlights itself as a feature-rich and extendible platform with great flexibility and configurability, as well as high computational performance. Both Space-Time Adaptive Processing (STAP) and Space-Frequency Adaptive Processing (SFAP) are implemented with a wide range of parameters. Raw data from as many as eight antenna elements can be processed in real-time in either an adaptive nulling or beamforming mode. To fully take advantage of the parallelism resource provided by the GPU, a batched method in programming is proposed. Tests and experiments are conducted to evaluate both the computational and anti-jamming performance. This platform can be used for research and prototyping, as well as a real product in certain applications.

  7. Development of GPU Based Parallel Computing Module for Solving Pressure Equation in the CUPID Component Thermo-Fluid Analysis Code

    Lee, Jin Pyo; Joo, Han Gyu

    2010-01-01

    In the thermo-fluid analysis code named CUPID, the linear system of pressure equations must be solved in each iteration step. The time for repeatedly solving the linear system can be quite significant because large sparse matrices of rank more than 50,000 are involved and the diagonal dominance of the system hardly holds. Therefore parallelization of the linear system solver is essential to reduce the computing time. Meanwhile, Graphics Processing Units (GPUs) have been developed as highly parallel, multi-core processors to meet the global demand for high quality 3D graphics. If a suitable interface is provided, parallelization using GPUs becomes available to engineering computing. NVIDIA provides a Software Development Kit (SDK) named CUDA (Compute Unified Device Architecture) to code developers so that they can manage GPUs for parallelization using the C language. In this research, we implement parallel routines for the linear system solver using CUDA, and examine the performance of the parallelization. In the next section, we will describe the method of CUDA parallelization for the CUPID code, and then the performance of the CUDA parallelization will be discussed.
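
    The remark about diagonal dominance can be made concrete with a small check. The sketch below tests weak row diagonal dominance, |a_ii| >= sum over j != i of |a_ij|, on a sparse matrix stored as one dictionary per row. The storage layout and the function name are assumptions made for illustration and are unrelated to how CUPID stores its matrices.

```python
# Weak row diagonal dominance test for a sparse matrix.
# rows[i] maps column index -> value for the nonzero entries of row i (assumed layout).

def is_diagonally_dominant(rows):
    for i, row in enumerate(rows):
        off_diagonal = sum(abs(v) for j, v in row.items() if j != i)
        if abs(row.get(i, 0.0)) < off_diagonal:
            return False  # row i violates |a_ii| >= sum of |a_ij|, j != i
    return True

# A 1-D Poisson-like matrix is diagonally dominant ...
dominant = [{0: 4.0, 1: -1.0}, {0: -1.0, 1: 4.0, 2: -1.0}, {1: -1.0, 2: 4.0}]
# ... while this one is not.
not_dominant = [{0: 1.0, 1: 2.0}, {0: 2.0, 1: 1.0}]

checks = (is_diagonally_dominant(dominant), is_diagonally_dominant(not_dominant))
```

    When the test fails, simple stationary iterations such as Jacobi are not guaranteed to converge, which is one reason the choice of solver matters for systems like this.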

  8. Enhancing professionalism at GPU nuclear

    Coe, R.P.; Landy, F.J.

    1991-01-01

    Late in 1988, GPU Nuclear embarked on a major program aimed at enhancing Professionalism at its Oyster Creek and Three Mile Island Nuclear Generating Stations. The program was also to include its Corporate Headquarters in Parsippany, New Jersey. The overall program was to take several directions which included on-site degree programs, a sabbatical leave-type program for personnel to finish college degrees, advanced technical training for licensed staff, career progression for SROs and expanded teamwork and leadership training for control room crews. The largest portion of this initiative was the development and delivery of professionalism training to the nearly two thousand people at both sites. Three primary philosophies guided the development of the program. Employees as Experts: First, GPU Nuclear employees were considered to be the most valuable source of information for designing a Professionalism program because it is these individuals who are sensitive to the issues encountered in the workplace. Realism: The second philosophy guiding this effort was that the program must be grounded in real life challenges that employees face and must address. Active Learning: The third guiding philosophy was that, in order to have any real impact on the way employees think about professionalism, the program must utilize active rather than passive learning techniques

  9. Risk management at GPU Nuclear

    Long, R.L.

    1991-01-01

    This paper reports on GPU Nuclear. Among other goals, it established the independence of key safety functions as highlighted by the lessons learned from the accident. In particular, an independent Nuclear Assurance Division was established which includes Quality Assurance, Training and Education, Emergency Preparedness, and Nuclear Safety Assessment. The latter consisted of corporate and site independent-safety-review groups. As the GPU Nuclear organization matured, a mid-1987 reorganization created an even more focused Planning and Nuclear Safety Division bringing together Nuclear Safety Assessment with Licensing and Regulatory Affairs and Risk Management. The Risk Management Group (RMG), which began its work in fall 1987, was formed to develop a framework for proactive identification, evaluation, and cost-effective reduction and management of risks of all types. The RMG set out to learn as much as possible about risks and their management in nuclear and other high-technology industries. This began with a thorough literature search. It progressed to interviews with individuals and organizations which have demonstrated innovative ideas, experience, and reputations for safe and reliable operation

  10. Analysing sensory panel performance in a proficiency test using the PanelCheck software

    Tomic, O.; Luciano, G.; Nilsen, A.

    2010-01-01

    Check software, a workflow is proposed that guides the user through the data analysis process. This allows practitioners and non-statisticians to get an overview over panel performances in a rapid manner without the need to be familiar with details on the statistical methods. Visualisation of data analysis...... results plays an important role as this provides a time saving and efficient way of screening and investigating sensory panel performances. Most of the statistical methods used in this paper are available in the open source software PanelCheck, which may be downloaded and used for free....

  11. The gputools package enables GPU computing in R.

    Buckner, Joshua; Wilson, Justin; Seligman, Mark; Athey, Brian; Watson, Stanley; Meng, Fan

    2010-01-01

    By default, the R statistical environment does not make use of parallelism. Researchers may resort to expensive solutions such as cluster hardware for large analysis tasks. Graphics processing units (GPUs) provide an inexpensive and computationally powerful alternative. Using R and the CUDA toolkit from Nvidia, we have implemented several functions commonly used in microarray gene expression analysis for GPU-equipped computers. R users can take advantage of the better performance provided by an Nvidia GPU. The package is available from CRAN, the R project's repository of packages, at http://cran.r-project.org/web/packages/gputools. More information about our gputools R package is available at http://brainarray.mbni.med.umich.edu/brainarray/Rgpgpu

  12. SU-E-J-60: Efficient Monte Carlo Dose Calculation On CPU-GPU Heterogeneous Systems

    Xiao, K; Chen, D. Z; Hu, X. S [University of Notre Dame, Notre Dame, IN (United States)]; Zhou, B [Altera Corp., San Jose, CA (United States)]

    2014-06-01

    Purpose: It is well-known that the performance of GPU-based Monte Carlo dose calculation implementations is bounded by memory bandwidth. One major cause of this bottleneck is the random memory writing patterns in dose deposition, which leads to several memory efficiency issues on GPU such as un-coalesced writing and atomic operations. We propose a new method to alleviate such issues on CPU-GPU heterogeneous systems, which achieves overall performance improvement for Monte Carlo dose calculation. Methods: Dose deposition is to accumulate dose into the voxels of a dose volume along the trajectories of radiation rays. Our idea is to partition this procedure into the following three steps, which are fine-tuned for CPU or GPU: (1) each GPU thread writes dose results with location information to a buffer on GPU memory, which achieves fully-coalesced and atomic-free memory transactions; (2) the dose results in the buffer are transferred to CPU memory; (3) the dose volume is constructed from the dose buffer on CPU. We organize the processing of all radiation rays into streams. Since the steps within a stream use different hardware resources (i.e., GPU, DMA, CPU), we can overlap the execution of these steps for different streams by pipelining. Results: We evaluated our method using a Monte Carlo Convolution Superposition (MCCS) program and tested our implementation for various clinical cases on a heterogeneous system containing an Intel i7 quad-core CPU and an NVIDIA TITAN GPU. Comparing with a straightforward MCCS implementation on the same system (using both CPU and GPU for radiation ray tracing), our method gained 2-5X speedup without losing dose calculation accuracy. Conclusion: The results show that our new method improves the effective memory bandwidth and overall performance for MCCS on the CPU-GPU systems. Our proposed method can also be applied to accelerate other Monte Carlo dose calculation approaches. 
    This research was supported in part by NSF under Grants CCF
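
    The three-step dose deposition scheme in the Methods paragraph of this record can be sketched as below. This is a serial illustration under assumptions: every name is hypothetical, the "GPU" and "transfer" stages are plain Python stand-ins, and the overlap of stages across streams by pipelining is not modeled.

```python
# Sketch of the buffer-then-accumulate dose deposition scheme.

def gpu_stage(deposits_per_thread):
    """Step 1: each 'thread' appends (voxel_index, dose) records to its own
    region of a flat buffer, so no two threads write to the same slot and no
    atomic additions are needed."""
    buffer = []
    for thread_records in deposits_per_thread:
        buffer.extend(thread_records)
    return buffer

def transfer_to_cpu(buffer):
    """Step 2: stand-in for the device-to-host copy of the buffer."""
    return list(buffer)

def cpu_stage(buffer, n_voxels):
    """Step 3: build the dose volume by accumulating the buffered records.
    The random-access writes happen here, on the CPU."""
    volume = [0.0] * n_voxels
    for voxel, dose in buffer:
        volume[voxel] += dose
    return volume

# Three threads depositing into a four-voxel volume (toy data).
deposits = [[(0, 0.5), (2, 1.0)], [(2, 0.25)], [(1, 2.0), (0, 0.5)]]
volume = cpu_stage(transfer_to_cpu(gpu_stage(deposits)), n_voxels=4)
```

    The point of the design is that the scattered writes into the dose volume, which would be un-coalesced and atomic on the GPU, are moved to the CPU, while the GPU only performs sequential writes into the buffer.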

  13. GPU accelerated generation of digitally reconstructed radiographs for 2-D/3-D image registration.

    Dorgham, Osama M; Laycock, Stephen D; Fisher, Mark H

    2012-09-01

    Recent advances in programming languages for graphics processing units (GPUs) provide developers with a convenient way of implementing applications which can be executed on the CPU and GPU interchangeably. GPUs are becoming relatively cheap, powerful, and widely available hardware components, which can be used to perform intensive calculations. The last decade of hardware performance developments shows that GPU-based computation is progressing significantly faster than CPU-based computation, particularly if one considers the execution of highly parallelisable algorithms. Future predictions illustrate that this trend is likely to continue. In this paper, we introduce a way of accelerating 2-D/3-D image registration by developing a hybrid system which executes on the CPU and utilizes the GPU for parallelizing the generation of digitally reconstructed radiographs (DRRs). Based on the advancements of the GPU over the CPU, it is timely to exploit the benefits of many-core GPU technology by developing algorithms for DRR generation. Although some previous work has investigated the rendering of DRRs using the GPU, this paper investigates approximations which reduce the computational overhead while still maintaining a quality consistent with that needed for 2-D/3-D registration with sufficient accuracy to be clinically acceptable in certain applications of radiation oncology. Furthermore, by comparing implementations of 2-D/3-D registration on the CPU and GPU, we investigate current performance and propose an optimal framework for PC implementations addressing the rigid registration problem. Using this framework, we are able to render DRR images from a 256×256×133 CT volume in ~24 ms using an NVidia GeForce 8800 GTX and in ~2 ms using NVidia GeForce GTX 580. In addition to applications requiring fast automatic patient setup, these levels of performance suggest image-guided radiation therapy at video frame rates is technically feasible using relatively low cost PC

  14. Study on GPU Computing for SCOPE2 with CUDA

    Kodama, Yasuhiro; Tatsumi, Masahiro; Ohoka, Yasunori

    2011-01-01

    For improving safety and cost effectiveness of nuclear power plants, a core calculation code SCOPE2 has been developed, which adopts detailed calculation models such as the multi-group nodal SP3 transport calculation method in three-dimensional pin-by-pin geometry to achieve high predictability. However, it is difficult to apply the code to loading pattern optimizations since it requires much longer computation time than that of codes based on the nodal diffusion method which is widely used in core design calculations. In this study, we studied the possibility of accelerating SCOPE2 with GPU computing, which has been recognized as one of the most promising directions in high performance computing. In the previous study with an experimental programming framework, it required much effort to convert the algorithms to ones which fit GPU computation. It was found, however, that this conversion was tremendously difficult because of the complexity of algorithms and restrictions in implementation. In this study, to overcome this complexity, we utilized the CUDA programming environment provided by NVIDIA, a versatile and flexible extension to the C/C++ languages. It was confirmed that we could enjoy high performance without degradation of maintainability through test implementation of GPU kernels for neutron diffusion/simplified P3 equation solvers. (author)

  15. High-speed optical coherence tomography signal processing on GPU

    Li Xiqi; Shi Guohua; Zhang Yudong

    2011-01-01

    The signal processing speed of spectral domain optical coherence tomography (SD-OCT) has become a bottleneck in many medical applications. Recently, a time-domain interpolation method was proposed. This method achieves not only a better signal-to-noise ratio (SNR) but also a shorter signal processing time for SD-OCT than the widely used zero-padding interpolation method. Furthermore, the re-sampled data is obtained by convolving the acquired data with the coefficients in the time domain. Thus, many interpolations can be performed concurrently, which makes this interpolation method suitable for parallel computing. Ultra-high-speed optical coherence tomography signal processing can be realized by using a graphics processing unit (GPU) with the Compute Unified Device Architecture (CUDA). This paper introduces the signal processing steps of SD-OCT on the GPU. An experiment is performed to acquire a frame of SD-OCT data (400 A-lines × 2048 pixels per A-line) and process the data in real time on the GPU. The results show that processing finishes in 6.208 milliseconds, which is 37 times faster than on a central processing unit (CPU).

  16. Maximizing Use of Extension Beef Cattle Benchmarks Data Derived from Cow Herd Appraisal Performance Software

    Ramsay, Jennifer M.; Hanna, Lauren L. Hulsman; Ringwall, Kris A.

    2016-01-01

    One goal of Extension is to provide practical information that makes a difference to producers. Cow Herd Appraisal Performance Software (CHAPS) has provided beef producers with production benchmarks for 30 years, creating a large historical data set. Many such large data sets contain useful information but are underutilized. Our goal was to create…

  17. Multi-GPU implementation of a VMAT treatment plan optimization algorithm

    Tian, Zhen; Folkerts, Michael; Tan, Jun; Jia, Xun; Jiang, Steve B.; Peng, Fei

    2015-01-01

    Purpose: Volumetric modulated arc therapy (VMAT) optimization is a computationally challenging problem due to its large data size, high degrees of freedom, and many hardware constraints. High-performance graphics processing units (GPUs) have been used to speed up the computations. However, GPU’s relatively small memory size cannot handle cases with a large dose-deposition coefficient (DDC) matrix in cases of, e.g., those with a large target size, multiple targets, multiple arcs, and/or small beamlet size. The main purpose of this paper is to report an implementation of a column-generation-based VMAT algorithm, previously developed in the authors’ group, on a multi-GPU platform to solve the memory limitation problem. While the column-generation-based VMAT algorithm has been previously developed, the GPU implementation details have not been reported. Hence, another purpose is to present detailed techniques employed for GPU implementation. The authors also would like to utilize this particular problem as an example problem to study the feasibility of using a multi-GPU platform to solve large-scale problems in medical physics. Methods: The column-generation approach generates VMAT apertures sequentially by solving a pricing problem (PP) and a master problem (MP) iteratively. In the authors’ method, the sparse DDC matrix is first stored on a CPU in coordinate list format (COO). On the GPU side, this matrix is split into four submatrices according to beam angles, which are stored on four GPUs in compressed sparse row format. Computation of beamlet price, the first step in PP, is accomplished using multi-GPUs. A fast inter-GPU data transfer scheme is accomplished using peer-to-peer access. The remaining steps of PP and MP problems are implemented on CPU or a single GPU due to their modest problem scale and computational loads. Barzilai and Borwein algorithm with a subspace step scheme is adopted here to solve the MP problem. A head and neck (H and N) cancer case is
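
    The storage scheme in the Methods paragraph of this record (a COO matrix on the host, split by beam angle into per-GPU submatrices held in compressed sparse row format) can be sketched as follows. The sketch assumes that columns index beamlets and that each column carries a group id derived from its beam angle. That mapping, the function names, and the toy data are hypothetical, and the submatrices here stay in host memory rather than on four GPUs.

```python
# Split a COO matrix into per-group submatrices and convert each to CSR.

def coo_to_csr(n_rows, entries):
    """Convert (row, col, value) triples into CSR arrays (indptr, indices, data)."""
    indptr = [0] * (n_rows + 1)
    for r, _, _ in entries:
        indptr[r + 1] += 1              # count nonzeros per row
    for r in range(n_rows):
        indptr[r + 1] += indptr[r]      # prefix sum gives row start offsets
    indices = [0] * len(entries)
    data = [0.0] * len(entries)
    cursor = indptr[:-1].copy()         # next free slot in each row
    for r, c, v in entries:
        indices[cursor[r]] = c
        data[cursor[r]] = v
        cursor[r] += 1
    return indptr, indices, data

def split_by_group(n_rows, coo, group_of_column, n_groups):
    """Route each entry to the submatrix of its column's group, then build CSR.
    Column indices are kept global rather than renumbered per submatrix."""
    parts = [[] for _ in range(n_groups)]
    for r, c, v in coo:
        parts[group_of_column[c]].append((r, c, v))
    return [coo_to_csr(n_rows, part) for part in parts]

coo = [(0, 0, 1.0), (0, 2, 2.0), (1, 1, 3.0), (2, 3, 4.0), (2, 0, 5.0)]
group_of_column = [0, 0, 1, 1]          # two groups instead of the paper's four
parts = split_by_group(3, coo, group_of_column, n_groups=2)
```

    CSR suits the per-GPU products because each row's nonzeros are contiguous, so a row-wise product can read them with a single offset lookup.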

  18. Multi-GPU implementation of a VMAT treatment plan optimization algorithm

    Tian, Zhen, E-mail: Zhen.Tian@UTSouthwestern.edu; Folkerts, Michael; Tan, Jun; Jia, Xun, E-mail: Xun.Jia@UTSouthwestern.edu; Jiang, Steve B., E-mail: Steve.Jiang@UTSouthwestern.edu [Department of Radiation Oncology, University of Texas Southwestern Medical Center, Dallas, Texas 75390 (United States)]; Peng, Fei [Computer Science Department, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213 (United States)]

    2015-06-15

    Purpose: Volumetric modulated arc therapy (VMAT) optimization is a computationally challenging problem due to its large data size, high degrees of freedom, and many hardware constraints. High-performance graphics processing units (GPUs) have been used to speed up the computations. However, GPU’s relatively small memory size cannot handle cases with a large dose-deposition coefficient (DDC) matrix in cases of, e.g., those with a large target size, multiple targets, multiple arcs, and/or small beamlet size. The main purpose of this paper is to report an implementation of a column-generation-based VMAT algorithm, previously developed in the authors’ group, on a multi-GPU platform to solve the memory limitation problem. While the column-generation-based VMAT algorithm has been previously developed, the GPU implementation details have not been reported. Hence, another purpose is to present detailed techniques employed for GPU implementation. The authors also would like to utilize this particular problem as an example problem to study the feasibility of using a multi-GPU platform to solve large-scale problems in medical physics. Methods: The column-generation approach generates VMAT apertures sequentially by solving a pricing problem (PP) and a master problem (MP) iteratively. In the authors’ method, the sparse DDC matrix is first stored on a CPU in coordinate list format (COO). On the GPU side, this matrix is split into four submatrices according to beam angles, which are stored on four GPUs in compressed sparse row format. Computation of beamlet price, the first step in PP, is accomplished using multi-GPUs. A fast inter-GPU data transfer scheme is accomplished using peer-to-peer access. The remaining steps of PP and MP problems are implemented on CPU or a single GPU due to their modest problem scale and computational loads. Barzilai and Borwein algorithm with a subspace step scheme is adopted here to solve the MP problem. A head and neck (H and N) cancer case is

  19. Arioc: high-throughput read alignment with GPU-accelerated exploration of the seed-and-extend search space

    Richard Wilton

    2015-03-01

    Full Text Available When computing alignments of DNA sequences to a large genome, a key element in achieving high processing throughput is to prioritize locations in the genome where high-scoring mappings might be expected. We formulated this task as a series of list-processing operations that can be efficiently performed on graphics processing unit (GPU) hardware. We followed this approach in implementing a read aligner called Arioc that uses GPU-based parallel sort and reduction techniques to identify high-priority locations where potential alignments may be found. We then carried out a read-by-read comparison of Arioc’s reported alignments with the alignments found by several leading read aligners. With simulated reads, Arioc has comparable or better accuracy than the other read aligners we tested. With human sequencing reads, Arioc demonstrates significantly greater throughput than the other aligners we evaluated across a wide range of sensitivity settings. The Arioc software is available at https://github.com/RWilton/Arioc. It is released under a BSD open-source license.

  20. Distributed control software of high-performance control-loop algorithm

    Blanc, D

    1999-01-01

    The majority of industrial cooling and ventilation plants require the control of complex processes. All these processes are highly important for the operation of the machines. The stability and reliability of these processes are leading factors identifying the quality of the service provided. The control system architecture and software structure, as well, are required to have high dynamical performance and robust behaviour. The intelligent systems based on PID or RST controllers are used for their high level of stability and accuracy. The design and tuning of these complex controllers require the dynamic model of the plant to be known (generally obtained by identification) and the desired performance of the various control loops to be specified for achieving good performances. The concept of having a distributed control algorithm software provides full automation facilities with well-adapted functionality and good performances, giving methodology, means and tools to master the dynamic process optimization an...

  1. Representation of the Physiological Factors Contributing to Postflight Changes in Functional Performance Using Motion Analysis Software

    Parks, Kelsey

    2010-01-01

    Astronauts experience changes in multiple physiological systems due to exposure to the microgravity conditions of space flight. To understand how changes in physiological function influence functional performance, a testing procedure has been developed that evaluates both astronaut postflight functional performance and related physiological changes. Astronauts complete seven functional and physiological tests. The objective of this project is to use motion tracking and digitizing software to visually display the postflight decrement in the functional performance of the astronauts. The motion analysis software will be used to digitize astronaut data videos into stick figure videos to represent the astronauts as they perform the Functional Tasks Tests. This project will benefit NASA by allowing NASA scientists to present data of their neurological studies without revealing the identities of the astronauts.

  2. A Data Specification for Software Project Performance Measures: Results of a Collaboration on Performance Measurement

    2008-07-01

    cycle Evolution of a system, product, service, project or other human-made entity from conception through retirement [ ISO 12207 ]. Logical line of...012 [ ISO 1995] International Organization for Standardization. ISO /IEC 12207 :1995—Information technology— Software life cycle processes. http...definitions, authors were asked to use or align with already existing standards such as those available through ISO and IEEE when possible. Literature

  3. THEWASP library. Thermodynamic water and steam properties library in GPU

    Waintraub, M.; Lapa, C.M.F.; Mol, A.C.A.; Heimlich, A.

    2011-01-01

    In this paper we present a new library for thermodynamic evaluation of water properties, THEWASP. This library consists of C++ and CUDA based programs used to accelerate function evaluation using GPUs and GPU clusters. Global optimization problems need thousands of evaluations of the objective function to find the global optimum, implying several days of expensive processing. This problem motivates us to seek a way to speed up our code, as well as to use MPI on Beowulf clusters, which, however, increases the cost in terms of electricity, air conditioning and others. GPU based programming can accelerate the implementation up to 100 times and help increase the number of evaluations in global optimization problems using, for example, the PSO or DE algorithms. THEWASP is based on water-steam formulations published by the International Association for the Properties of Water and Steam, Lucerne - Switzerland, and provides several temperature and pressure function evaluations, such as specific heat, specific enthalpy, specific entropy and also some inverse maps. In this study we evaluated the gain in speed and performance and compared it with a CPU based processing library. (author)

  4. Ultra-Fast Image Reconstruction of Tomosynthesis Mammography Using GPU

    Arefan D

    2015-06-01

    Full Text Available Digital Breast Tomosynthesis (DBT) is a technology that creates three dimensional (3D) images of breast tissue. Tomosynthesis mammography detects lesions that are not detectable with other imaging systems. If image reconstruction time is in the order of seconds, we can use Tomosynthesis systems to perform Tomosynthesis-guided interventional procedures. This research was designed to study an ultra-fast image reconstruction technique for Tomosynthesis Mammography systems using a Graphics Processing Unit (GPU). First, projections of Tomosynthesis mammography were simulated. To produce Tomosynthesis projections, a 3D breast phantom was designed from empirical data. It is based on MRI data in its natural form. Then, projections were created from the 3D breast phantom. The image reconstruction algorithm based on FBP was programmed in C++ in two ways, using the central processing unit (CPU) and the Graphics Processing Unit (GPU). The image reconstruction time was measured for both implementations (CPU and GPU).

  5. GPU acceleration of Dock6's Amber scoring computation.

    Yang, Hailong; Zhou, Qiongqiong; Li, Bo; Wang, Yongjian; Luan, Zhongzhi; Qian, Depei; Li, Hanlu

    2010-01-01

    Addressing the problem of virtual screening is a long-term goal in the drug discovery field, which, if properly solved, can significantly shorten the R&D cycle of new drugs. The scoring functionality that evaluates the fitness of the docking result is one of the major challenges in virtual screening. In general, scoring functionality in docking requires a large amount of floating-point calculations, which usually take several weeks or even months to finish. This time-consuming procedure is unacceptable, especially when a highly fatal and infectious virus such as SARS or H1N1 arises, which forces the scoring task to be done in a limited time. This paper presents how to leverage the computational power of the GPU to accelerate Dock6's (http://dock.compbio.ucsf.edu/DOCK_6/) Amber (J. Comput. Chem. 25: 1157-1174, 2004) scoring with the NVIDIA CUDA (NVIDIA Corporation Technical Staff, Compute Unified Device Architecture - Programming Guide, NVIDIA Corporation, 2008) (Compute Unified Device Architecture) platform. We also discuss many factors that greatly influence the performance after porting the Amber scoring to the GPU, including thread management, data transfer, and divergence hiding. Our experiments show that the GPU-accelerated Amber scoring achieves a 6.5× speedup with respect to the original version running on an AMD dual-core CPU for the same problem size. This acceleration makes the Amber scoring more competitive and efficient for large-scale virtual screening problems.

  6. Democratic population decisions result in robust policy-gradient learning: a parametric study with GPU simulations.

    Paul Richmond

    2011-05-01

    Full Text Available High performance computing on the Graphics Processing Unit (GPU) is an emerging field driven by the promise of high computational power at a low cost. However, GPU programming is a non-trivial task, and architectural limitations raise the question of whether investing effort in this direction is worthwhile. In this work, we use GPU programming to simulate a two-layer network of Integrate-and-Fire neurons with varying degrees of recurrent connectivity and investigate its ability to learn a simplified navigation task using a policy-gradient learning rule stemming from Reinforcement Learning. The purpose of this paper is twofold. First, we want to support the use of GPUs in the field of Computational Neuroscience. Second, using GPU computing power, we investigate the conditions under which the said architecture and learning rule demonstrate best performance. Our work indicates that networks featuring strong Mexican-Hat-shaped recurrent connections in the top layer, where decision making is governed by the formation of a stable activity bump in the neural population (a "non-democratic" mechanism), achieve mediocre learning results at best. In the absence of recurrent connections, where all neurons "vote" independently ("democratic") for a decision via population vector readout, the task is generally learned better and more robustly. Our study would have been extremely difficult on a desktop computer without the use of GPU programming. We present the routines developed for this purpose and show that a speed improvement of 5x up to 42x is provided versus optimised Python code. The higher speed is achieved when we exploit the parallelism of the GPU in the search of learning parameters. This suggests that efficient GPU programming can significantly reduce the time needed for simulating networks of spiking neurons, particularly when multiple parameter configurations are investigated.

  7. AN ENHANCED MODEL TO ESTIMATE EFFORT, PERFORMANCE AND COST OF THE SOFTWARE PROJECTS

    M. Pauline

    2013-04-01

    Full Text Available The authors propose a model that, in phase 1, captures the fundamentals of software metrics through three primitive software engineering metrics: person-months (PM), function points (FP), and lines of code (LOC). Phase 2 consists of the proposed function point, which is obtained by grouping the adjustment factors to simplify the adjustment process and to ensure more consistent adjustments. In the proposed method, fuzzy logic is used to quantify the quality of requirements and is added as one of the adjustment factors; this yields a fuzzy-based approach for the Enhanced General System Characteristics to estimate the effort of software projects using productivity. Phase 3 takes the calculated function point as input to the static single-variable models (i.e., Intermediate COCOMO and COCOMO II) for cost estimation. The authors tailor the cost factors in Intermediate COCOMO, and both the cost and scale factors in COCOMO II, to suit the individual development environment, which is very important for the accuracy of the cost estimates. The software performance indicators (project duration, schedule predictability, requirements completion ratio and post-release defect density) are also measured for the software projects in this work. A comparative study of effort, performance measurement and cost estimation for the software projects is carried out between the existing model and the authors' proposed work. The work thus analyzes the interactional process through which the estimation tasks were collectively accomplished.
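
    The cost-estimation step in phase 3 can be illustrated with a minimal sketch of the Intermediate COCOMO effort equation, effort = a × KLOC^b × EAF. The coefficients are Boehm's published Intermediate COCOMO values; the lines-of-code-per-function-point ratio and the function names are illustrative assumptions, not taken from the paper, whose tailored cost factors are not given in the abstract.

```python
# Intermediate COCOMO (Boehm, 1981): effort in person-months.
# (a, b) pairs are the published coefficients for the three project modes.
COEFFICIENTS = {
    "organic": (3.2, 1.05),
    "semi-detached": (3.0, 1.12),
    "embedded": (2.8, 1.20),
}


def function_points_to_kloc(function_points, loc_per_fp):
    """Convert function points to thousands of lines of code.

    loc_per_fp is language dependent; any value passed here is an
    assumption supplied by the caller, not a figure from the abstract.
    """
    return function_points * loc_per_fp / 1000.0


def intermediate_cocomo_effort(kloc, mode="organic", eaf=1.0):
    """Return estimated effort in person-months.

    eaf is the effort adjustment factor, the product of the cost-driver
    multipliers (1.0 means every driver is rated nominal).
    """
    a, b = COEFFICIENTS[mode]
    return a * kloc ** b * eaf
```

    For example, a hypothetical 200-function-point project at an assumed 50 LOC per function point gives 10 KLOC, and `intermediate_cocomo_effort(10.0)` evaluates the organic-mode equation with nominal cost drivers.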

  8. Clinical implementation of a GPU-based simplified Monte Carlo method for a treatment planning system of proton beam therapy

    Kohno, R; Hotta, K; Nishioka, S; Matsubara, K; Tansho, R; Suzuki, T

    2011-01-01

    We implemented the simplified Monte Carlo (SMC) method on graphics processing unit (GPU) architecture under the Compute Unified Device Architecture (CUDA) platform developed by NVIDIA. The GPU-based SMC was clinically applied for four patients with head and neck, lung, or prostate cancer. The results were compared to those obtained by a traditional CPU-based SMC with respect to the computation time and discrepancy. In the CPU- and GPU-based SMC calculations, the estimated mean statistical errors of the calculated doses in the planning target volume region were within 0.5% rms. The dose distributions calculated by the GPU- and CPU-based SMCs were similar, within statistical errors. The GPU-based SMC was 12.30–16.00 times faster than the CPU-based SMC. The computation time per beam arrangement using the GPU-based SMC for the clinical cases ranged from 9 to 67 s. The results demonstrate the successful application of the GPU-based SMC to clinical proton treatment planning. (note)

  9. Evaluation of Software Quality to Improve Application Performance Using Mc Call Model

    Inda D Lestantri

    2018-04-01

    Full Text Available Beyond its primary function of automation, software should add value by improving the performance of the organization. Before being deployed in an operational environment, software must pass a series of tests to ensure that it functions properly, meets user needs and is convenient to use. This test is performed on a web-based application, taking the e-SAP application as a test case. E-SAP is an application used by a university in Jakarta to monitor teaching and learning activities. To measure software quality, testing can be carried out on randomly selected users. The users sampled in this test are 18 to 25 years old and have an information technology background. The test was conducted on 30 respondents using the Mc Call model. The Mc Call testing model consists of 11 dimensions grouped into 3 categories. This paper describes the testing with reference to the product operation category, which includes 5 dimensions: correctness, usability, efficiency, reliability, and integrity. This paper discusses testing on each dimension to measure software quality as an effort to improve performance. The result of the research is that the e-SAP application has good quality, with a product operation value of 85.09%. This indicates that the e-SAP application deserves to be examined in the next stage in the operational environment.

  10. BROCCOLI: Software for Fast fMRI Analysis on Many-Core CPUs and GPUs

    Anders Eklund

    2014-03-01

    Full Text Available Analysis of functional magnetic resonance imaging (fMRI) data is becoming ever more computationally demanding as temporal and spatial resolutions improve, and large, publicly available data sets proliferate. Moreover, methodological improvements in the neuroimaging pipeline, such as non-linear spatial normalization, non-parametric permutation tests and Bayesian Markov Chain Monte Carlo approaches, can dramatically increase the computational burden. Despite these challenges, there do not yet exist any fMRI software packages which leverage inexpensive and powerful graphics processing units (GPUs) to perform these analyses. Here, we therefore present BROCCOLI, a free software package written in OpenCL (Open Computing Language) that can be used for parallel analysis of fMRI data on a large variety of hardware configurations. BROCCOLI has, for example, been tested with an Intel CPU, an Nvidia GPU and an AMD GPU. These tests show that parallel processing of fMRI data can lead to significantly faster analysis pipelines. This speedup can be achieved on relatively standard hardware, but further, dramatic speed improvements require only a modest investment in GPU hardware. BROCCOLI (running on a GPU) can perform non-linear spatial normalization to a 1 mm³ brain template in 4-6 seconds, and run a second level permutation test with 10,000 permutations in about a minute. These non-parametric tests are generally more robust than their parametric counterparts, and can also enable more sophisticated analyses by estimating complicated null distributions. Additionally, BROCCOLI includes support for Bayesian first-level fMRI analysis using a Gibbs sampler. The new software is freely available under GNU GPL3 and can be downloaded from github (https://github.com/wanderine/BROCCOLI/).
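
    The non-parametric permutation tests mentioned above estimate a null distribution by repeatedly relabelling the data. The following is a minimal single-threaded sketch of a generic two-group permutation test on the difference of means; it is not BROCCOLI's OpenCL implementation, and the function names and the choice of test statistic are assumptions made for illustration.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(group_a, group_b, n_permutations=10000, seed=0):
    """Two-sided permutation p-value for a difference in group means.

    Group labels are shuffled n_permutations times; the p-value is the
    fraction of shuffles whose absolute mean difference is at least as
    large as the observed one (with the usual +1 correction).
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        statistic = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if statistic >= observed:
            extreme += 1
    return (extreme + 1) / (n_permutations + 1)
```

    Each permutation is independent of the others, which is what makes this kind of test a natural fit for the parallel hardware described in the abstract.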

  11. cudaMap: a GPU accelerated program for gene expression connectivity mapping.

    McArt, Darragh G; Bankhead, Peter; Dunne, Philip D; Salto-Tellez, Manuel; Hamilton, Peter; Zhang, Shu-Dong

    2013-10-11

    Modern cancer research often involves large datasets and the use of sophisticated statistical techniques. Together these add a heavy computational load to the analysis, which is often coupled with issues surrounding data accessibility. Connectivity mapping is an advanced bioinformatic and computational technique dedicated to therapeutics discovery and drug re-purposing around differential gene expression analysis. On a normal desktop PC, it is common for the connectivity mapping task with a single gene signature to take > 2h to complete using sscMap, a popular Java application that runs on standard CPUs (Central Processing Units). Here, we describe new software, cudaMap, which has been implemented using CUDA C/C++ to harness the computational power of NVIDIA GPUs (Graphics Processing Units) to greatly reduce processing times for connectivity mapping. cudaMap can identify candidate therapeutics from the same signature in just over thirty seconds when using an NVIDIA Tesla C2050 GPU. Results from the analysis of multiple gene signatures, which would previously have taken several days, can now be obtained in as little as 10 minutes, greatly facilitating candidate therapeutics discovery with high throughput. We are able to demonstrate dramatic speed differentials between GPU assisted performance and CPU executions as the computational load increases for high accuracy evaluation of statistical significance. Emerging 'omics' technologies are constantly increasing the volume of data and information to be processed in all areas of biomedical research. Embracing the multicore functionality of GPUs represents a major avenue of local accelerated computing. cudaMap will make a strong contribution in the discovery of candidate therapeutics by enabling speedy execution of heavy duty connectivity mapping tasks, which are increasingly required in modern cancer research. cudaMap is open source and can be freely downloaded from http://purl.oclc.org/NET/cudaMap.

  12. Revisioning Theoretical Framework of Electronic Performance Support Systems (EPSS) within the Software Application Examples

    Dr. Servet BAYRAM

    2004-04-01

    Full Text Available Assoc. Prof. Dr. Servet BAYRAM, Computer Education & Instructional Technologies, Marmara University, Turkey. Abstract: EPSS provides electronic support to learners in achieving a performance objective, a feature which makes it universally and consistently available on demand at any time and in any place, regardless of situation, without unnecessary intermediaries involved in the process. The aim of this review is to develop a set of theoretical constructs that provide descriptive power for explaining EPSS and its roots and features within software application examples (i.e., Microsoft SharePoint Server "v2.0" Beta 2, IBM Lotus Notes 6 & Domino 6, Oracle 9i Collaboration Suite, and Mac OS X v10.2). From the educational and training point of view, the paper visualizes a pentagon model for the interrelated domains of the theoretical framework of EPSS. These domains are: learning theories, information processing theories, developmental theories, instructional theories, and acceptance theories. This descriptive framework explains which outcomes occur under given theoretical conditions for a given EPSS model within the software examples. It summarizes some of the theoretical concepts supporting EPSS-related features and explains how such concepts share the same features with the example software programs in education and job training.

  13. Cucheb: A GPU implementation of the filtered Lanczos procedure

    Aurentz, Jared L.; Kalantzis, Vassilis; Saad, Yousef

    2017-11-01

    This paper describes the software package Cucheb, a GPU implementation of the filtered Lanczos procedure for the solution of large sparse symmetric eigenvalue problems. The filtered Lanczos procedure uses a carefully chosen polynomial spectral transformation to accelerate convergence of the Lanczos method when computing eigenvalues within a desired interval. This method has proven particularly effective for eigenvalue problems that arise in electronic structure calculations and density functional theory. We compare our implementation against an equivalent CPU implementation and show that using the GPU can reduce the computation time by more than a factor of 10. Program Summary Program title: Cucheb Program Files doi:http://dx.doi.org/10.17632/rjr9tzchmh.1 Licensing provisions: MIT Programming language: CUDA C/C++ Nature of problem: Electronic structure calculations require the computation of all eigenvalue-eigenvector pairs of a symmetric matrix that lie inside a user-defined real interval. Solution method: To compute all the eigenvalues within a given interval a polynomial spectral transformation is constructed that maps the desired eigenvalues of the original matrix to the exterior of the spectrum of the transformed matrix. The Lanczos method is then used to compute the desired eigenvectors of the transformed matrix, which are then used to recover the desired eigenvalues of the original matrix. The bulk of the operations are executed in parallel using a graphics processing unit (GPU). Runtime: Variable, depending on the number of eigenvalues sought and the size and sparsity of the matrix. Additional comments: Cucheb is compatible with CUDA Toolkit v7.0 or greater.
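
    The polynomial spectral transformation described above can be illustrated on scalars. The sketch below builds a truncated Chebyshev expansion of the indicator function of an interval [a, b] inside [-1, 1] and evaluates it with the three-term recurrence. It is an assumption-laden illustration rather than Cucheb's code: the function names are hypothetical, no damping of the series is applied, and in a filtered Lanczos procedure the same recurrence would be applied through matrix-vector products after scaling the spectrum to [-1, 1].

```python
import math


def window_coefficients(a, b, degree):
    """Chebyshev coefficients of the indicator function of [a, b],
    with -1 <= a < b <= 1, truncated at the given degree."""
    theta_a, theta_b = math.acos(a), math.acos(b)
    coeffs = [(theta_a - theta_b) / math.pi]
    for k in range(1, degree + 1):
        coeffs.append(
            2.0 * (math.sin(k * theta_a) - math.sin(k * theta_b)) / (k * math.pi)
        )
    return coeffs


def evaluate_filter(coeffs, x):
    """Evaluate sum_k coeffs[k] * T_k(x) using T_{k+1} = 2x T_k - T_{k-1}."""
    t_prev, t_curr = 1.0, x  # T_0(x) and T_1(x)
    total = coeffs[0]
    for c in coeffs[1:]:
        total += c * t_curr
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
    return total
```

    Eigenvalues inside the interval are mapped to values near 1 and the rest to values near 0, so the wanted eigenvalues become the dominant ones of the transformed operator, where the Lanczos method converges fastest.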

  14. GPU-accelerated automatic identification of robust beam setups for proton and carbon-ion radiotherapy

    Ammazzalorso, F; Jelen, U; Bednarz, T

    2014-01-01

    We demonstrate acceleration on graphic processing units (GPU) of automatic identification of robust particle therapy beam setups, minimizing negative dosimetric effects of Bragg peak displacement caused by treatment-time patient positioning errors. Our particle therapy research toolkit, RobuR, was extended with OpenCL support and used to implement calculation on GPU of the Port Homogeneity Index, a metric scoring irradiation port robustness through analysis of tissue density patterns prior to dose optimization and computation. Results were benchmarked against an independent native CPU implementation. Numerical results were in agreement between the GPU implementation and native CPU implementation. For 10 skull base cases, the GPU-accelerated implementation was employed to select beam setups for proton and carbon ion treatment plans, which proved to be dosimetrically robust, when recomputed in presence of various simulated positioning errors. From the point of view of performance, average running time on the GPU decreased by at least one order of magnitude compared to the CPU, rendering the GPU-accelerated analysis a feasible step in a clinical treatment planning interactive session. In conclusion, selection of robust particle therapy beam setups can be effectively accelerated on a GPU and become an unintrusive part of the particle therapy treatment planning workflow. Additionally, the speed gain opens new usage scenarios, like interactive analysis manipulation (e.g. constraining of some setup) and re-execution. Finally, through OpenCL portable parallelism, the new implementation is suitable also for CPU-only use, taking advantage of multiple cores, and can potentially exploit types of accelerators other than GPUs.

  15. GPU-accelerated automatic identification of robust beam setups for proton and carbon-ion radiotherapy

    Ammazzalorso, F.; Bednarz, T.; Jelen, U.

    2014-03-01

    We demonstrate acceleration on graphic processing units (GPU) of automatic identification of robust particle therapy beam setups, minimizing negative dosimetric effects of Bragg peak displacement caused by treatment-time patient positioning errors. Our particle therapy research toolkit, RobuR, was extended with OpenCL support and used to implement calculation on GPU of the Port Homogeneity Index, a metric scoring irradiation port robustness through analysis of tissue density patterns prior to dose optimization and computation. Results were benchmarked against an independent native CPU implementation. Numerical results were in agreement between the GPU implementation and native CPU implementation. For 10 skull base cases, the GPU-accelerated implementation was employed to select beam setups for proton and carbon ion treatment plans, which proved to be dosimetrically robust, when recomputed in presence of various simulated positioning errors. From the point of view of performance, average running time on the GPU decreased by at least one order of magnitude compared to the CPU, rendering the GPU-accelerated analysis a feasible step in a clinical treatment planning interactive session. In conclusion, selection of robust particle therapy beam setups can be effectively accelerated on a GPU and become an unintrusive part of the particle therapy treatment planning workflow. Additionally, the speed gain opens new usage scenarios, like interactive analysis manipulation (e.g. constraining of some setup) and re-execution. Finally, through OpenCL portable parallelism, the new implementation is suitable also for CPU-only use, taking advantage of multiple cores, and can potentially exploit types of accelerators other than GPUs.

  16. Performance analysis of a parallel Monte Carlo code for simulating solar radiative transfer in cloudy atmospheres using CUDA-enabled NVIDIA GPU

    Russkova, Tatiana V.

    2017-11-01

    One tool to improve the performance of Monte Carlo methods for numerical simulation of light transport in the Earth's atmosphere is parallel computing technology. A new algorithm oriented to parallel execution on CUDA-enabled NVIDIA graphics processors is discussed. The efficiency of parallelization is analyzed on the basis of calculating the upward and downward fluxes of solar radiation in both vertically homogeneous and inhomogeneous models of the atmosphere. The results of testing the new code under various atmospheric conditions, including continuous single-layered and multilayered clouds and selective molecular absorption, are presented. The results of testing the code using video cards with different compute capability are analyzed. It is shown that the changeover of computing from conventional PCs to the architecture of graphics processors gives more than a hundredfold increase in performance and fully reveals the capabilities of the technology used.

  17. The Methods of Implementation of the Three-dimensional Pseudorandom Number Generator DOZEN for Heterogeneous CPU/GPU/FPGA High-performance Systems

    Nikolay Petrovich Vasilyev

    2015-03-01

    Full Text Available The paper describes the scope of information security protocols based on PRNG in industrial systems. A method for implementing the three-dimensional pseudorandom number generator DOZEN in hybrid systems is provided. The description and results of studies of a parallel CUDA version of the algorithm for use in hybrid data centers and a high-performance FPGA version for use in hardware solutions in controlled facilities of SCADA systems are given.

  18. GPU-Based Point Cloud Superpositioning for Structural Comparisons of Protein Binding Sites.

    Leinweber, Matthias; Fober, Thomas; Freisleben, Bernd

    2018-01-01

    In this paper, we present a novel approach to solve the labeled point cloud superpositioning problem for performing structural comparisons of protein binding sites. The solution is based on a parallel evolution strategy that operates on large populations and runs on GPU hardware. The proposed evolution strategy reduces the likelihood of getting stuck in a local optimum of the multimodal real-valued optimization problem represented by labeled point cloud superpositioning. The performance of the GPU-based parallel evolution strategy is compared to a previously proposed CPU-based sequential approach for labeled point cloud superpositioning, indicating that the GPU-based parallel evolution strategy leads to qualitatively better results and significantly shorter runtimes, with speed improvements of up to a factor of 1,500 for large populations. Binary classification tests based on the ATP, NADH, and FAD protein subsets of CavBase, a database containing putative binding sites, show average classification rate improvements from about 92 percent (CPU) to 96 percent (GPU). Further experiments indicate that the proposed GPU-based labeled point cloud superpositioning approach can be superior to traditional protein comparison approaches based on sequence alignments.

  19. GPU accelerated simulations of 3D deterministic particle transport using discrete ordinates method

    Gong Chunye; Liu Jie; Chi Lihua; Huang Haowei; Fang Jingyue; Gong Zhenghu

    2011-01-01

    The Graphics Processing Unit (GPU), originally developed for real-time, high-definition 3D graphics in computer games, now provides great capability for solving scientific applications. The basis of particle transport simulation is the time-dependent, multi-group, inhomogeneous Boltzmann transport equation. The numerical solution to the Boltzmann equation involves the discrete ordinates (Sn) method and the procedure of source iteration. In this paper, we present a GPU accelerated simulation of one energy group time-independent deterministic discrete ordinates particle transport in 3D Cartesian geometry (Sweep3D). The performance of the GPU simulations is reported for simulations with the vacuum boundary condition. The relative advantages and disadvantages of the GPU implementation, the simulation on multiple GPUs, the programming effort and code portability are also discussed. The results show that the overall performance speedup of one NVIDIA Tesla M2050 GPU ranges from 2.56 compared with one Intel Xeon X5670 chip to 8.14 compared with one Intel Core Q6600 chip for no flux fixup. The simulation with flux fixup on one M2050 is 1.23 times faster than on one X5670.

  20. Improving GPU-accelerated adaptive IDW interpolation algorithm using fast kNN search.

    Mei, Gang; Xu, Nengxiong; Xu, Liangliang

    2016-01-01

    This paper presents an efficient parallel Adaptive Inverse Distance Weighting (AIDW) interpolation algorithm on modern Graphics Processing Unit (GPU). The presented algorithm is an improvement of our previous GPU-accelerated AIDW algorithm by adopting fast k-nearest neighbors (kNN) search. In AIDW, it needs to find several nearest neighboring data points for each interpolated point to adaptively determine the power parameter; and then the desired prediction value of the interpolated point is obtained by weighted interpolating using the power parameter. In this work, we develop a fast kNN search approach based on the space-partitioning data structure, even grid, to improve the previous GPU-accelerated AIDW algorithm. The improved algorithm is composed of the stages of kNN search and weighted interpolating. To evaluate the performance of the improved algorithm, we perform five groups of experimental tests. The experimental results indicate: (1) the improved algorithm can achieve a speedup of up to 1017 over the corresponding serial algorithm; (2) the improved algorithm is at least two times faster than our previous GPU-accelerated AIDW algorithm; and (3) the utilization of fast kNN search can significantly improve the computational efficiency of the entire GPU-accelerated AIDW algorithm.
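
    The two stages named above (kNN search, then weighted interpolation with an adaptively chosen power parameter) can be sketched serially as follows. This is a simplified illustration under stated assumptions: the neighbour search is brute force rather than the grid-based GPU search of the paper, and the linear mapping from local point density to the power parameter, together with its bounds `r_min`, `r_max`, `p_min` and `p_max`, is a hypothetical stand-in for the paper's adaptive rule.

```python
import math


def k_nearest(points, query, k):
    """Brute-force kNN: the k closest (distance, value) pairs.

    points is a list of (x, y, value) tuples; query is an (x, y) pair.
    """
    pairs = sorted((math.dist((x, y), query), v) for x, y, v in points)
    return pairs[:k]


def adaptive_power(neighbors, n_points, area,
                   r_min=0.0, r_max=2.0, p_min=1.0, p_max=5.0):
    """Pick the IDW power from the local density of data points.

    The observed mean neighbour distance is compared with the distance
    expected for a random pattern, 1 / (2 * sqrt(n / area)); sparser
    neighbourhoods receive a larger power (assumed linear mapping).
    """
    r_obs = sum(d for d, _ in neighbors) / len(neighbors)
    r_exp = 1.0 / (2.0 * math.sqrt(n_points / area))
    mu = min(1.0, max(0.0, (r_obs / r_exp - r_min) / (r_max - r_min)))
    return p_min + (p_max - p_min) * mu


def aidw_interpolate(points, query, area, k=4):
    """Inverse-distance-weighted prediction at query from its k neighbours."""
    neighbors = k_nearest(points, query, k)
    power = adaptive_power(neighbors, len(points), area)
    numerator = denominator = 0.0
    for distance, value in neighbors:
        if distance == 0.0:
            return value  # query coincides with a data point
        weight = 1.0 / distance ** power
        numerator += weight * value
        denominator += weight
    return numerator / denominator
```

    Every interpolated point is processed independently, so on a GPU each thread can run this routine for one query point; the search stage dominates the cost, which is why a faster kNN search speeds up the whole algorithm.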

  1. GPU accelerated simulations of 3D deterministic particle transport using discrete ordinates method

    Gong, Chunye; Liu, Jie; Chi, Lihua; Huang, Haowei; Fang, Jingyue; Gong, Zhenghu

    2011-07-01

    The Graphics Processing Unit (GPU), originally developed for real-time, high-definition 3D graphics in computer games, now provides great capability for solving scientific applications. The basis of particle transport simulation is the time-dependent, multi-group, inhomogeneous Boltzmann transport equation. The numerical solution to the Boltzmann equation involves the discrete ordinates (Sn) method and the procedure of source iteration. In this paper, we present a GPU accelerated simulation of one energy group time-independent deterministic discrete ordinates particle transport in 3D Cartesian geometry (Sweep3D). The performance of the GPU simulations is reported for simulations with the vacuum boundary condition. The relative advantages and disadvantages of the GPU implementation, the simulation on multiple GPUs, the programming effort and code portability are also discussed. The results show that the overall performance speedup of one NVIDIA Tesla M2050 GPU ranges from 2.56 compared with one Intel Xeon X5670 chip to 8.14 compared with one Intel Core Q6600 chip for no flux fixup. The simulation with flux fixup on one M2050 is 1.23 times faster than on one X5670.

  2. The development of GPU-based parallel PRNG for Monte Carlo applications in CUDA Fortran

    Hamed Kargaran

    2016-04-01

    Full Text Available The implementation of Monte Carlo simulation in CUDA Fortran requires fast random number generation with good statistical properties on the GPU. In this study, a GPU-based parallel pseudo random number generator (GPPRNG) has been proposed for use in high performance computing systems. According to the type of GPU memory usage, the GPU scheme is divided into two work modes, GLOBAL_MODE and SHARED_MODE. To generate parallel random numbers based on the independent sequence method, the combination of the middle-square method and a chaotic map along with the Xorshift PRNG has been employed. Implementation of our developed PPRNG on a single GPU showed a speedup of 150x and 470x (with respect to the speed of the PRNG on a single CPU core) for GLOBAL_MODE and SHARED_MODE, respectively. To evaluate the accuracy of our developed GPPRNG, its performance was compared to that of some other commercially available PPRNGs such as MATLAB, FORTRAN and the Miller-Park algorithm by employing specific standard tests. The results of this comparison showed that the GPPRNG developed in this study can be used as a fast and accurate tool for computational science applications.

  3. The development of GPU-based parallel PRNG for Monte Carlo applications in CUDA Fortran

    Kargaran, Hamed, E-mail: h-kargaran@sbu.ac.ir; Minuchehr, Abdolhamid; Zolfaghari, Ahmad [Department of nuclear engineering, Shahid Behesti University, Tehran, 1983969411 (Iran, Islamic Republic of)

    2016-04-15

    The implementation of Monte Carlo simulation in CUDA Fortran requires fast random number generation with good statistical properties on the GPU. In this study, a GPU-based parallel pseudo random number generator (GPPRNG) has been proposed for use in high performance computing systems. According to the type of GPU memory usage, the GPU scheme is divided into two work modes, GLOBAL-MODE and SHARED-MODE. To generate parallel random numbers based on the independent sequence method, the combination of the middle-square method and a chaotic map along with the Xorshift PRNG has been employed. Implementation of our developed PPRNG on a single GPU showed a speedup of 150x and 470x (with respect to the speed of the PRNG on a single CPU core) for GLOBAL-MODE and SHARED-MODE, respectively. To evaluate the accuracy of our developed GPPRNG, its performance was compared to that of some other commercially available PPRNGs such as MATLAB, FORTRAN and the Miller-Park algorithm by employing specific standard tests. The results of this comparison showed that the GPPRNG developed in this study can be used as a fast and accurate tool for computational science applications.
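
    The independent sequence method gives every parallel worker its own generator state, so no synchronisation is needed while drawing numbers. A minimal serial sketch using Marsaglia's 32-bit Xorshift step (shift triple 13, 17, 5) is shown below. The seeding is an assumption: the paper derives seeds from a middle-square method combined with a chaotic map, whose details are not given in the abstract, so the caller supplies distinct non-zero seeds here, and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def xorshift32(state):
    """One step of Marsaglia's 32-bit Xorshift generator."""
    state ^= (state << 13) & MASK32
    state ^= state >> 17
    state ^= (state << 5) & MASK32
    return state & MASK32


def uniform_stream(seed, count):
    """Return count uniform numbers in [0, 1) from one independent stream.

    On a GPU each thread would own one such stream; the seed must be
    non-zero because zero is a fixed point of the Xorshift step.
    """
    state = seed & MASK32
    if state == 0:
        raise ValueError("seed must be non-zero modulo 2**32")
    values = []
    for _ in range(count):
        state = xorshift32(state)
        values.append(state / 2 ** 32)
    return values
```

    Keeping the state in a register or in fast on-chip memory, rather than in global memory, is the kind of placement choice that the two work modes in the abstract distinguish.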

  4. A Framework for Performing Verification and Validation in Reuse Based Software Engineering

    Addy, Edward A.

    1997-01-01

    Verification and Validation (V&V) is currently performed during application development for many systems, especially safety-critical and mission-critical systems. The V&V process is intended to discover errors, especially errors related to critical processing, as early as possible during the development process. The system application provides the context under which the software artifacts are validated. This paper describes a framework that extends V&V from an individual application system to a product line of systems that are developed within an architecture-based software engineering environment. This framework includes the activities of traditional application-level V&V, and extends these activities into domain engineering and into the transition between domain engineering and application engineering. The framework includes descriptions of the types of activities to be performed during each of the life-cycle phases, and provides motivation for the activities.

  5. Integrated State Estimation and Contingency Analysis Software Implementation using High Performance Computing Techniques

    Chen, Yousu; Glaesemann, Kurt R.; Rice, Mark J.; Huang, Zhenyu

    2015-12-31

    Power system simulation tools are traditionally developed in sequential mode and codes are optimized for single core computing only. However, the increasing complexity in the power grid models requires more intensive computation. The traditional simulation tools will soon not be able to meet the grid operation requirements. Therefore, power system simulation tools need to evolve accordingly to provide faster and better results for grid operations. This paper presents an integrated state estimation and contingency analysis software implementation using high performance computing techniques. The software is able to solve large size state estimation problems within one second and achieve a near-linear speedup of 9,800 with 10,000 cores for contingency analysis application. The performance evaluation is presented to show its effectiveness.
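
    The reported figures can be put in context with a short calculation. The sketch below computes the parallel efficiency implied by a speedup of 9,800 on 10,000 cores and, assuming Amdahl's law is an adequate model (an assumption that the abstract does not make), the serial fraction consistent with that speedup. The function names are illustrative.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / cores


def amdahl_speedup(serial_fraction, cores):
    """Speedup predicted by Amdahl's law for a given serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


def amdahl_serial_fraction(speedup, cores):
    """Invert Amdahl's law: serial fraction implied by a measured speedup."""
    return (cores / speedup - 1.0) / (cores - 1.0)


efficiency = parallel_efficiency(9800, 10000)          # 0.98
serial_fraction = amdahl_serial_fraction(9800, 10000)  # roughly 2e-6
```

    Under this model, an efficiency of 98% on 10,000 cores requires the non-parallelisable part of the work to be about two millionths of the total, which is plausible for contingency analysis because the individual contingency cases are independent of one another.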

  6. GPU accelerated population annealing algorithm

    Barash, Lev Yu.; Weigel, Martin; Borovský, Michal; Janke, Wolfhard; Shchur, Lev N.

    2017-11-01

    steps and multi-histogram reweighting. Additional comments: Code repository at https://github.com/LevBarash/PAising. The system size and size of the population of replicas are limited depending on the memory of the GPU device used. For the default parameter values used in the sample programs, L = 64, θ = 100, β0 = 0, βf = 1, Δβ = 0.005, R = 20 000, a typical run time on an NVIDIA Tesla K80 GPU is 151 seconds for the single spin coded (SSC) and 17 seconds for the multi-spin coded (MSC) program (see Section 2 for a description of these parameters).

  7. Dynamic CT myocardial perfusion imaging: performance of 3D semi-automated evaluation software

    Ebersberger, Ullrich [Medical University of South Carolina, Heart and Vascular Center, Charleston, SC (United States); Heart Center Munich-Bogenhausen, Department of Cardiology and Intensive Care Medicine, Munich (Germany); Marcus, Roy P.; Nikolaou, Konstantin; Bamberg, Fabian [University of Munich, Institute of Clinical Radiology, Munich (Germany); Schoepf, U.J.; Gray, J.C.; McQuiston, Andrew D. [Medical University of South Carolina, Heart and Vascular Center, Charleston, SC (United States); Lo, Gladys G. [Hong Kong Sanatorium and Hospital, Department of Diagnostic and Interventional Radiology, Hong Kong (China); Wang, Yining [Medical University of South Carolina, Heart and Vascular Center, Charleston, SC (United States); Peking Union Medical College Hospital, Chinese Academy of Medical Sciences, Department of Radiology, Beijing (China); Blanke, Philipp [Medical University of South Carolina, Heart and Vascular Center, Charleston, SC (United States); University Hospital Freiburg, Department of Diagnostic Radiology, Freiburg (Germany); Geyer, Lucas L. [Medical University of South Carolina, Heart and Vascular Center, Charleston, SC (United States); University of Munich, Institute of Clinical Radiology, Munich (Germany); Cho, Young Jun [Medical University of South Carolina, Heart and Vascular Center, Charleston, SC (United States); Konyang University College of Medicine, Department of Radiology, Daejeon (Korea, Republic of); Scheuering, Michael; Canstein, Christian [Siemens Healthcare, CT Division, Forchheim (Germany); Hoffmann, Ellen [Heart Center Munich-Bogenhausen, Department of Cardiology and Intensive Care Medicine, Munich (Germany)

    2014-01-15

    To evaluate the performance of three-dimensional semi-automated evaluation software for the assessment of myocardial blood flow (MBF) and blood volume (MBV) at dynamic myocardial perfusion computed tomography (CT). Volume-based software relying on marginal space learning and probabilistic boosting tree-based contour fitting was applied to CT myocardial perfusion imaging data of 37 subjects. In addition, all image data were analysed manually and both approaches were compared with SPECT findings. Study endpoints included time of analysis and conventional measures of diagnostic accuracy. Of 592 analysable segments, 42 showed perfusion defects on SPECT. Average analysis times for the manual and software-based approaches were 49.1 ± 11.2 and 16.5 ± 3.7 min, respectively (P < 0.01). There was strong agreement between the two approaches for both measures of interest (MBF, ICC = 0.91, and MBV, ICC = 0.88, both P < 0.01) and no significant difference in diagnostic accuracy between the manual and software-based approaches for either MBF or MBV (all comparisons P > 0.05). Three-dimensional semi-automated evaluation of dynamic myocardial perfusion CT data provides similar measures and diagnostic accuracy to manual evaluation, albeit with substantially reduced analysis times. This capability may aid the integration of this test into clinical workflows. (orig.)

  8. A SOFTWARE TOOL TO COMPARE MEASURED AND SIMULATED BUILDING ENERGY PERFORMANCE DATA

    Maile, Tobias; Bazjanac, Vladimir; O'Donnell, James; Garr, Matthew

    2011-11-01

    Building energy performance is often inadequate when compared to design goals. To link design goals to actual operation, one can compare measured with simulated energy performance data. Our previously developed comparison approach is the Energy Performance Comparison Methodology (EPCM), which enables the identification of performance problems based on a comparison of measured and simulated performance data. In the context of this method, we developed a software tool that provides graphing and data processing capabilities for the two performance data sets. The software tool, called SEE IT (Stanford Energy Efficiency Information Tool), eliminates the need for manual generation of data plots and data reformatting. SEE IT makes the generation of time series, scatter and carpet plots independent of the source of data (measured or simulated) and provides a valuable tool for comparing measurements with simulation results. SEE IT also allows assigning data points to a predefined building object hierarchy and supports different versions of simulated performance data. This paper briefly introduces the EPCM, describes the SEE IT tool and illustrates its use in the context of a building case study.

  9. Optimisation du produit matrice-vecteur creux sur architecture GPU pour un simulateur de réservoir [Optimization of the sparse matrix-vector product on GPU architecture for a reservoir simulator]

    Rossignon, Corentin

    2013-01-01

    For the Total company, simulating reservoirs is an important step in the process of optimizing production. Currently, these simulations run entirely on CPUs. We have therefore attempted to accelerate the sparse matrix-vector product operators of the simulation by using GPUs. Common GPU libraries for sparse linear algebra use generic formats for sparse matrix storage, which perform more or less well on GPUs but do not fully exploit the specific structure of the matr...
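
    As an illustration of what a generic sparse storage format looks like, the following is a minimal sketch of a sparse matrix-vector product in compressed sparse row (CSR) form, one widely used generic format. It is not the specialized format investigated in the record above; the function name `csr_spmv` and the small test matrix are hypothetical. The outer loop over rows is the part that a GPU implementation would typically distribute across threads.

```python
from typing import List, Sequence


def csr_spmv(values: Sequence[float],
             col_idx: Sequence[int],
             row_ptr: Sequence[int],
             x: Sequence[float]) -> List[float]:
    """Compute y = A @ x for a matrix A stored in CSR form.

    Hypothetical helper for illustration only.
    values  : nonzero entries, row by row
    col_idx : column index of each nonzero entry
    row_ptr : row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):  # each row is independent of the others
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# 3x3 example matrix: [[4, 0, 1], [0, 3, 0], [2, 0, 5]]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
y = csr_spmv(values, col_idx, row_ptr, [1.0, 2.0, 3.0])
```

    Because CSR stores rows of varying length back to back, neighbouring GPU threads can end up reading non-contiguous memory, which is one reason a format tailored to a known matrix structure may perform better.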

  10. Porting AMG2013 to Heterogeneous CPU+GPU Nodes

    Samfass, Philipp [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)

    2017-01-26

    LLNL's future advanced technology system SIERRA will feature heterogeneous compute nodes that consist of IBM Power9 CPUs and NVIDIA Volta GPUs. Conceptually, the motivation for such an architecture is quite straightforward: While GPUs are optimized for throughput on massively parallel workloads, CPUs strive to minimize latency for rather sequential operations. Yet, making optimal use of heterogeneous architectures raises new challenges for the development of scalable parallel software, e.g., with respect to work distribution. Porting LLNL's parallel numerical libraries to upcoming heterogeneous CPU+GPU architectures is therefore a critical factor for ensuring LLNL's future success in fulfilling its national mission. One of these libraries, called HYPRE, provides parallel solvers and preconditioners for large, sparse linear systems of equations. In the context of this internship project, I consider AMG2013, which is a proxy application for major parts of HYPRE that implements a benchmark for setting up and solving different systems of linear equations. In the following, I describe in detail how I ported multiple parts of AMG2013 to the GPU (Section 2) and present results for different experiments that demonstrate a successful parallel implementation on the heterogeneous machines surface and ray (Section 3). In Section 4, I give guidelines on how my code should be used. Finally, I conclude and give an outlook for future work (Section 5).

  11. Study of the acceleration of nuclide burnup calculation using GPU with CUDA

    Okui, S.; Ohoka, Y.; Tatsumi, M.

    2009-01-01

    The computational cost of neutronics calculation codes rises as physics models and methods become more complicated, so the level of detail in neutronics calculations tends to be limited by the available computing power. To overcome this limit, the use of GPUs for general-purpose computing, called GPGPU, has been studied [1]. A GPU has a multi-threaded computing mechanism based on multiprocessors, which achieves much higher performance than CPUs. NVIDIA recently released CUDA, a C-like programming language for general-purpose computation. It is relatively easy to learn compared to the conventional tools used for GPGPU, such as OpenGL or Cg. Applying GPUs to numerical calculation has therefore become much easier. In this paper, we tried to accelerate the nuclide burnup calculation, which is important for predicting the time dependence of nuclides in the core, using a GPU with CUDA. We chose the 4th-order Runge-Kutta method to solve the nuclide burnup equation. The nuclide burnup calculation and the 4th-order Runge-Kutta method were suitable as a first step in introducing CUDA into numerical calculation because they consist of simple single-precision matrix and vector operations, with the actual code written in C++. Our experimental results showed that nuclide burnup calculations on a GPU can potentially be sped up by a factor of 100 compared to those on a CPU. (authors)
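
    To make the structure of such a calculation concrete, here is a minimal sketch of a 4th-order Runge-Kutta integration of a burnup-type system dN/dt = A N, written under several assumptions: the transition matrix `A`, the two-nuclide decay chain and its decay constants, and all function names are hypothetical, and Python's double-precision floats are used rather than the single precision mentioned in the abstract. Each step consists only of matrix-vector products and vector updates, which is the kind of simple operation the abstract describes.

```python
import math
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(a: Matrix, v: Vector) -> Vector:
    """Dense matrix-vector product a @ v."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]


def rk4_step(a: Matrix, n: Vector, dt: float) -> Vector:
    """Advance dN/dt = A N by one classical 4th-order Runge-Kutta step."""
    k1 = matvec(a, n)
    k2 = matvec(a, [ni + 0.5 * dt * ki for ni, ki in zip(n, k1)])
    k3 = matvec(a, [ni + 0.5 * dt * ki for ni, ki in zip(n, k2)])
    k4 = matvec(a, [ni + dt * ki for ni, ki in zip(n, k3)])
    return [ni + dt / 6.0 * (s1 + 2.0 * s2 + 2.0 * s3 + s4)
            for ni, s1, s2, s3, s4 in zip(n, k1, k2, k3, k4)]


def burnup(a: Matrix, n0: Vector, t_end: float, steps: int) -> Vector:
    """Integrate from t = 0 to t_end with a fixed number of RK4 steps."""
    dt = t_end / steps
    n = list(n0)
    for _ in range(steps):
        n = rk4_step(a, n, dt)
    return n


# Hypothetical chain: parent decays (lam1) into a daughter that decays (lam2).
lam1, lam2 = 0.1, 0.05
A = [[-lam1, 0.0],
     [lam1, -lam2]]
result = burnup(A, [1.0, 0.0], t_end=10.0, steps=1000)

# Analytic (Bateman) solution for comparison.
parent_exact = math.exp(-lam1 * 10.0)
daughter_exact = lam1 / (lam2 - lam1) * (math.exp(-lam1 * 10.0) - math.exp(-lam2 * 10.0))
```

    Real burnup matrices are much larger and often stiff, so a fixed step size like the one used here is a simplification chosen only to keep the sketch short.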

  12. Fast 3D elastic micro-seismic source location using new GPU features

    Xue, Qingfeng; Wang, Yibo; Chang, Xu

    2016-12-01

    In this paper, we describe new GPU features and their application to passive seismic monitoring, namely micro-seismic event location. Locating micro-seismic events is quite important in seismic exploration, especially when searching for unconventional oil and gas resources. Unlike traditional ray-based methods, wave equation methods, such as the one we use in this paper, have a remarkable advantage in adapting to low signal-to-noise ratio conditions and do not require a person to select the data. However, because of their high computational cost, these methods are not widely used in industry. To make the method practical, we implement imaging-like wave equation micro-seismic location in a 3D elastic medium and use a GPU to accelerate our algorithm. We also introduce some new GPU features into the implementation to solve the data transfer and GPU utilization problems. Numerical and field data experiments show that our method can achieve a performance improvement of more than 30% in the GPU implementation just by using these new features.

  13. Computing OpenSURF on OpenCL and General Purpose GPU

    Wanglong Yan

    2013-10-01

    The Speeded-Up Robust Feature (SURF) algorithm is widely used for image feature detection and matching in computer vision. Open Computing Language (OpenCL) is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors. This paper describes how to implement an open-source SURF program, namely OpenSURF, on a general-purpose GPU with OpenCL, and discusses in detail the optimizations in terms of thread architectures and memory models. Our final OpenCL implementation of OpenSURF is on average 37% and 64% faster than the OpenCV SURF v2.4.5 CUDA implementation on NVidia's GTX660 and GTX460SE GPUs, respectively. Our OpenCL program achieved real-time performance (>25 frames per second) for almost all the input images, with sizes from 320*240 to 1024*768, on NVidia's GTX660 GPU, NVidia's GTX460SE GPU and AMD's Radeon HD 6850 GPU. Our OpenCL approach on NVidia's GTX660 GPU is on average more than 22.8 times faster than its original CPU version on Intel's Dual-Core E5400 2.7G.

  14. Carotid artery stenosis: Performance of advanced vessel analysis software in evaluating CTA

    Tsiflikas, Ilias; Biermann, Christina; Thomas, Christoph; Ketelsen, Dominik; Claussen, Claus D.; Heuschmid, Martin

    2012-01-01

    Objectives: The aim of this study was to evaluate the time efficiency and diagnostic reproducibility of advanced vessel analysis software for the diagnosis of carotid artery stenosis. Material and methods: 40 patients with suspected carotid artery stenosis received head and neck DE-CTA as part of their pre-interventional workup. Acquired data were evaluated by 2 independent radiologists. Stenosis grading was performed by MPR eyeballing with freely adjustable MPRs and with a preliminary prototype of the client-server advanced visualization software syngo.via CT Vascular (Siemens Healthcare, Erlangen, Germany), which has since become available. Stenoses were graded according to the following 5 categories: I: 0%, II: 1–50%, III: 51–69%, IV: 70–99% and V: total occlusion. Furthermore, time to diagnosis for each carotid artery was recorded. Results: Both readers achieved very good specificity values and good to very good sensitivity values, without significant differences between the two reading methods. Furthermore, there was a very good correlation between the two readers for both reading methods without significant differences (kappa value: standard image interpretation k = 0.809; advanced vessel analysis software k = 0.863). Using advanced vessel analysis software resulted in a significant time saving (p < 0.0001) for both readers. Time to diagnosis could be decreased by approximately 55%. Conclusions: The advanced vessel analysis application CT Vascular of the new imaging software syngo.via (Siemens Healthcare, Forchheim, Germany) provides a high rate of reproducibility in the assessment of carotid artery stenosis. Furthermore, a significant time saving in comparison to standard image interpretation is achievable.

  15. Carotid artery stenosis: Performance of advanced vessel analysis software in evaluating CTA

    Tsiflikas, Ilias, E-mail: ilias.tsiflikas@med.uni-tuebingen.de [University Hospital of Tuebingen, Diagnostic and Interventional Radiology, Hoppe-Seyler-Str. 3, 72076 Tuebingen (Germany); Biermann, Christina, E-mail: christina.biermann@siemens.com [University Hospital of Tuebingen, Diagnostic and Interventional Radiology, Hoppe-Seyler-Str. 3, 72076 Tuebingen (Germany); Siemens AG, Siemens Healthcare Consulting, Allee am Röthelheimpark 3A, 91052 Erlangen (Germany); Thomas, Christoph, E-mail: christoph.thomas@med.uni-tuebingen.de [University Hospital of Tuebingen, Diagnostic and Interventional Radiology, Hoppe-Seyler-Str. 3, 72076 Tuebingen (Germany); Ketelsen, Dominik, E-mail: dominik.ketelsen@med.uni-tuebingen.de [University Hospital of Tuebingen, Diagnostic and Interventional Radiology, Hoppe-Seyler-Str. 3, 72076 Tuebingen (Germany); Claussen, Claus D., E-mail: claus.claussen@med.uni-tuebingen.de [University Hospital of Tuebingen, Diagnostic and Interventional Radiology, Hoppe-Seyler-Str. 3, 72076 Tuebingen (Germany); Heuschmid, Martin, E-mail: martin.heuschmid@med.uni-tuebingen.de [University Hospital of Tuebingen, Diagnostic and Interventional Radiology, Hoppe-Seyler-Str. 3, 72076 Tuebingen (Germany)

    2012-09-15

    Objectives: The aim of this study was to evaluate the time efficiency and diagnostic reproducibility of advanced vessel analysis software for the diagnosis of carotid artery stenosis. Material and methods: 40 patients with suspected carotid artery stenosis received head and neck DE-CTA as part of their pre-interventional workup. Acquired data were evaluated by 2 independent radiologists. Stenosis grading was performed by MPR eyeballing with freely adjustable MPRs and with a preliminary prototype of the client-server advanced visualization software syngo.via CT Vascular (Siemens Healthcare, Erlangen, Germany), which has since become available. Stenoses were graded according to the following 5 categories: I: 0%, II: 1–50%, III: 51–69%, IV: 70–99% and V: total occlusion. Furthermore, time to diagnosis for each carotid artery was recorded. Results: Both readers achieved very good specificity values and good to very good sensitivity values, without significant differences between the two reading methods. Furthermore, there was a very good correlation between the two readers for both reading methods without significant differences (kappa value: standard image interpretation k = 0.809; advanced vessel analysis software k = 0.863). Using advanced vessel analysis software resulted in a significant time saving (p < 0.0001) for both readers. Time to diagnosis could be decreased by approximately 55%. Conclusions: The advanced vessel analysis application CT Vascular of the new imaging software syngo.via (Siemens Healthcare, Forchheim, Germany) provides a high rate of reproducibility in the assessment of carotid artery stenosis. Furthermore, a significant time saving in comparison to standard image interpretation is achievable.

  16. SU-E-T-423: Fast Photon Convolution Calculation with a 3D-Ideal Kernel On the GPU

    Moriya, S; Sato, M [Komazawa University, Setagaya, Tokyo (Japan); Tachibana, H [National Cancer Center Hospital East, Kashiwa, Chiba (Japan)

    2015-06-15

    Purpose: Calculation time is the trade-off for improving the accuracy of convolution dose calculation with fine calculation spacing of the KERMA kernel. We investigated accelerating the convolution calculation using an ideal kernel on a graphics processing unit (GPU). Methods: The calculation was performed on AMD graphics hardware (Dual FirePro D700), and our algorithm was implemented using Aparapi, which converts Java bytecode to OpenCL. The dose calculation process was separated into the TERMA and KERMA steps. The dose deposited at the coordinate (x, y, z) was determined in the process. In the dose calculation running on the central processing unit (CPU), an Intel Xeon E5, the calculation loops were performed for all calculation points. In the GPU computation, all of the calculation processes for the points were sent to the GPU and computed with multiple threads. In this study, the dose calculation was performed in a water-equivalent homogeneous phantom with 150³ voxels (2 mm calculation grid), and the calculation speed and the accuracy of the PDD on the GPU were compared with those on the CPU. Results: The calculation times for the GPU and the CPU were 3.3 s and 4.4 h, respectively. The calculation on the GPU was 4800 times faster than that on the CPU. The PDD curve for the GPU perfectly matched that for the CPU. Conclusion: The convolution calculation with the ideal kernel on the GPU was clinically acceptable in terms of time and may be more accurate in an inhomogeneous region. Intensity modulated arc therapy needs dose calculations for different gantry angles at many control points. Thus, it would be more practical for the kernel to use a coarse spacing technique if the calculation is faster while keeping accuracy similar to that of a current treatment planning system.
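
    The per-point structure of a convolution dose calculation can be sketched as follows. This is a minimal illustration under stated assumptions, not the Aparapi/OpenCL implementation from the record above: the function name `convolve_dose`, the grid sizes, and the kernel values are hypothetical, and the kernel is assumed to have odd dimensions and be centred on the deposition voxel. The dose at each calculation point is a sum of TERMA values weighted by the kernel, and every point can be evaluated independently, which is what allows the points to be handed to separate GPU threads.

```python
from typing import List

Grid = List[List[List[float]]]


def zeros(nx: int, ny: int, nz: int) -> Grid:
    """Allocate an nx * ny * nz grid filled with zeros."""
    return [[[0.0] * nz for _ in range(ny)] for _ in range(nx)]


def convolve_dose(terma: Grid, kernel: Grid) -> Grid:
    """Direct 3D convolution: dose(p) = sum over q of terma(q) * kernel(p - q).

    Hypothetical sketch; the kernel is assumed odd-sized and centred.
    Contributions from outside the TERMA grid are treated as zero.
    """
    nx, ny, nz = len(terma), len(terma[0]), len(terma[0][0])
    kx, ky, kz = len(kernel), len(kernel[0]), len(kernel[0][0])
    cx, cy, cz = kx // 2, ky // 2, kz // 2
    dose = zeros(nx, ny, nz)
    for x in range(nx):            # on a GPU, one thread per (x, y, z)
        for y in range(ny):
            for z in range(nz):
                acc = 0.0
                for i in range(kx):
                    sx = x - (i - cx)
                    if not 0 <= sx < nx:
                        continue
                    for j in range(ky):
                        sy = y - (j - cy)
                        if not 0 <= sy < ny:
                            continue
                        for k in range(kz):
                            sz = z - (k - cz)
                            if 0 <= sz < nz:
                                acc += terma[sx][sy][sz] * kernel[i][j][k]
                dose[x][y][z] = acc
    return dose


# Single TERMA source voxel in a 5 x 5 x 5 grid (illustrative values).
terma = zeros(5, 5, 5)
terma[2][2][2] = 2.0
kernel = zeros(3, 3, 3)
kernel[1][1][1] = 0.5    # energy deposited locally
kernel[2][1][1] = 0.25   # energy carried one voxel along +x
dose = convolve_dose(terma, kernel)
```

    The reported timings are consistent with the stated speedup: 4.4 h is 15,840 s, and 15,840 / 3.3 = 4800.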

  17. Optimizing the Performance of Radionuclide Identification Software in the Hunt for Nuclear Security Threats

    Fotion, Katherine A. [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)

    2016-08-18

    The Radionuclide Analysis Kit (RNAK), my team’s most recent nuclide identification software, is entering the testing phase. A question arises: will removing rare nuclides from the software’s library improve its overall performance? An affirmative response indicates fundamental errors in the software’s framework, while a negative response confirms the effectiveness of the software’s key machine learning algorithms. After thorough testing, I found that the performance of RNAK cannot be improved with the library choice effect, thus verifying the effectiveness of RNAK’s algorithms—multiple linear regression, Bayesian network using the Viterbi algorithm, and branch and bound search.

  18. Optimisation of Software-Defined Networks Performance Using a Hybrid Intelligent System

    Ann Sabih

    2017-06-01

    This paper proposes a novel intelligent technique designed to optimise the performance of Software Defined Networks (SDN). The proposed hybrid intelligent system integrates intelligence-based optimisation approaches with an artificial neural network. The heuristic optimisation methods include Genetic Algorithms (GA) and Particle Swarm Optimisation (PSO). These methods were used separately to select the best inputs to maximise SDN performance. The neural network model is trained and applied in order to identify SDN behaviour. The best optimisation approach was identified using an analytical approach that considered SDN performance and computational time as objective functions. Initially, the general model of the neural network was tested with unseen data before implementing the model using GA and PSO to determine the optimal performance of SDN. The results showed that the SDN represented by an Artificial Neural Network (ANN) and optimised by PSO generated a better configuration with regard to computational efficiency and performance index.

  19. New Software Performance with Balanced Score Card Assessment: Case Study at LPGI Jakarta

    Brata Wibawa Djojo

    2011-09-01

    Implementation of information technology (IT), especially new software applications, needs to be evaluated for its impact on an organization’s business performance in relation to its strategic goals. The measurement and evaluation of the impact of a new software implementation at LPGI Jakarta uses Balanced Scorecard (BSC) analysis, comparing three years of data. The analysis involves four perspectives of BSC: (1) financial aspect, with the growth of gross premium written (GPW), net premium written (NPW), and underwriting profit; (2) internal business aspect, with the frequency of policies issued and the average production per policy; (3) people, or learning and growth, which consists of human error and system error; (4) customer aspect, with external endorsement and renewal ratio. This research measures and evaluates the impact of the implementation of a new software application on business performance as a Marginal and Fair contribution. At the end of this paper the writer suggests that LPGI Jakarta increase its sales activities to reach the target, which is related directly to the financial aspect and the internal business process aspect.

  20. Performance evaluation of spectral deconvolution analysis tool (SDAT) software used for nuclear explosion radionuclide measurements

    Foltz Biegalski, K.M.; Biegalski, S.R.; Haas, D.A.

    2008-01-01

    The Spectral Deconvolution Analysis Tool (SDAT) software was developed to improve counting statistics and detection limits for nuclear explosion radionuclide measurements. SDAT utilizes spectral deconvolution spectroscopy techniques and can analyze both β-γ coincidence spectra for radioxenon isotopes and high-resolution HPGe spectra from aerosol monitors. Spectral deconvolution spectroscopy is an analysis method that utilizes the entire signal deposited in a gamma-ray detector rather than the small portion of the signal that is present in one gamma-ray peak. This method shows promise to improve detection limits over classical gamma-ray spectroscopy analytical techniques; however, this hypothesis has not been tested. To address this issue, we performed three tests to compare the detection ability and variance of SDAT results to those of commercial off-the-shelf (COTS) software which utilizes a standard peak search algorithm. (author)

  1. A Business Analytics Software Tool for Monitoring and Predicting Radiology Throughput Performance.

    Jones, Stephen; Cournane, Seán; Sheehy, Niall; Hederman, Lucy

    2016-12-01

    Business analytics (BA) is increasingly being utilised by radiology departments to analyse and present data. It encompasses statistical analysis, forecasting and predictive modelling and is used as an umbrella term for decision support and business intelligence systems. The primary aim of this study was to determine whether utilising BA technologies could contribute towards improved decision support and resource management within radiology departments. A set of information technology requirements were identified with key stakeholders, and a prototype BA software tool was designed, developed and implemented. A qualitative evaluation of the tool was carried out through a series of semi-structured interviews with key stakeholders. Feedback was collated, and emergent themes were identified. The results indicated that BA software applications can provide visibility of radiology performance data across all time horizons. The study demonstrated that the tool could potentially assist with improving operational efficiencies and management of radiology resources.

  2. Studies in Software Cost Model Behavior: Do We Really Understand Cost Model Performance?

    Lum, Karen; Hihn, Jairus; Menzies, Tim

    2006-01-01

    While there exists extensive literature on software cost estimation techniques, industry practice continues to rely upon standard regression-based algorithms. These software effort models are typically calibrated or tuned to local conditions using local data. This paper cautions that current approaches to model calibration often produce sub-optimal models because of the large variance problem inherent in cost data and by including far more effort multipliers than the data supports. Building optimal models requires that a wider range of models be considered while correctly calibrating these models requires rejection rules that prune variables and records and use multiple criteria for evaluating model performance. The main contribution of this paper is to document a standard method that integrates formal model identification, estimation, and validation. It also documents what we call the large variance problem that is a leading cause of cost model brittleness or instability.

  3. On the Tradeoff between Performance and Programmability for Software Defined WiFi Networks

    Tausif Zahid

    2018-01-01

    WiFi has become one of the major access networks due to its simple technical implementation and high-bandwidth provisioning. In this paper, we studied software defined WiFi networks (SDWN) against traditional WiFi networks to understand the potential benefits of adopting the SDWN architecture on WiFi networks, such as the ability of SDWN to effectively hide the handover delay between access points (AP), and to identify representative application scenarios where such an SDWN approach could bring additional benefits. This study delineated the performance bottlenecks, such as throughput degradation of around 50% compared with conventional WiFi networks. In addition, our study also shed some light on performance optimization issues. All of the performance measurements were conducted on a network testbed consisting of a single basic service set (BSS) and an extended service set (ESS) managed by a single SDN controller deployed with various laboratory settings. Our evaluation included the throughput performance under different traffic loads with different numbers of nodes and packet sizes for both TCP and UDP traffic flows. Handover delays were measured during the roaming phase between different APs and compared against traditional WiFi networks. Our results have demonstrated the tradeoff between performance and programmability of software defined APs.

  4. Implementation and optimization of ultrasound signal processing algorithms on mobile GPU

    Kong, Woo Kyu; Lee, Wooyoul; Kim, Kyu Cheol; Yoo, Yangmo; Song, Tai-Kyong

    2014-03-01

    General-purpose graphics processing units (GPGPUs) have been used to improve computing power in medical ultrasound imaging systems. Recently, mobile GPUs have become powerful enough to handle 3D games and videos at high frame rates on Full HD or HD resolution displays. This paper proposes a method to implement ultrasound signal processing on a mobile GPU available in a high-end smartphone (Galaxy S4, Samsung Electronics, Seoul, Korea) with programmable shaders on the OpenGL ES 2.0 platform. To maximize the performance of the mobile GPU, the shader design and the load sharing between the vertex and fragment shaders were optimized. The beamformed data were captured from a tissue mimicking phantom (Model 539 Multipurpose Phantom, ATS Laboratories, Inc., Bridgeport, CT, USA) by using a commercial ultrasound imaging system equipped with a research package (Ultrasonix Touch, Ultrasonix, Richmond, BC, Canada). The real-time performance was evaluated by frame rates while varying the range of signal processing blocks. The implementation of ultrasound signal processing on OpenGL ES 2.0 was verified by analyzing the PSNR against a MATLAB gold standard that has the same signal path. CNR was also analyzed to verify the method. From the evaluations, the proposed mobile GPU-based processing method shows no significant difference from the processing using MATLAB (i.e., PSNRe., 11.31). With the mobile GPU implementation, a frame rate of 57.6 Hz was achieved. The total execution time was 17.4 ms, which was faster than the acquisition time (i.e., 34.4 ms). These results indicate that the mobile GPU-based processing method can support real-time ultrasound B-mode processing on the smartphone.
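
    For readers unfamiliar with the verification metric, the following is a minimal sketch of the standard peak signal-to-noise ratio (PSNR) between a reference image and a test image, both flattened to sequences of pixel values. The function name `psnr`, the 8-bit peak value of 255, and the sample pixel values are assumptions made for illustration; the record above does not state the exact conventions used.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], test: Sequence[float],
         peak: float = 255.0) -> float:
    """Return PSNR in dB: 10 * log10(peak^2 / MSE).

    Hypothetical helper; `peak` is the maximum possible pixel value
    (255 is assumed here for 8-bit images).
    """
    if len(reference) != len(test) or len(reference) == 0:
        raise ValueError("images must be non-empty and of equal size")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / mse)


# Two tiny "images" that differ by one grey level in a single pixel.
gold = [0.0, 255.0, 128.0, 64.0]
gpu_out = [0.0, 255.0, 128.0, 65.0]
value_db = psnr(gold, gpu_out)
```

    A higher PSNR means the test output is closer to the reference, and identical images give an infinite value.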

  5. PIConGPU - How to build one of the fastest GPU particle-in-cell codes in the world

    Burau, Heiko; Debus, Alexander; Helm, Anton; Huebl, Axel; Kluge, Thomas; Widera, Rene; Bussmann, Michael; Schramm, Ulrich; Cowan, Thomas [HZDR, Dresden (Germany); Juckeland, Guido; Nagel, Wolfgang [TU Dresden (Germany); ZIH, Dresden (Germany); Schmitt, Felix [NVIDIA (United States)

    2013-07-01

    We present the algorithmic building blocks of PIConGPU, one of the fastest implementations of the particle-in-cell algorithm on GPU clusters. PIConGPU is a highly-scalable, 3D3V electromagnetic PIC code that is used in laser plasma and astrophysical plasma simulations.

  6. GPU-based Scalable Volumetric Reconstruction for Multi-view Stereo

    Kim, H; Duchaineau, M; Max, N

    2011-09-21

    We present a new scalable volumetric reconstruction algorithm for multi-view stereo using a graphics processing unit (GPU). It is an effectively parallelized GPU algorithm that simultaneously uses a large number of GPU threads, each of which performs voxel carving, in order to integrate depth maps with images from multiple views. Each depth map, triangulated from pair-wise semi-dense correspondences, represents a view-dependent surface of the scene. This algorithm also provides scalability for large-scale scene reconstruction in a high resolution voxel grid by utilizing streaming and parallel computation. The output is a photo-realistic 3D scene model in a volumetric or point-based representation. We demonstrate the effectiveness and the speed of our algorithm with a synthetic scene and real urban/outdoor scenes. Our method can also be integrated with existing multi-view stereo algorithms such as PMVS2 to fill holes or gaps in textureless regions.

  7. A GPU accelerated and error-controlled solver for the unbounded Poisson equation in three dimensions

    Exl, Lukas

    2017-12-01

    An efficient solver for the three dimensional free-space Poisson equation is presented. The underlying numerical method is based on finite Fourier series approximation. While the error of all involved approximations can be fully controlled, the overall computation error is driven by the convergence of the finite Fourier series of the density. For smooth and fast-decaying densities the proposed method will be spectrally accurate. The method scales with O(N log N) operations, where N is the total number of discretization points in the Cartesian grid. The majority of the computational costs come from fast Fourier transforms (FFT), which makes it ideal for GPU computation. Several numerical computations on CPU and GPU validate the method and show efficiency and convergence behavior. Tests are performed using the Vienna Scientific Cluster 3 (VSC3). A free MATLAB implementation for CPU and GPU is provided to the interested community.

  8. Reliability Lessons Learned From GPU Experience With The Titan Supercomputer at Oak Ridge Leadership Computing Facility

    Gallarno, George [Christian Brothers University; Rogers, James H [ORNL; Maxwell, Don E [ORNL

    2015-01-01

    The high computational capability of graphics processing units (GPUs) is enabling and driving the scientific discovery process at large scale. The world's second fastest supercomputer for open science, Titan, has more than 18,000 GPUs that computational scientists use to perform scientific simulations and data analysis. Understanding of GPU reliability characteristics, however, is still in its nascent stage since GPUs have only recently been deployed at large scale. This paper presents a detailed study of GPU errors and their impact on system operations and applications, describing experiences with the 18,688 GPUs on the Titan supercomputer as well as lessons learned in the process of efficient operation of GPUs at scale. These experiences are helpful to HPC sites which already have large-scale GPU clusters or plan to deploy GPUs in the future.

  9. TU-FG-BRB-07: GPU-Based Prompt Gamma Ray Imaging From Boron Neutron Capture Therapy

    Kim, S; Suh, T; Yoon, D; Jung, J; Shin, H; Kim, M [The Catholic University of Korea, Seoul (Korea, Republic of)

    2016-06-15

    Purpose: The purpose of this research is to perform the fast reconstruction of a prompt gamma ray image using a graphics processing unit (GPU) computation from boron neutron capture therapy (BNCT) simulations. Methods: To evaluate the accuracy of the reconstructed image, a phantom including four boron uptake regions (BURs) was used in the simulation. After the Monte Carlo simulation of the BNCT, the modified ordered subset expectation maximization reconstruction algorithm using the GPU computation was used to reconstruct the images with fewer projections. The computation times for image reconstruction were compared between the GPU and the central processing unit (CPU). Also, the accuracy of the reconstructed image was evaluated by a receiver operating characteristic (ROC) curve analysis. Results: The image reconstruction time using the GPU was 196 times faster than the conventional reconstruction time using the CPU. For the four BURs, the area under curve values from the ROC curve were 0.6726 (A-region), 0.6890 (B-region), 0.7384 (C-region), and 0.8009 (D-region). Conclusion: The tomographic image using the prompt gamma ray event from the BNCT simulation was acquired using the GPU computation in order to perform a fast reconstruction during treatment. The authors verified the feasibility of the prompt gamma ray reconstruction using the GPU computation for BNCT simulations.
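
    The area-under-curve values quoted above summarize a receiver operating characteristic curve in a single number. As a minimal sketch of how such a value can be obtained, the code below uses the rank-based (Mann-Whitney) formulation: the AUC equals the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half. The function name `roc_auc` and the sample scores are hypothetical, and the record does not state which procedure its authors used.

```python
from typing import Sequence


def roc_auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Hypothetical helper: counts the fraction of (positive, negative)
    pairs ranked correctly, with ties contributing 0.5.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both classes need at least one sample")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Illustrative scores, e.g. reconstructed intensity inside and outside a region.
inside = [0.9, 0.8, 0.4]
outside = [0.7, 0.3, 0.4]
auc = roc_auc(inside, outside)
```

    An AUC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation, which gives a scale for reading the reported values between 0.67 and 0.80.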

  10. Impact Analysis of Generalized Audit Software (GAS) Utilization to Auditor Performances

    Aries Wicaksono

    2016-09-01

    This study aimed to determine whether the use of Generalized Audit Software (GAS) in the audit process affects auditors' performance, and to draw evaluative conclusions about how a GAS-based audit process can have a positive impact on that performance. The models used to evaluate the impact of GAS were Quantity of Work, Quality of Work, Job Knowledge, Creativeness, Cooperation, Dependability, Initiative, and Personal Qualities. The research used a qualitative, analytical descriptive and evaluative method, analyzing the impact of GAS implementation on the components of the user's performance. The results indicate that the use of GAS has a positive impact on the user's performance components.

  11. A Framework for Performing V&V within Reuse-Based Software Engineering

    Addy, Edward A.

    1996-01-01

    Verification and validation (V&V) is performed during application development for many systems, especially safety-critical and mission-critical systems. The V&V process is intended to discover errors, especially errors related to critical processing, as early as possible during the development process. Early discovery is important in order to minimize the cost and other impacts of correcting these errors. In order to provide early detection of errors, V&V is conducted in parallel with system development, often beginning with the concept phase. In reuse-based software engineering, however, decisions on the requirements, design and even implementation of domain assets can be made prior to beginning development of a specific system. In this case, V&V must be performed during domain engineering in order to have an impact on system development. This paper describes a framework for performing V&V within architecture-centric, reuse-based software engineering. This framework includes the activities of traditional application-level V&V, and extends these activities into domain engineering and into the transition between domain engineering and application engineering. The framework includes descriptions of the types of activities to be performed during each of the life-cycle phases, and provides motivation for the activities.

  12. Improving Utility of GPU in Accelerating Industrial Applications with User-centred Automatic Code Translation

    Yang, Po; Dong, Feng; Codreanu, Valeriu

    2018-01-01

    design and hard-to-use. Little attention has been paid to the applicability, usability and learnability of these tools for normal users. In this paper, we present an online automated CPU-to-GPU source translation system (GPSME) for inexperienced users to utilize GPU capability in accelerating general...... SME applications. This system designs and implements a directive programming model with a new kernel generation scheme and memory management hierarchy to optimize its performance. A web service interface is designed for inexperienced users to easily and flexibly invoke the automatic resource translator...

  13. CUDA GPU based full-Stokes finite difference modelling of glaciers

    Brædstrup, Christian; Egholm, D.L.

    advances in graphics card (GPU) technology for high performance computing have proven extremely efficient in accelerating many large scale scientific computations. The general purpose GPU (GPGPU) technology is cheap, has low power consumption and fits into a normal desktop computer. It could therefore...... to minimize the short wavelength errors efficiently. This reduces the iteration count by several orders of magnitude. The run-time is further reduced by using the GPGPU technology where each card has up to 448 cores. Researchers utilizing the GPGPU technique in other areas have reported between 2 - 11 times...

  14. GPU Acceleration of DSP for Communication Receivers.

    Gunther, Jake; Gunther, Hyrum; Moon, Todd

    2017-09-01

    Graphics processing unit (GPU) implementations of signal processing algorithms can outperform CPU-based implementations. This paper describes the GPU implementation of several algorithms encountered in a wide range of high-data-rate communication receivers, including filters, multirate filters, numerically controlled oscillators, and multi-stage digital down converters. These structures are tested by processing the 20 MHz wide FM radio band (88-108 MHz). Two receiver structures are explored: a single channel receiver and a filter bank channelizer. Both run in real time on an NVIDIA GeForce GTX 1080 graphics card.
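
    The receiver building blocks named in this abstract (a numerically controlled oscillator, a filter, and a down converter) can be illustrated with a minimal CPU-side sketch. It is not the authors' GPU implementation: the function names, the 10-tap moving-average filter, the 20 MHz sample rate and the 2 MHz test tone are all assumptions chosen only to keep the example short.

```python
import cmath
import math


def nco(freq_hz, sample_rate_hz, n_samples):
    """Numerically controlled oscillator: complex exponential at -freq_hz."""
    phase_step = -2.0 * math.pi * freq_hz / sample_rate_hz
    return [cmath.exp(1j * phase_step * n) for n in range(n_samples)]


def moving_average(samples, taps):
    """Crude FIR low-pass filter (assumed for illustration): mean of `taps` samples."""
    out = []
    for n in range(taps - 1, len(samples)):
        window = samples[n - taps + 1 : n + 1]
        out.append(sum(window) / taps)
    return out


def down_convert(samples, freq_hz, sample_rate_hz, taps, decimation):
    """Mix to baseband, low-pass filter, then keep every `decimation`-th sample."""
    oscillator = nco(freq_hz, sample_rate_hz, len(samples))
    mixed = [s * o for s, o in zip(samples, oscillator)]
    return moving_average(mixed, taps)[::decimation]


# Hypothetical test signal: a complex tone 2 MHz above the band centre,
# sampled at an assumed 20 MHz complex sample rate.
fs = 20e6
tone = [cmath.exp(2j * math.pi * 2e6 * n / fs) for n in range(200)]
baseband = down_convert(tone, 2e6, fs, taps=10, decimation=10)
```

    Tuning the oscillator to the tone frequency turns the tone into a constant at baseband, which the low-pass filter passes unchanged. A multi-stage down converter chains several such filter-and-decimate steps.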

  15. Quick plasma equilibrium reconstruction based on GPU

    Xiao Bingjia; Huang, Y.; Luo, Z.P.; Yuan, Q.P.; Lao, L.

    2014-01-01

    A parallel code named P-EFIT, which can complete an equilibrium reconstruction iteration in 250 μs, is described. It is built on the CUDA™ architecture using a Graphical Processing Unit (GPU). The optimization of middle-scale matrix multiplication on the GPU and an algorithm that efficiently solves block tri-diagonal linear systems in parallel are described. A benchmark test is conducted. A static test proves the accuracy of P-EFIT, and a simulation test proves the feasibility of using P-EFIT for real-time reconstruction on 65x65 computation grids. (author)

  16. GPU Pro 4 advanced rendering techniques

    Engel, Wolfgang

    2013-01-01

    GPU Pro4: Advanced Rendering Techniques presents ready-to-use ideas and procedures that can help solve many of your day-to-day graphics programming challenges. Focusing on interactive media and games, the book covers up-to-date methods producing real-time graphics. Section editors Wolfgang Engel, Christopher Oat, Carsten Dachsbacher, Michal Valient, Wessam Bahnassi, and Sebastien St-Laurent have once again assembled a high-quality collection of cutting-edge techniques for advanced graphics processing unit (GPU) programming. Divided into six sections, the book begins with discussions on the abi

  17. GPU Pro 5 advanced rendering techniques

    Engel, Wolfgang

    2014-01-01

    In GPU Pro5: Advanced Rendering Techniques, section editors Wolfgang Engel, Christopher Oat, Carsten Dachsbacher, Michal Valient, Wessam Bahnassi, and Marius Bjorge have once again assembled a high-quality collection of cutting-edge techniques for advanced graphics processing unit (GPU) programming. Divided into six sections, the book covers rendering, lighting, effects in image space, mobile devices, 3D engine design, and compute. It explores rasterization of liquids, ray tracing of art assets that would otherwise be used in a rasterized engine, physically based area lights, volumetric light

  18. Aspects of GPU perfomance in algorithms with random memory access

    Kashkovsky, Alexander V.; Shershnev, Anton A.; Vashchenkov, Pavel V.

    2017-10-01

    The numerical code for solving the Boltzmann equation on a hybrid computational cluster using the Direct Simulation Monte Carlo (DSMC) method showed that, on Tesla K40 accelerators, computational performance drops dramatically as the percentage of occupied GPU memory increases. Testing revealed that memory access time increases by a factor of tens once a certain critical percentage of memory is occupied. Moreover, this seems to be a common problem of all NVidia GPUs, arising from their architecture. A few modifications of the numerical algorithm were suggested to overcome this problem. One of them, based on splitting the memory into "virtual" blocks, resulted in a 2.5 times speedup.

  19. A GPU code for analytic continuation through a sampling method

    Johan Nordström

    2016-01-01

    We here present a code for performing analytic continuation of fermionic Green’s functions and self-energies as well as bosonic susceptibilities on a graphics processing unit (GPU). The code is based on the sampling method introduced by Mishchenko et al. (2000), and is written for the widely used CUDA platform from NVidia. Detailed scaling tests are presented, for two different GPUs, in order to highlight the advantages of this code with respect to standard CPU computations. Finally, as an example of possible applications, we provide the analytic continuation of model Gaussian functions, as well as more realistic test cases from many-body physics.

  20. GPU-accelerated 3D neutron diffusion code based on finite difference method

    Xu, Q.; Yu, G.; Wang, K. [Dept. of Engineering Physics, Tsinghua Univ. (China)

    2012-07-01

    The finite difference method, a traditional numerical approach to the neutron diffusion equation, is considered simpler and more precise than the coarse mesh nodal methods, but its wide application is bottlenecked by the huge memory and long computation time it requires. In recent years, the concept of General-Purpose computation on GPUs has provided us with a powerful computational engine for scientific research. In this study, a GPU-accelerated multi-group 3D neutron diffusion code based on the finite difference method was developed. First, a clean-sheet neutron diffusion code (3DFD-CPU) was written in C++ on the CPU architecture, and later ported to GPUs under NVIDIA's CUDA platform (3DFD-GPU). The IAEA 3D PWR benchmark problem was calculated in the numerical test, where three different codes, including the original CPU-based sequential code, the HYPRE (High Performance Pre-conditioners)-based diffusion code and CITATION, were used as counterpoints to test the efficiency and accuracy of the GPU-based program. The results demonstrate both high efficiency and adequate accuracy of the GPU implementation for the neutron diffusion equation. A speedup factor of about 46 times was obtained, using NVIDIA's GeForce GTX470 GPU card against a 2.50 GHz Intel Quad Q9300 CPU processor. Compared with the HYPRE-based code running in parallel on an 8-core tower server, a speedup of about 2 could still be observed. More encouragingly, without any mathematical acceleration technology, the GPU implementation ran about 5 times faster than CITATION, which was sped up by using the SOR method and Chebyshev extrapolation technique. (authors)
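
    The finite difference discretization and the SOR acceleration mentioned in this abstract can be illustrated on a much smaller problem. The sketch below is not the 3DFD code: it assumes a one-group, one-dimensional slab with zero-flux boundaries and a uniform source, and every name and parameter value in it is hypothetical.

```python
import math


def solve_diffusion_sor(n_cells, length, d_coef, sigma_a, source,
                        omega=1.5, tol=1e-10, max_iter=100_000):
    """Solve -D*phi'' + sigma_a*phi = S on (0, length), phi = 0 at both ends.

    Returns the flux at the n_cells interior grid points and the sweep count.
    omega = 1.0 gives plain Gauss-Seidel; 1 < omega < 2 is over-relaxation.
    """
    h = length / (n_cells + 1)
    off = d_coef / h**2              # coupling to each neighbouring point
    diag = 2.0 * off + sigma_a       # diagonal of the difference operator
    phi = [0.0] * (n_cells + 2)      # includes the two boundary zeros
    for sweep in range(1, max_iter + 1):
        max_change = 0.0
        for i in range(1, n_cells + 1):
            gauss_seidel = (source + off * (phi[i - 1] + phi[i + 1])) / diag
            new = (1.0 - omega) * phi[i] + omega * gauss_seidel
            max_change = max(max_change, abs(new - phi[i]))
            phi[i] = new
        if max_change < tol:
            return phi[1:-1], sweep
    raise RuntimeError("SOR did not converge")


def analytic_flux(x, length, d_coef, sigma_a, source):
    """Closed-form solution of the same slab problem, used as a reference."""
    kappa = math.sqrt(sigma_a / d_coef)
    half = length / 2.0
    return (source / sigma_a) * (1.0 - math.cosh(kappa * (x - half))
                                 / math.cosh(kappa * half))


# Hypothetical slab: 99 interior points, so the middle point sits at x = 5.
flux, sor_sweeps = solve_diffusion_sor(99, 10.0, 1.0, 0.1, 1.0)
reference_mid = analytic_flux(5.0, 10.0, 1.0, 0.1, 1.0)
```

    Each point depends only on its neighbours, which is what makes finite difference sweeps attractive for GPUs; a parallel version would typically use a Jacobi or red-black ordering rather than this sequential sweep.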

  2. A Visual Basic simulation software tool for performance analysis of a membrane-based advanced water treatment plant.

    Pal, P; Kumar, R; Srivastava, N; Chaudhuri, J

    2014-02-01

    A Visual Basic simulation software (WATTPPA) has been developed to analyse the performance of an advanced wastewater treatment plant. This user-friendly and menu-driven software is based on the dynamic mathematical model for an industrial wastewater treatment scheme that integrates chemical, biological and membrane-based unit operations. The software-predicted results agree very well with the experimental findings, as indicated by an overall correlation coefficient of the order of 0.99. The software permits pre-analysis and manipulation of input data, helps in optimization and displays the performance of an integrated plant visually on a graphical platform. It allows quick performance analysis of the whole system as well as the individual units. The software, the first of its kind in its domain and running in the well-known Microsoft Excel environment, is likely to be very useful in the successful design, optimization and operation of an advanced hybrid treatment plant for hazardous wastewater.

  3. PIConGPU - A highly-scalable particle-in-cell implementation for GPU clusters

    Bussmann, Michael; Burau, Heiko; Debus, Alexander; Huebl, Axel; Kluge, Thomas; Pausch, Richard; Schmeisser, Nils; Steiniger, Klaus; Widera, Rene; Wyderka, Nikolai; Schramm, Ulrich; Cowan, Thomas [HZDR, Dresden (Germany); Schneider, Benjamin [HZDR, Dresden (Germany); TU Dresden (Germany); Schmitt, Felix [NVIDIA, Austin, TX (United States); Grottel, Sebastian; Gumhold, Stefan [TU Dresden (Germany); Juckeland, Guido; Angel, Wolfgang [TU Dresden (Germany); ZIH, Dresden (Germany)

    2013-07-01

    PIConGPU can handle large-scale simulations of laser plasma and astrophysical plasma dynamics on GPU clusters with thousands of GPUs. High data throughput allows large parameter surveys to be conducted, but makes it necessary to rethink data analysis and look for new ways of analyzing large simulation data sets. The speedup seen on GPUs enables scientists to add physical effects to their code that until recently were too computationally demanding. We present recent results obtained with PIConGPU and discuss scaling behaviour, the most important building blocks of the code and new physics modules recently added. In addition, we give an outlook on data analysis, resilience and load balancing with PIConGPU.

  4. Performance Estimation for Hardware/Software codesign using Hierarchical Colored Petri Nets

    Grode, Jesper Nicolai Riis; Madsen, Jan; Jerraya, Ahmed-Amine

    1998-01-01

    This paper presents an approach for abstract modeling of the functional behavior of hardware architectures using Hierarchical Colored Petri Nets (HCPNs). Using HCPNs as architectural models has several advantages such as higher estimation accuracy, higher flexibility, and the need for only one...... estimation tool. This makes the approach very useful for designing component models used for performance estimation in Hardware/Software Codesign frameworks such as the LYCOS system. The paper presents the methodology and rules for designing component models using HCPNs. Two examples of architectural models...

  5. An open-source software program for performing Bonferroni and related corrections for multiple comparisons

    Kyle Lesack

    2011-01-01

    Increased type I error resulting from multiple statistical comparisons remains a common problem in the scientific literature. This may result in the reporting and promulgation of spurious findings. One approach to this problem is to correct groups of P-values for "family-wide significance" using a Bonferroni correction or the less conservative Bonferroni-Holm correction, or to correct for the "false discovery rate" with a Benjamini-Hochberg correction. Although several solutions are available for performing this correction through commercially available software, there are no widely available, easy-to-use open source programs to perform these calculations. In this paper we present an open source program written in Python 3.2 that performs calculations for standard Bonferroni, Bonferroni-Holm and Benjamini-Hochberg corrections.
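
    The three corrections named in this abstract can be written as adjusted P-values in a few lines. The following is a minimal sketch in modern Python of the standard textbook definitions, not the code of the program described above; the function names and the sample P-values are made up for illustration.

```python
def bonferroni(p_values):
    """Multiply every P-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values):
    """Bonferroni-Holm step-down adjustment (controls family-wise error)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest P-value is scaled by m, the next by m - 1, and so on;
        # the running maximum keeps the adjusted values monotone.
        running_max = max(running_max, (m - rank) * p_values[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg step-up adjustment (controls false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):
        i = order[rank]
        # The k-th smallest P-value is scaled by m / k, working down from
        # the largest so that a running minimum enforces monotonicity.
        running_min = min(running_min, p_values[i] * m / (rank + 1))
        adjusted[i] = running_min
    return adjusted


# Hypothetical P-values from four comparisons.
raw = [0.01, 0.04, 0.03, 0.005]
by_bonferroni = bonferroni(raw)
by_holm = holm(raw)
by_bh = benjamini_hochberg(raw)
```

    An adjusted value is compared directly with the chosen significance level. For the same input, the Holm values are never larger than the Bonferroni values, and the Benjamini-Hochberg values are never larger than the Holm values, which reflects the "less conservative" ordering described in the abstract.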

  6. Software Tools for Development on the Peregrine System | High-Performance Computing | NREL

    Software Tools for Development on the Peregrine System. Software Tools for ... and manage software at the source code level. Cross-Platform Make and SCons: The "Cross-Platform Make" (CMake) package is from Kitware, and SCons is a modern software build tool based on Python

  7. Multimodality imaging and state-of-art GPU technology in discriminating benign from malignant breast lesions on real time decision support system

    Kostopoulos, S; Glotsos, D; Kalatzis, I; Asvestas, P; Cavouras, D; Sidiropoulos, K; Dimitropoulos, N

    2014-01-01

    The aim of this study was to design a pattern recognition system for assisting the diagnosis of breast lesions, using image information from Ultrasound (US) and Digital Mammography (DM) imaging modalities. State-of-the-art computer technology was employed, based on commercial Graphics Processing Unit (GPU) cards and parallel programming. An experienced radiologist outlined breast lesions on both US and DM images from 59 patients employing a custom designed computer software application. Textural features were extracted from each lesion and were used to design the pattern recognition system. Several classifiers were tested for highest performance in discriminating benign from malignant lesions. Classifiers were also combined into ensemble schemes for further improvement of the system's classification accuracy. Following the pattern recognition system optimization, the final system was designed employing the Probabilistic Neural Network classifier (PNN) on the GPU card (GeForce 580GTX) using the CUDA programming framework and the C++ programming language. The use of such state-of-the-art technology renders the system capable of redesigning itself on site once additional verified US and DM data are collected. A mixture of US and DM features optimized performance, with over 90% accuracy in correctly classifying the lesions.

  8. Parallel, distributed and GPU computing technologies in single-particle electron microscopy

    Schmeisser, Martin; Heisen, Burkhard C.; Luettich, Mario; Busche, Boris; Hauer, Florian; Koske, Tobias; Knauber, Karl-Heinz; Stark, Holger

    2009-01-01

    An introduction to the current paradigm shift towards concurrency in software. Most known methods for the determination of the structure of macromolecular complexes are limited or at least restricted at some point by their computational demands. Recent developments in information technology such as multicore, parallel and GPU processing can be used to overcome these limitations. In particular, graphics processing units (GPUs), which were originally developed for rendering real-time effects in computer games, are now ubiquitous and provide unprecedented computational power for scientific applications. Each parallel-processing paradigm alone can improve overall performance; the increased computational performance obtained by combining all paradigms, unleashing the full power of today’s technology, makes certain applications feasible that were previously virtually impossible. In this article, state-of-the-art paradigms are introduced, the tools and infrastructure needed to apply these paradigms are presented and a state-of-the-art infrastructure and solution strategy for moving scientific applications to the next generation of computer hardware is outlined

  9. FastGCN: a GPU accelerated tool for fast gene co-expression networks.

    Meimei Liang

    Gene co-expression networks comprise one type of valuable biological network. Many methods and tools have been published to construct gene co-expression networks; however, most of these tools and methods are inconvenient and time consuming for large datasets. We have developed a user-friendly, accelerated and optimized tool for constructing gene co-expression networks that can fully harness the parallel nature of GPU (Graphic Processing Unit) architectures. Genetic entropies were exploited to filter out genes with no or small expression changes in the raw data preprocessing step. Pearson correlation coefficients were then calculated. After that, we normalized these coefficients and employed the False Discovery Rate to control for multiple tests. Finally, module identification was conducted to construct the co-expression networks. All of these calculations were implemented on a GPU. We also compressed the coefficient matrix to save space. We compared the performance of the GPU implementation with those of a multi-core CPU implementation with 16 CPU threads, a single-thread C/C++ implementation and a single-thread R implementation. Our results show that the GPU implementation largely outperforms the single-thread C/C++ and single-thread R implementations, and outperforms the multi-core CPU implementation when the number of genes increases. With the test dataset containing 16,000 genes and 590 individuals, the GPU implementation achieved greater than 63 times the speed of the single-thread R implementation when 50 percent of genes were filtered out, and about 80 times the speed when no genes were filtered out.
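
    The Pearson correlation step at the core of this pipeline is simple to state on the CPU. The sketch below is illustrative only and is not FastGCN code: the gene names and expression values are invented, and storing one coefficient per unordered gene pair is an assumed way of saving space, not necessarily the compression the authors used.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # A gene with constant expression has zero variance and would divide by
    # zero here, which is one reason to filter such genes out beforehand.
    return cov / math.sqrt(var_x * var_y)


def coexpression(expression):
    """Correlate every unordered pair of genes.

    `expression` maps a gene name to its expression values, one per individual.
    Only pairs (g, h) with g < h are stored, since the matrix is symmetric.
    """
    genes = sorted(expression)
    return {
        (g, h): pearson(expression[g], expression[h])
        for i, g in enumerate(genes)
        for h in genes[i + 1:]
    }


# Hypothetical expression levels for three genes in four individuals.
levels = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
pairs = coexpression(levels)
```

    Every pair can be computed independently of every other pair, so the number of independent tasks grows quadratically with the number of genes; this is the parallelism a GPU implementation can exploit.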

  10. A GPU-Accelerated Parameter Interpolation Thermodynamic Integration Free Energy Method.

    Giese, Timothy J; York, Darrin M

    2018-03-13

    There has been a resurgence of interest in free energy methods motivated by the performance enhancements offered by molecular dynamics (MD) software written for specialized hardware, such as graphics processing units (GPUs). In this work, we exploit the properties of a parameter-interpolated thermodynamic integration (PI-TI) method to connect states by their molecular mechanical (MM) parameter values. This pathway is shown to be better behaved for Mg²⁺ → Ca²⁺ transformations than traditional linear alchemical pathways (with and without soft-core potentials). The PI-TI method has the practical advantage that no modification of the MD code is required to propagate the dynamics, and unlike with linear alchemical mixing, only one electrostatic evaluation is needed (e.g., single call to particle-mesh Ewald) leading to better performance. In the case of AMBER, this enables all the performance benefits of GPU-acceleration to be realized, in addition to unlocking the full spectrum of features available within the MD software, such as Hamiltonian replica exchange (HREM). The TI derivative evaluation can be accomplished efficiently in a post-processing step by reanalyzing the statistically independent trajectory frames in parallel for high throughput. We also show how one can evaluate the particle mesh Ewald contribution to the TI derivative evaluation without needing to perform two reciprocal space calculations. We apply the PI-TI method with HREM on GPUs in AMBER to predict pKa values in double stranded RNA molecules and make comparison with experiments. Convergence to under 0.25 units for these systems required 100 ns or more of sampling per window and coupling of windows with HREM. We find that MM charges derived from ab initio QM/MM fragment calculations improve the agreement between calculation and experimental results.

  11. Enabling Diverse Software Stacks on Supercomputers using High Performance Virtual Clusters.

    Younge, Andrew J. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Pedretti, Kevin [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Grant, Ryan [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Brightwell, Ron [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)

    2017-05-01

    While large-scale simulations have been the hallmark of the High Performance Computing (HPC) community for decades, Large Scale Data Analytics (LSDA) workloads are gaining attention within the scientific community not only as a processing component to large HPC simulations, but also as standalone scientific tools for knowledge discovery. With the path towards Exascale, new HPC runtime systems are also emerging in a way that differs from classical distributed computing models. However, system software for such capabilities on the latest extreme-scale DOE supercomputing needs to be enhanced to more appropriately support these types of emerging software ecosystems. In this paper, we propose the use of Virtual Clusters on advanced supercomputing resources to enable systems to support not only HPC workloads, but also emerging big data stacks. Specifically, we have deployed the KVM hypervisor within Cray's Compute Node Linux on an XC-series supercomputer testbed. We also use libvirt and QEMU to manage and provision VMs directly on compute nodes, leveraging Ethernet-over-Aries network emulation. To our knowledge, this is the first use of KVM on a true MPP supercomputer. We investigate the overhead of our solution using HPC benchmarks, evaluating both single-node performance and weak scaling of a 32-node virtual cluster. Overall, we find single-node performance of our solution using KVM on a Cray is very efficient, with near-native performance. However, overhead increases by up to 20% as virtual cluster size increases, due to limitations of the Ethernet-over-Aries bridged network. Furthermore, we deploy Apache Spark with large data analysis workloads in a Virtual Cluster, effectively demonstrating how diverse software ecosystems can be supported by High Performance Virtual Clusters.

  12. Operator training and requalification at GPU Nuclear

    Long, R.L.; Barrett, R.J.; Newton, S.L.

    1982-01-01

    The operator training and requalification programs at GPU Nuclear's Oyster Creek (650 MWe BWR) and Three Mile Island-1 (776 MWe PWR) nuclear plants have undergone significant revisions since the Three Mile Island-2 accident. This paper describes the Training and Education organization, the expanded training facilities, including basic principle trainers and replica simulators, and the present operator training and requalification programs

  13. Parallel generation of architecture on the GPU

    Steinberger, Markus

    2014-05-01

    In this paper, we present a novel approach for the parallel evaluation of procedural shape grammars on the graphics processing unit (GPU). Unlike previous approaches that are either limited in the kind of shapes they allow, the amount of parallelism they can take advantage of, or both, our method supports state of the art procedural modeling including stochasticity and context-sensitivity. To increase parallelism, we explicitly express independence in the grammar, reduce inter-rule dependencies required for context-sensitive evaluation, and introduce intra-rule parallelism. Our rule scheduling scheme avoids unnecessary back and forth between CPU and GPU and reduces round trips to slow global memory by dynamically grouping rules in on-chip shared memory. Our GPU shape grammar implementation is multiple orders of magnitude faster than the standard in CPU-based rule evaluation, while offering equal expressive power. In comparison to the state of the art in GPU shape grammar derivation, our approach is nearly 50 times faster, while adding support for geometric context-sensitivity. © 2014 The Author(s) Computer Graphics Forum © 2014 The Eurographics Association and John Wiley & Sons Ltd. Published by John Wiley & Sons Ltd.

  14. Graph coarsening and clustering on the GPU

    Fagginger Auer, B.O.; Bisseling, R.H.

    2013-01-01

    Agglomerative clustering is an effective greedy way to quickly generate graph clusterings of high modularity. In an effort to use the power offered by multi-core CPU and GPU hardware to solve the clustering problem, we introduce a fine-grained shared-memory parallel graph

  15. GPU based acceleration of first principles calculation

    Tomono, H; Tsumuraya, K; Aoki, M; Iitaka, T

    2010-01-01

    We present Graphics Processing Unit (GPU) accelerated first principles electronic structure calculations. The FFT, which is the most time-consuming part, is accelerated about 10 times. As a result, the total computation time of a first principles calculation is reduced to 15 percent of that on the CPU.
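
    The two figures in this abstract can be related through Amdahl's law. The sketch below rests on an assumption the abstract does not state, namely that the FFT is the only accelerated part and everything else runs at CPU speed; under that assumption it estimates what share of the original runtime the FFT would have to occupy. The function names are hypothetical.

```python
def accelerated_fraction(total_ratio, kernel_speedup):
    """Share f of the original runtime spent in the accelerated kernel.

    Solves Amdahl's law, total_ratio = (1 - f) + f / kernel_speedup, for f.
    """
    return (1.0 - total_ratio) / (1.0 - 1.0 / kernel_speedup)


def overall_speedup(fraction, kernel_speedup):
    """Whole-program speedup when only `fraction` of the work is accelerated."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


# Figures from the abstract: FFT about 10 times faster, total time 15 percent.
fft_share = accelerated_fraction(total_ratio=0.15, kernel_speedup=10.0)
speedup = overall_speedup(fft_share, kernel_speedup=10.0)
```

    With these inputs the FFT share comes out at roughly 94 percent of the original runtime and the overall speedup at roughly 6.7 times, which is consistent with the description of the FFT as the most time-consuming part.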

  16. GPU Accelerated Surgical Simulators for Complex Morphology

    Mosegaard, Jesper; Sørensen, Thomas Sangild

    2005-01-01

    a spring-mass system in order to simulate a complex organ such as the heart. Computations are accelerated by taking advantage of modern graphics processing units (GPUs). Two GPU implementations are presented. They vary in their generality of spring connections and in the speedup factor they achieve...

  17. Synthetic Aperture Beamformation using the GPU

    Hansen, Jens Munk; Schaa, Dana; Jensen, Jørgen Arendt

    2011-01-01

    A synthetic aperture ultrasound beamformer is implemented for a GPU using the OpenCL framework. The implementation supports beamformation of either RF signals or complex baseband signals. Transmit and receive apodization can be either parametric or dynamic using a fixed F-number, a reference...

  18. Perprof-py: A Python Package for Performance Profile of Mathematical Optimization Software

    Abel Soares Siqueira

    2016-04-01

    A very important area of research in the field of Mathematical Optimization is the benchmarking of optimization packages to compare solvers. During benchmarking, one usually collects a large amount of information such as CPU time, number of function evaluations, number of iterations, and much more. This information, if presented as tables, can be difficult to analyze and compare because of the large amount of data. Therefore, tools to better process and understand optimization benchmark data have been developed. One of the most widespread tools is the Performance Profile graphics proposed by Dolan and Moré [2]. In this context, this paper describes perprof-py, a free/open source software that creates 'Performance Profile' graphics. This software produces graphics in PDF using LaTeX with PGF/TikZ [22] and PGFPLOTS [4] packages, in PNG using matplotlib [9], and in HTML using Bokeh [1]. Perprof-py can also be easily extended to be used with other plot libraries. It is implemented in Python 3 with support for internationalization, and is under the General Public License Version 3 (GPLv3).
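
    The quantity plotted in a Dolan and Moré performance profile is, for each solver, the fraction of problems it solves within a factor tau of the best cost recorded on that problem. The sketch below computes that fraction at a single tau. It does not use perprof-py or reproduce its interface; the function name, the solver names and the timings are invented for illustration.

```python
import math


def performance_profile(costs, tau):
    """Fraction of problems each solver handles within `tau` times the best cost.

    `costs` maps a solver name to a list with one cost per problem (for
    example CPU time); math.inf marks a problem the solver failed on.
    """
    solvers = list(costs)
    n_problems = len(costs[solvers[0]])
    # Best (smallest) cost achieved by any solver on each problem.
    best = [min(costs[s][p] for s in solvers) for p in range(n_problems)]
    profile = {}
    for s in solvers:
        within = sum(
            1
            for p in range(n_problems)
            if math.isfinite(costs[s][p]) and costs[s][p] <= tau * best[p]
        )
        profile[s] = within / n_problems
    return profile


# Hypothetical CPU times in seconds for two solvers on three problems.
timings = {
    "solverA": [1.0, 4.0, math.inf],
    "solverB": [2.0, 2.0, 3.0],
}
at_best = performance_profile(timings, tau=1.0)
within_double = performance_profile(timings, tau=2.0)
```

    The value at tau = 1 is the share of problems on which a solver was the fastest, and the limit for large tau is the share it solved at all. Plotting the fraction against tau for every solver gives the profile curves.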

  19. Software quality assurance in the 1996 performance assessment for the Waste Isolation Pilot Plant

    Froehlich, Gary K.; Ogden, Harvey C.; Byle, Kathleen A.

    2000-01-01

    The US Department of Energy (DOE) Waste Isolation Pilot Plant (WIPP), located in southeast New Mexico, is a deep geologic repository for the permanent disposal of transuranic waste generated by DOE defense-related activities. Sandia National Laboratories (SNL), in its role as scientific advisor to the DOE, is responsible for evaluating the long-term performance of the WIPP. This risk-based Performance Assessment (PA) is accomplished in part through the use of numerous scientific modeling codes, which rely for some of their inputs on data gathered during characterization of the site. The PA is subject to formal requirements set forth in federal regulations. In particular, the components of the calculation fall under the configuration management and software quality assurance aegis of the American Society of Mechanical Engineers (ASME) Nuclear Quality Assurance (NQA) requirements. This paper describes SNL's implementation of the NQA requirements regarding software quality assurance (SQA). The description of the implementation of SQA for a PA calculation addresses not only the interpretation of the NQA requirements but also the roles, deliverables, and resources necessary for effective implementation. Finally, examples are given which illustrate the effectiveness of SNL's SQA program, followed by a detailed discussion of lessons learned.

  20. Software quality assurance in the 1996 performance assessment for the Waste Isolation Pilot Plant

    Froehlich, G.K.; Ogden, H.C.; Byle, K.A.

    2000-01-01

    The US Department of Energy (DOE) Waste Isolation Pilot Plant (WIPP), located in southeast New Mexico, is a deep geologic repository for the permanent disposal of transuranic waste generated by DOE defense-related activities. Sandia National Laboratories (SNL), in its role as scientific advisor to the DOE, is responsible for evaluating the long-term performance of the WIPP. This risk-based Performance Assessment (PA) is accomplished in part through the use of numerous scientific modeling codes, which rely for some of their inputs on data gathered during characterization of the site. The PA is subject to formal requirements set forth in federal regulations. In particular, the components of the calculation fall under the configuration management and software quality assurance aegis of the American Society of Mechanical Engineers (ASME) Nuclear Quality Assurance (NQA) requirements. This paper describes SNL's implementation of the NQA requirements regarding software quality assurance (SQA). The description of the implementation of SQA for a PA calculation addresses not only the interpretation of the NQA requirements but also the roles, deliverables, and resources necessary for effective implementation. Finally, examples are given which illustrate the effectiveness of SNL's SQA program, followed by a detailed discussion of lessons learned.

  1. Performance evaluation of the zero-multipole summation method in modern molecular dynamics software.

    Sakuraba, Shun; Fukuda, Ikuo

    2018-05-04

    The zero-multipole summation method (ZMM) is a cutoff-based method for calculating electrostatic interactions in molecular dynamics simulations, utilizing an electrostatic neutralization principle as a physical basis. Since the accuracy of the ZMM has been shown to be sufficient in previous studies, it is highly desirable to clarify its practical performance. In this paper, the performance of the ZMM is compared with that of the smooth particle mesh Ewald method (SPME), where both methods are implemented in the molecular dynamics software package GROMACS. Extensive performance comparisons against a highly optimized, parameter-tuned SPME implementation are performed for various-sized water systems and two protein-water systems. We analyze in detail the dependence of the performance on the potential parameters and the number of CPU cores. Even though the ZMM uses a larger cutoff distance than the SPME does, the performance of the ZMM is comparable to or better than that of the SPME. This is because the ZMM does not require a time-consuming electrostatic convolution and because the ZMM gains short neighbor-list distances due to the smooth damping feature of the pairwise potential function near the cutoff length. We found, in particular, that the ZMM with quadrupole or octupole cancellation and no damping factor is an excellent candidate for the fast calculation of electrostatic interactions. © 2018 Wiley Periodicals, Inc.

  2. CPU and GPU (CUDA) Template Matching Comparison

    Evaldas Borcovas

    2014-05-01

    Image processing, computer vision and other complicated optical information processing algorithms require large resources. It is often desired to execute algorithms in real time. It is hard to fulfill such requirements with a single CPU. NVidia's CUDA technology enables the programmer to use the GPU resources in the computer. The current research was carried out with an Intel Pentium Dual-Core T4500 2.3 GHz processor with 4 GB DDR3 RAM (CPU I), an NVidia GeForce GT320M CUDA-compatible graphics card (GPU I), an Intel Core i5-2500K 3.3 GHz processor with 4 GB DDR3 RAM (CPU II), and an NVidia GeForce GTX 560 CUDA-compatible graphics card (GPU II). The additional libraries OpenCV 2.1 and the CUDA-compatible OpenCV 2.4.0 were used for the testing. The main tests were made with the standard function MatchTemplate from the OpenCV libraries. The algorithm uses a main image and a template, and the influence of these factors was tested. The main image and template were resized, and the algorithm's computing time and performance in Gtpix/s were measured. According to the results, GPU computing on the hardware mentioned above is up to 24 times faster when processing a large amount of information. When the images are small, the performance of the CPU and the GPU is not significantly different. The choice of template size influences the computing time on the CPU. The difference in computing time between the GPUs can be explained by the number of cores they have.
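
    To make the cost structure of template matching concrete, here is a minimal pure-Python sketch of the sliding-window idea using a sum-of-squared-differences score. It is not OpenCV's MatchTemplate implementation; the function name and the tiny test image are assumptions made for illustration only.

```python
def match_template_ssd(image, template):
    """Slide the template over the image and return (row, col, score) of the
    position with the smallest sum of squared differences (0 = exact match).

    image and template are lists of equal-length rows of numbers.
    """
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best = None
    for r in range(ih - th + 1):          # every vertical placement
        for c in range(iw - tw + 1):      # every horizontal placement
            ssd = 0
            for i in range(th):
                for j in range(tw):
                    d = image[r + i][c + j] - template[i][j]
                    ssd += d * d
            if best is None or ssd < best[2]:
                best = (r, c, ssd)
    return best


# Made-up 4x5 image containing the 2x2 template at row 2, column 2.
image = [
    [0, 0, 0, 0, 0],
    [0, 1, 2, 0, 0],
    [0, 3, 9, 7, 0],
    [0, 0, 5, 8, 0],
]
template = [
    [9, 7],
    [5, 8],
]
best_match = match_template_ssd(image, template)
```

    The work grows with the number of placements times the template area, which is consistent with the observation that template size affects CPU time. Each placement is scored independently of the others, which is what makes the problem a natural fit for a GPU.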

  3. PuReMD-GPU: A reactive molecular dynamics simulation package for GPUs

    Kylasa, S.B.; Aktulga, H.M.; Grama, A.Y.

    2014-01-01

    We present an efficient and highly accurate GP-GPU implementation of our community code, PuReMD, for reactive molecular dynamics simulations using the ReaxFF force field. PuReMD and its incorporation into LAMMPS (Reax/C) are used by a large number of research groups worldwide for simulating diverse systems ranging from biomembranes to explosives (RDX) at atomistic level of detail. The sub-femtosecond time-steps associated with ReaxFF strongly motivate significant improvements to per-timestep simulation time through effective use of GPUs. This paper presents, in detail, the design and implementation of PuReMD-GPU, which enables ReaxFF simulations on GPUs, as well as various performance optimization techniques we developed to obtain high performance on state-of-the-art hardware. Comprehensive experiments on model systems (bulk water and amorphous silica) are presented to quantify the performance improvements achieved by PuReMD-GPU and to verify its accuracy. In particular, our experiments show up to 16× improvement in runtime compared to our highly optimized CPU-only single-core ReaxFF implementation. PuReMD-GPU is a unique production code, and is currently available on request from the authors.

  4. PuReMD-GPU: A reactive molecular dynamics simulation package for GPUs

    Kylasa, S.B., E-mail: skylasa@purdue.edu [Department of Elec. and Comp. Eng., Purdue University, West Lafayette, IN 47907 (United States); Aktulga, H.M., E-mail: hmaktulga@lbl.gov [Lawrence Berkeley National Laboratory, 1 Cyclotron Rd, MS 50F-1650, Berkeley, CA 94720 (United States); Grama, A.Y., E-mail: ayg@cs.purdue.edu [Department of Computer Science, Purdue University, West Lafayette, IN 47907 (United States)

    2014-09-01

    We present an efficient and highly accurate GP-GPU implementation of our community code, PuReMD, for reactive molecular dynamics simulations using the ReaxFF force field. PuReMD and its incorporation into LAMMPS (Reax/C) are used by a large number of research groups worldwide for simulating diverse systems ranging from biomembranes to explosives (RDX) at atomistic level of detail. The sub-femtosecond time-steps associated with ReaxFF strongly motivate significant improvements to per-timestep simulation time through effective use of GPUs. This paper presents, in detail, the design and implementation of PuReMD-GPU, which enables ReaxFF simulations on GPUs, as well as various performance optimization techniques we developed to obtain high performance on state-of-the-art hardware. Comprehensive experiments on model systems (bulk water and amorphous silica) are presented to quantify the performance improvements achieved by PuReMD-GPU and to verify its accuracy. In particular, our experiments show up to 16× improvement in runtime compared to our highly optimized CPU-only single-core ReaxFF implementation. PuReMD-GPU is a unique production code, and is currently available on request from the authors.

  5. permGPU: Using graphics processing units in RNA microarray association studies

    George Stephen L

    2010-06-01

    Background: Many analyses of microarray association studies involve permutation, bootstrap resampling and cross-validation, which are ideally formulated as embarrassingly parallel computing problems. Given that these analyses are computationally intensive, scalable approaches that can take advantage of multi-core processor systems need to be developed. Results: We have developed a CUDA-based implementation, permGPU, that employs graphics processing units in microarray association studies. We illustrate the performance and applicability of permGPU within the context of permutation resampling for a number of test statistics. An extensive simulation study demonstrates a dramatic increase in performance when using permGPU on an NVIDIA GTX 280 card compared to an optimized C/C++ solution running on a conventional Linux server. Conclusions: permGPU is available as an open-source stand-alone application and as an extension package for the R statistical environment. It provides a dramatic increase in performance for permutation resampling analysis in the context of microarray association studies. The current version offers six test statistics for carrying out permutation resampling analyses for binary, quantitative and censored time-to-event traits.
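
    The permutation resampling described above can be sketched in a few lines. The example below is a generic two-group permutation test on the absolute difference of means. It is not permGPU's code and does not reproduce any of its six test statistics; the function name, the statistic and the data are assumptions chosen for illustration.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for the absolute difference of group means.

    Group labels are shuffled n_permutations times; the p-value is the share of
    shuffles whose statistic is at least as large as the observed one, with the
    usual +1 correction so that it is never exactly zero.
    """
    rng = random.Random(seed)

    def stat(a, b):
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = stat(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        # Each permutation is independent of all the others, which is why the
        # loop is embarrassingly parallel.
        rng.shuffle(pooled)
        if stat(pooled[:n_a], pooled[n_a:]) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


# Made-up expression values for one probe in two phenotype groups.
separated = permutation_p_value(
    [10.1, 9.8, 10.3, 10.0, 9.9, 10.2],
    [5.0, 5.2, 4.9, 5.1, 4.8, 5.3],
)
identical = permutation_p_value([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

    In a microarray study the same loop runs once per probe, so the total work is (number of probes) × (number of permutations) independent evaluations.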

  6. Fully iterative scatter corrected digital breast tomosynthesis using GPU-based fast Monte Carlo simulation and composition ratio update

    Kim, Kyungsang; Ye, Jong Chul, E-mail: jong.ye@kaist.ac.kr [Bio Imaging and Signal Processing Laboratory, Department of Bio and Brain Engineering, KAIST 291, Daehak-ro, Yuseong-gu, Daejeon 34141 (Korea, Republic of); Lee, Taewon; Cho, Seungryong [Medical Imaging and Radiotherapeutics Laboratory, Department of Nuclear and Quantum Engineering, KAIST 291, Daehak-ro, Yuseong-gu, Daejeon 34141 (Korea, Republic of); Seong, Younghun; Lee, Jongha; Jang, Kwang Eun [Samsung Advanced Institute of Technology, Samsung Electronics, 130, Samsung-ro, Yeongtong-gu, Suwon-si, Gyeonggi-do, 443-803 (Korea, Republic of); Choi, Jaegu; Choi, Young Wook [Korea Electrotechnology Research Institute (KERI), 111, Hanggaul-ro, Sangnok-gu, Ansan-si, Gyeonggi-do, 426-170 (Korea, Republic of); Kim, Hak Hee; Shin, Hee Jung; Cha, Joo Hee [Department of Radiology and Research Institute of Radiology, Asan Medical Center, University of Ulsan College of Medicine, 88 Olympic-ro, 43-gil, Songpa-gu, Seoul, 138-736 (Korea, Republic of)

    2015-09-15

    Purpose: In digital breast tomosynthesis (DBT), scatter correction is highly desirable, as it improves image quality at low doses. Because the DBT detector panel is typically stationary during the source rotation, antiscatter grids are not generally compatible with DBT; thus, a software-based scatter correction is required. This work proposes a fully iterative scatter correction method that uses a novel fast Monte Carlo simulation (MCS) with a tissue-composition ratio estimation technique for DBT imaging. Methods: To apply MCS to scatter estimation, the material composition in each voxel should be known. To overcome the lack of prior accurate knowledge of tissue composition for DBT, a tissue-composition ratio is estimated based on the observation that the breast tissues are principally composed of adipose and glandular tissues. Using this approximation, the composition ratio can be estimated from the reconstructed attenuation coefficients, and the scatter distribution can then be estimated by MCS using the composition ratio. The scatter estimation and image reconstruction procedures can be performed iteratively until an acceptable accuracy is achieved. For practical use, (i) the authors have implemented a fast MCS using a graphics processing unit (GPU), (ii) the MCS is simplified to transport only x-rays in the energy range of 10–50 keV, modeling Rayleigh and Compton scattering and the photoelectric effect using the tissue-composition ratio of adipose and glandular tissues, and (iii) downsampling is used because the scatter distribution varies rather smoothly. Results: The authors have demonstrated that the proposed method can accurately estimate the scatter distribution, and that the contrast-to-noise ratio of the final reconstructed image is significantly improved. The authors validated the performance of the MCS by changing the tissue thickness, composition ratio, and x-ray energy. The authors confirmed that the tissue-composition ratio estimation was quite

  7. CUDAICA: GPU Optimization of Infomax-ICA EEG Analysis

    Federico Raimondo

    2012-01-01

    In recent years, Independent Component Analysis (ICA) has become a standard method for identifying relevant dimensions of the data in neuroscience. ICA is a very reliable method to analyze data, but it is computationally very costly. The use of ICA for online analysis of the data, as used in brain-computer interfaces, is almost completely prohibitive. We show that the speed of ICA can be increased about 25-fold at almost no cost (a fast video card). The EEG data, which is a repetition of many independent signals in multiple channels, is very suitable for processing using the vector processors included in graphics units. We profiled the implementation of this algorithm and detected two main types of operations responsible for the processing bottleneck and taking almost 80% of computing time: vector-matrix and matrix-matrix multiplications. Simply replacing calls to basic linear algebra functions with the standard CUBLAS routines provided by GPU manufacturers does not increase performance, owing to CUDA kernel launch overhead. Instead, we developed a GPU-based solution that, compared with the original BLAS and CUBLAS versions, obtains a 25x increase in performance for the ICA calculation.

  8. A GPU-accelerated implicit meshless method for compressible flows

    Zhang, Jia-Le; Ma, Zhi-Hua; Chen, Hong-Quan; Cao, Cheng

    2018-05-01

    This paper develops a recently proposed GPU-based two-dimensional explicit meshless method (Ma et al., 2014) by devising and implementing an efficient parallel LU-SGS implicit algorithm to further improve the computational efficiency. The capability of the original 2D meshless code is extended to deal with 3D complex compressible flow problems. To resolve the inherent data dependency of the standard LU-SGS method, which causes thread-racing conditions destabilizing numerical computation, a generic rainbow coloring method is presented and applied to organize the computational points into different groups by painting neighboring points with different colors. The original LU-SGS method is modified and parallelized accordingly to perform calculations in a color-by-color manner. The CUDA Fortran programming model is employed to develop the key kernel functions to apply boundary conditions, calculate time steps, evaluate residuals as well as advance and update the solution in the temporal space. A series of two- and three-dimensional test cases including compressible flows over single- and multi-element airfoils and an M6 wing are carried out to verify the developed code. The obtained solutions agree well with experimental data and other computational results reported in the literature. Detailed analysis of the performance of the developed code reveals that the developed CPU-based implicit meshless method is at least four to eight times faster than its explicit counterpart. The computational efficiency of the implicit method could be further improved by ten to fifteen times on the GPU.
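
    The abstract does not spell out the rainbow coloring method, so the sketch below uses a standard greedy graph coloring as a stand-in to show the principle: points that share an edge never receive the same color, so all points of one color can be updated concurrently without racing on each other's data. The function names and the five-point neighbor graph are hypothetical.

```python
def greedy_coloring(neighbors):
    """Assign each point the smallest color index not used by its neighbors.

    neighbors maps point id -> iterable of neighboring point ids.
    Returns a dict point id -> color index.
    """
    colors = {}
    for point in sorted(neighbors):
        used = {colors[n] for n in neighbors[point] if n in colors}
        color = 0
        while color in used:
            color += 1
        colors[point] = color
    return colors


def group_by_color(colors):
    """Return the points grouped by color, ordered by color index."""
    groups = {}
    for point, color in colors.items():
        groups.setdefault(color, []).append(point)
    return [groups[c] for c in sorted(groups)]


# Hypothetical cloud of five points connected in a ring: 0-1-2-3-4-0.
neighbors = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
colors = greedy_coloring(neighbors)
groups = group_by_color(colors)
```

    A color-by-color sweep then processes the groups one after another, with the points inside a group handled in parallel.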

  9. Cobalt: A GPU-based correlator and beamformer for LOFAR

    Broekema, P. Chris; Mol, J. Jan David; Nijboer, R.; van Amesfoort, A. S.; Brentjens, M. A.; Loose, G. Marcel; Klijn, W. F. A.; Romein, J. W.

    2018-04-01

    For low-frequency radio astronomy, software correlation and beamforming on general purpose hardware is a viable alternative to custom designed hardware. LOFAR, a new-generation radio telescope centered in the Netherlands with international stations in Germany, France, Ireland, Poland, Sweden and the UK, has successfully used software real-time processors based on IBM Blue Gene technology since 2004. Since then, developments in technology have allowed us to build a system based on commercial off-the-shelf components that combines the same capabilities with lower operational cost. In this paper, we describe the design and implementation of a GPU-based correlator and beamformer with the same capabilities as the Blue Gene based systems. We focus on the design approach taken, and show the challenges faced in selecting an appropriate system. The design, implementation and verification of the software system show the value of a modern test-driven development approach. Operational experience, based on three years of operations, demonstrates that a general purpose system is a good alternative to the previous supercomputer-based system or custom-designed hardware.

  10. Forward progress on GPU concurrency

    Donaldson, A.F.; Ketema, J.; Sorensen, T.; Wickerson, J.

    2017-01-01

    The tutorial at CONCUR will provide a practical overview of work undertaken over the last six years in the Multicore Programming Group at Imperial College London, and with collaborators internationally, related to understanding and reasoning about concurrency in software designed for acceleration on

  11. A software package for evaluating the performance of a star sensor operation

    Sarpotdar, Mayuresh; Mathew, Joice; Sreejith, A. G.; Nirmal, K.; Ambily, S.; Prakash, Ajin; Safonova, Margarita; Murthy, Jayant

    2017-02-01

    We have developed a low-cost off-the-shelf component star sensor (StarSense) for use in minisatellites and CubeSats to determine the attitude of a satellite in orbit. StarSense is an imaging camera with a limiting magnitude of 6.5, which extracts information from star patterns it records in the images. The star sensor implements a centroiding algorithm to find centroids of the stars in the image, a Geometric Voting algorithm for star pattern identification, and a QUEST algorithm for attitude quaternion calculation. Here, we describe the software package to evaluate the performance of these algorithms as a single star sensor operating system. We simulate the ideal case where sky background and instrument errors are omitted, and a more realistic case where noise and camera parameters are added to the simulated images. We evaluate such performance parameters of the algorithms as attitude accuracy, calculation time, required memory, star catalog size, sky coverage, etc., and estimate the errors introduced by each algorithm. This software package is written for use in MATLAB. The testing is parametrized for different hardware parameters, such as the focal length of the imaging setup, the field of view (FOV) of the camera, angle measurement accuracy, distortion effects, etc., and therefore can be applied to evaluate the performance of such algorithms in any star sensor. For its hardware implementation on our StarSense, we are currently porting the codes in the form of functions written in C. This is done keeping in view its easy implementation on any star sensor electronics hardware.
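
    The first stage of the pipeline above, centroiding, can be illustrated with an intensity-weighted centre of mass. The abstract does not say which centroiding variant StarSense uses, so this is only a common textbook form; the function name, the threshold parameter and the pixel values are assumptions.

```python
def weighted_centroid(image, threshold=0.0):
    """Intensity-weighted centroid (x, y) of the pixels above a threshold.

    image is a list of rows of pixel values; x is the column index and y the
    row index. Returns None when no pixel exceeds the threshold.
    """
    total = sum_x = sum_y = 0.0
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            weight = value - threshold  # background-subtracted intensity
            if weight > 0:
                total += weight
                sum_x += weight * x
                sum_y += weight * y
    if total == 0:
        return None
    return (sum_x / total, sum_y / total)


# Made-up 4x4 star image, brighter on the right-hand side of the blob.
star = [
    [0, 0, 0, 0],
    [0, 1, 2, 0],
    [0, 1, 2, 0],
    [0, 0, 0, 0],
]
centroid = weighted_centroid(star)
```

    Because the weights are intensities, the estimate lands between pixel centres, which is how sub-pixel star positions are obtained.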

  12. Impact of Recent Hardware and Software Trends on High Performance Transaction Processing and Analytics

    Mohan, C.

    In this paper, I survey briefly some of the recent and emerging trends in hardware and software features which impact high performance transaction processing and data analytics applications. These features include multicore processor chips, ultra large main memories, flash storage, storage class memories, database appliances, field programmable gate arrays, transactional memory, key-value stores, and cloud computing. While some applications, e.g., Web 2.0 ones, were initially built without traditional transaction processing functionality in mind, slowly system architects and designers are beginning to address such previously ignored issues. The availability, analytics and response time requirements of these applications were initially given more importance than ACID transaction semantics and resource consumption characteristics. A project at IBM Almaden is studying the implications of phase change memory on transaction processing, in the context of a key-value store. Bitemporal data management has also become an important requirement, especially for financial applications. Power consumption and heat dissipation properties are also major considerations in the emergence of modern software and hardware architectural features. Considerations relating to ease of configuration, installation, maintenance and monitoring, and improvement of total cost of ownership have resulted in database appliances becoming very popular. The MapReduce paradigm is now quite popular for large scale data analysis, in spite of the major inefficiencies associated with it.

  13. Simulation of pellet-cladding interaction with the Pleiades fuel performance software environment

    Michel, B.; Nonon, C.; Sercombe, J.; Michel, F.; Marelle, V.

    2013-01-01

    This paper focuses on the PLEIADES fuel performance software environment and its application to the modeling of pellet-cladding interaction (PCI). The PLEIADES platform has been under development for 10 yr; a unified software environment, including the multidimensional finite element solver CAST3M, has been used to develop eight computation schemes now in operation. Among the latter, the ALCYONE application is devoted to pressurized water reactor fuel rod behavior. This application provides a three-dimensional (3-D) model for a detailed analysis of fuel element behavior and enables validation by comparing simulation and post-irradiation examination results (cladding residual diameter and ridges, dishing filling, pellet cracking, etc.). In recent years the 3-D computation scheme of the ALCYONE application has been enriched with a complete set of physical models to take into account the thermomechanical and chemical-physical behavior of the fuel element under irradiation. These models have been validated through the ALCYONE application on a large experimental database composed of approximately 400 study cases. The strong point of the ALCYONE application is its local approach to stress-corrosion-cracking rupture under PCI, which can be computed with the 3-D finite element solver. Further developments for PCI modeling in the PLEIADES platform are devoted to a new mesh refinement method for assessing stress-and-strain concentration (multigrid technique) and a new component for assessing fission product chemical recombination. (authors)

  14. Computer software.

    Rosenthal, L E

    1986-10-01

    Software is the component in a computer system that permits the hardware to perform the various functions that a computer system is capable of doing. The history of software and its development can be traced to the early nineteenth century. All computer systems are designed to utilize the "stored program concept" as first developed by Charles Babbage in the 1850s. The concept was lost until the mid-1940s, when modern computers made their appearance. Today, because of the complex and myriad tasks that a computer system can perform, there has been a differentiation of types of software. There is software designed to perform specific business applications. There is software that controls the overall operation of a computer system. And there is software that is designed to carry out specialized tasks. Regardless of type, software is the most critical component of any computer system. Without it, all one has is a collection of circuits, transistors, and silicon chips.

  15. Installing and Setting Up Git Software Tool on Windows | High-Performance Computing | NREL

    Learn how to set up the Git software tool on Windows for use with the Peregrine system. Git is ... In this doc, we'll show you how to get git installed on Windows 7, and how to get things set up on NREL's

  16. Parameter definition using vibration prediction software leads to significant drilling performance improvements

    Amorim, Dalmo; Hanley, Chris; Fonseca, Isaac; Santos, Juliana [National Oilwell Varco, Houston TX (United States); Leite, Daltro J.; Borella, Augusto; Gozzi, Danilo [Petroleo Brasileiro S.A. (PETROBRAS), Rio de Janeiro, RJ (Brazil)

    2012-07-01

    The understanding and mitigation of downhole vibration has been a heavily researched subject in the oil industry, as vibration results in more expensive drilling operations: vibrations significantly diminish the amount of effective drilling energy available to the bit and generate forces that can push the bit or the Bottom Hole Assembly (BHA) off its concentric axis of rotation, producing high magnitude impacts with the borehole wall. In order to drill ahead, a sufficient amount of energy must be supplied by the rig to overcome the resistance of the drilling system, including the reactive torque of the system, drag forces, fluid pressure losses and energy dissipated by downhole vibrations, while still providing the bit with the energy required to fail the rock. If the drill string enters resonant modes of vibration, not only does it decrease the amount of energy available to drill, but it also increases the potential for catastrophic downhole equipment and drill bit failures. In this sense, the mitigation of downhole vibrations will result in faster, smoother, and cheaper drilling operations. A software tool using Finite Element Analysis (FEA) has been developed to provide better understanding of downhole vibration phenomena in drilling environments. The software tool calculates the response of the drilling system at various input conditions, based on the design of the wellbore along with the geometry of the Bottom Hole Assembly (BHA) and the drill string. It identifies where undesired levels of resonant vibration will be driven by certain combinations of specific drilling parameters, and also which combinations of drilling parameters will result in lower levels of vibration, so that the fewest shocks, the highest penetration rate and the lowest cost per foot can be achieved. With the growing performance of personal computers, complex software systems modeling drilling vibrations using FEA have become accessible to a wider audience of field users, further complementing with real time

  17. Parallel computing in cluster of GPU applied to a problem of nuclear engineering

    Moraes, Sergio Ricardo S.; Heimlich, Adino; Resende, Pedro

    2013-01-01

    Cluster computing has been widely used as a low-cost alternative for parallel processing in scientific applications. With the use of the Message-Passing Interface (MPI) protocol, development became even more accessible and widespread in the scientific community. A more recent trend is the use of the Graphics Processing Unit (GPU), a powerful co-processor able to perform hundreds of instructions in parallel, reaching hundreds of times the processing capacity of a CPU. However, a standard PC does not, in general, allow more than two GPUs. Hence, this work proposes the development and evaluation of a hybrid low-cost parallel approach to the solution of a typical nuclear engineering problem. The idea is to use cluster parallelism technology (MPI) together with GPU programming techniques (CUDA, Compute Unified Device Architecture) to simulate neutron transport through a slab using the Monte Carlo method. Programs using MPI and CUDA technologies were developed on a cluster comprising four quad-core computers with 2 GPUs each. Experiments applying different configurations, from 1 to 8 GPUs, were performed, and the results were compared with the sequential (non-parallel) version. A speedup of about 2,000 times was observed when comparing the 8-GPU version with the sequential version. The results presented here are discussed and analyzed with the objective of outlining the gains and possible limitations of the proposed approach. (author)
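
    A sequential toy version of Monte Carlo neutron transport through a slab is sketched below. It is not the authors' MPI/CUDA program: the one-dimensional, one-speed model with isotropic scattering, the function name and all parameter values are simplifying assumptions made for illustration. The structure shows why the problem parallelizes so well, since every neutron history is independent.

```python
import math
import random


def slab_transmission(thickness, sigma_t, scatter_prob, n_histories, seed=0):
    """Estimate the fraction of neutrons that cross a 1-D slab.

    Neutrons enter the left face travelling straight ahead. Free path lengths
    are sampled from an exponential distribution with total cross section
    sigma_t. At each collision the neutron scatters isotropically with
    probability scatter_prob and is absorbed otherwise.
    """
    rng = random.Random(seed)
    transmitted = 0
    for _ in range(n_histories):
        x, mu = 0.0, 1.0  # position and direction cosine
        while True:
            # 1 - random() lies in (0, 1], so the logarithm is always defined.
            path = -math.log(1.0 - rng.random()) / sigma_t
            x += mu * path
            if x >= thickness:
                transmitted += 1  # escaped through the right face
                break
            if x < 0.0:
                break  # escaped back through the left face
            if rng.random() >= scatter_prob:
                break  # absorbed at the collision site
            mu = 2.0 * rng.random() - 1.0  # isotropic scattering
    return transmitted / n_histories


# Pure absorber: the exact answer is exp(-sigma_t * thickness).
pure_absorber = slab_transmission(2.0, 1.0, 0.0, 50000)
# Same slab with 50% scattering: some collided neutrons also get through.
with_scatter = slab_transmission(2.0, 1.0, 0.5, 50000)
```

    In a hybrid scheme of the kind described, the histories could be split across MPI ranks and, within each rank, across GPU threads, with only the final tallies combined.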

  18. GPU-accelerated Kernel Regression Reconstruction for Freehand 3D Ultrasound Imaging.

    Wen, Tiexiang; Li, Ling; Zhu, Qingsong; Qin, Wenjian; Gu, Jia; Yang, Feng; Xie, Yaoqin

    2017-07-01

    The volume reconstruction method plays an important role in improving reconstructed volumetric image quality for freehand three-dimensional (3D) ultrasound imaging. By utilizing the capability of the programmable graphics processing unit (GPU), we can achieve real-time incremental volume reconstruction at a speed of 25-50 frames per second (fps). After incremental reconstruction and visualization, hole-filling is performed on the GPU to fill the remaining empty voxels. However, traditional pixel nearest neighbor-based hole-filling fails to reconstruct a volume with high image quality. In contrast, kernel regression provides an accurate volume reconstruction method for 3D ultrasound imaging, but at the cost of heavy computational complexity. In this paper, a GPU-based fast kernel regression method is proposed for high-quality volume reconstruction after the incremental reconstruction of freehand ultrasound. The experimental results show that improved image quality for speckle reduction and detail preservation can be obtained with a kernel window size of [Formula: see text] and a kernel bandwidth of 1.0. The proposed GPU-based method can be over 200 times faster than the same computation on a central processing unit (CPU), and a volume of 50 million voxels in our experiment can be reconstructed within 10 seconds.
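
    The hole-filling idea can be illustrated in one dimension with zeroth-order (Nadaraya-Watson) kernel regression and a Gaussian kernel. This is a simplified stand-in, not the paper's 3D GPU method: the function name, the window half-width and the sample values are assumptions, and only the bandwidth of 1.0 is taken from the abstract.

```python
import math


def kernel_regression_fill(values, bandwidth=1.0, window=2):
    """Fill empty samples (None) with a Gaussian-weighted average of the
    known samples at most `window` positions away.

    Known samples are left unchanged; a hole with no known neighbor inside
    the window stays None.
    """
    filled = list(values)
    for i, value in enumerate(values):
        if value is not None:
            continue
        numerator = denominator = 0.0
        lo = max(0, i - window)
        hi = min(len(values), i + window + 1)
        for j in range(lo, hi):
            if values[j] is None:
                continue
            weight = math.exp(-((i - j) ** 2) / (2.0 * bandwidth ** 2))
            numerator += weight * values[j]
            denominator += weight
        if denominator > 0.0:
            filled[i] = numerator / denominator
    return filled


# Made-up scan line with three empty voxels.
line = [10.0, None, 20.0, 20.0, None, None, 40.0]
filled_line = kernel_regression_fill(line)
```

    Every hole is computed from the original samples only, so the holes can be filled independently of one another, one GPU thread per empty voxel in a parallel version.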

  19. Collision detection of convex polyhedra on the NVIDIA GPU architecture for the discrete element method

    Govender, Nicolin

    2015-09-01

    ... consideration due to the architectural differences between CPU and GPU platforms. This paper describes in detail the DEM algorithms and heuristics that are optimized for the parallel NVIDIA Kepler GPU architecture. This includes a GPU optimized collision...

  20. Numerical verification of equilibrium chemistry software within nuclear fuel performance codes

    Piro, M.H.; Lewis, B.J.; Thompson, W.T.; Simunovic, S.; Besmann, T.M.

    2010-01-01

    A numerical tool is in an advanced state of development to compute the equilibrium compositions of phases and their proportions in multi-component systems of importance to the nuclear industry. The resulting software is being conceived for direct integration into large multi-physics fuel performance codes, particularly for providing transport source terms, material properties, and boundary conditions in heat and mass transport modules. Consequently, any numerical errors produced in equilibrium chemistry computations will be propagated in subsequent heat and mass transport calculations, thus falsely predicting nuclear fuel behaviour. The necessity for a reliable method to numerically verify chemical equilibrium computations is emphasized by the requirement to handle the very large number of elements necessary to capture the entire fission product inventory. A simple, reliable and comprehensive numerical verification method called the Gibbs Criteria is presented which can be invoked by any equilibrium chemistry solver for quality assurance purposes. (author)

  1. Study of the performance of the data acquisition chain for BCM1F software upgrade

    Hempel, Maria

    2011-05-15

    BCM1F, the Fast Beam Conditions Monitor, is a sub-detector of the CMS experiment at LHC. It monitors the beam halo and the collision product rates inside the CMS experiment. The data acquisition of BCM1F is independent from CMS. Major components of the BCM1F back-end are discriminators, ADCs, TDCs, look-up tables and a Veto module. In the thesis the performance of several components is investigated. For the TDC two different readout modes are compared, and the impact of a Ring Buffer in the readout software was investigated. For one discriminator the thresholds of all channels are investigated and offsets of about 10 mV are found. Data taken in the LHC runs with the TDC are presented and discussed. Also the application of BCM1F as a luminosity monitor is studied. (orig.)

  2. Study of the performance of the data acquisition chain for BCM1F software upgrade

    Hempel, Maria

    2011-05-01

    BCM1F, the Fast Beam Conditions Monitor, is a sub-detector of the CMS experiment at LHC. It monitors the beam halo and the collision product rates inside the CMS experiment. The data acquisition of BCM1F is independent from CMS. Major components of the BCM1F back-end are discriminators, ADCs, TDCs, look-up tables and a Veto module. In the thesis the performance of several components is investigated. For the TDC two different readout modes are compared, and the impact of a Ring Buffer in the readout software was investigated. For one discriminator the thresholds of all channels are investigated and offsets of about 10 mV are found. Data taken in the LHC runs with the TDC are presented and discussed. Also the application of BCM1F as a luminosity monitor is studied. (orig.)

  3. Automated load balancing in the ATLAS high-performance storage software

    Le Goff, Fabrice; The ATLAS collaboration

    2017-01-01

    The ATLAS experiment collects proton-proton collision events delivered by the LHC accelerator at CERN. The ATLAS Trigger and Data Acquisition (TDAQ) system selects, transports and eventually records event data from the detector at several gigabytes per second. The data are recorded on transient storage before being delivered to permanent storage. The transient storage consists of high-performance direct-attached storage servers accounting for about 500 hard drives. The transient storage operates dedicated software in the form of a distributed multi-threaded application. The workload includes both CPU-demanding and IO-oriented tasks. This paper presents the original application threading model for this particular workload, discussing the load-sharing strategy among the available CPU cores. The limitations of this strategy were reached in 2016 due to changes in the trigger configuration involving a new data distribution pattern. We then describe a novel data-driven load-sharing strategy, designed to automatical...

  4. ATLAS High Level Calorimeter Trigger Software Performance for Cosmic Ray Events

    Oliveira Damazio, Denis; The ATLAS collaboration

    2009-01-01

    The ATLAS detector is undergoing an intense commissioning effort with cosmic rays in preparation for the first LHC collisions next spring. Combined runs with all of the ATLAS subsystems are being taken in order to evaluate the detector performance. This is also a unique opportunity for the trigger system to be studied with different detector operation modes, such as different event rates and detector configurations. The ATLAS trigger starts with a hardware-based system which tries to identify detector regions where interesting physics objects may be found (e.g., large energy depositions in the calorimeter system). An approved event will be further processed by more complex software algorithms at the second level, where detailed features are extracted (full detector granularity data for small portions of the detector are available). Events accepted at this level will be further processed at the so-called event filter level. Full detector data at full granularity is available for offline-like processing with complete calib...

  5. FMEA Performed on the SPINLINE3 Operational System Software as part of the TIHANGE 1 NIS Refurbishment Safety Case

    Ristord, L.; Esmenjaud, C.

    2002-01-01

    This paper introduces the SPINLINE3 technology and the TIHANGE 1 NIS project. It then focuses on the specificity of FMEA performed on software. It points out the benefits of this analysis and also some of its limitations and possible developments. It also gives characteristics that, if present in the software, help the analysis and the defenses. It takes as an example the analysis performed on the Operational System Software of the Schneider Electric safety digital generic platform SPINLINE3. The new TIHANGE 1 Nuclear Instrumentation System successfully started operation at the beginning of March 2001 after the plant outage, as planned at the beginning of the project. The choice of a software-based technology has raised the issue of the risk of CCF due to the same software being used in redundant independent units. Implementing functional diversity or equipment diversity has been considered but found either not practicable or of little value within this context. The safety characteristics of the SPINLINE3 solution and the stringent and proven safety software development process applied by the Nuclear department of the Schneider Electric company have made acceptable the principle of a design based on redundant identical processing units for this project. In addition, because of the possible consequences in case of the NIS not performing its protection function on demand, the licensing authority has required an FMEA oriented toward the SCCF risk as part of the safety case. This FMEA has been performed on the NIS architecture, the SPINLINE3 Operational System Software, and the three Tihange 1 application software packages (i.e. source, intermediate and power range). The process used and the results have been elaborated by Schneider Electric and reviewed by the customer and the licensing authority throughout the project development until final acceptance. Issues have been raised and answers and/or complementary analyses provided, some of them making direct references to the

  6. High-throughput GPU-based LDPC decoding

    Chang, Yang-Lang; Chang, Cheng-Chun; Huang, Min-Yu; Huang, Bormin

    2010-08-01

    Low-density parity-check (LDPC) codes are linear block codes known to approach the Shannon limit via the iterative sum-product algorithm. LDPC codes have been adopted in most current communication systems such as DVB-S2, WiMAX, Wi-Fi and 10GBASE-T. The need for reliable and flexible communication links across a wide variety of communication standards and configurations has created demand for high-performance and flexible computing. Accordingly, finding a fast and reconfigurable development platform for designing high-throughput LDPC decoders has become important, especially for rapidly changing communication standards and configurations. In this paper, a new graphics processing unit (GPU) LDPC decoding platform with asynchronous data transfer is proposed to realize this practical implementation. Experimental results showed that the proposed GPU-based decoder achieved a 271x speedup compared to its CPU-based counterpart. It can serve as a high-throughput LDPC decoder.
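
    As a rough illustration of what "linear block code" means here, the sketch below computes the parity-check syndrome that an iterative decoder can use as a stopping test. The matrix `H` is an assumption chosen for brevity (the small (7,4) Hamming code, not a real LDPC matrix and not taken from the paper), and the function names are hypothetical.

```python
# Illustrative parity-check matrix (assumption: the (7,4) Hamming code,
# used only because it is small; real LDPC matrices are large and sparse).
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]


def syndrome(H, word):
    """Return H * word over GF(2); one bit per parity check."""
    return [sum(h * b for h, b in zip(row, word)) % 2 for row in H]


def is_codeword(H, word):
    """A word is a codeword when every parity check is satisfied."""
    return not any(syndrome(H, word))


codeword = [1, 1, 1, 0, 0, 0, 0]
corrupted = codeword[:]
corrupted[4] ^= 1  # flip one bit to simulate a channel error
```

    Each row of `H` is an independent parity check, which is one reason this kind of decoding maps naturally onto many GPU threads.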

  7. The Dynamo package for tomography and subtomogram averaging: components for MATLAB, GPU computing and EC2 Amazon Web Services.

    Castaño-Díez, Daniel

    2017-06-01

    Dynamo is a package for the processing of tomographic data. As a tool for subtomogram averaging, it includes different alignment and classification strategies. Furthermore, its data-management module allows experiments to be organized in groups of tomograms, while offering specialized three-dimensional tomographic browsers that facilitate visualization, location of regions of interest, modelling and particle extraction in complex geometries. Here, a technical description of the package is presented, focusing on its diverse strategies for optimizing computing performance. Dynamo is built upon mbtools (middle layer toolbox), a general-purpose MATLAB library for object-oriented scientific programming specifically developed to underpin Dynamo but usable as an independent tool. Its structure intertwines a flexible MATLAB codebase with precompiled C++ functions that carry the burden of numerically intensive operations. The package can be delivered as a precompiled standalone ready for execution without a MATLAB license. Multicore parallelization on a single node is directly inherited from the high-level parallelization engine provided for MATLAB, automatically imparting a balanced workload among the threads in computationally intense tasks such as alignment and classification, but also in logistic-oriented tasks such as tomogram binning and particle extraction. Dynamo supports the use of graphical processing units (GPUs), yielding considerable speedup factors both for native Dynamo procedures (such as the numerically intensive subtomogram alignment) and procedures defined by the user through its MATLAB-based GPU library for three-dimensional operations. Cloud-based virtual computing environments supplied with a pre-installed version of Dynamo can be publicly accessed through the Amazon Elastic Compute Cloud (EC2), enabling users to rent GPU computing time on a pay-as-you-go basis, thus avoiding upfront investments in hardware and long-term software maintenance.

  8. Performance analysis and optimization of an advanced pharmaceutical wastewater treatment plant through a visual basic software tool (PWWT.VB).

    Pal, Parimal; Thakura, Ritwik; Chakrabortty, Sankha

    2016-05-01

    A user-friendly, menu-driven simulation software tool has been developed for the first time to optimize and analyze the system performance of an advanced continuous membrane-integrated pharmaceutical wastewater treatment plant. The software allows pre-analysis and manipulation of input data which helps in optimization and shows the software performance visually on a graphical platform. Moreover, the software helps the user to "visualize" the effects of the operating parameters through its model-predicted output profiles. The software is based on a dynamic mathematical model, developed for a systematically integrated forward osmosis-nanofiltration process for removal of toxic organic compounds from pharmaceutical wastewater. The model-predicted values have been observed to corroborate well with the extensive experimental investigations which were found to be consistent under varying operating conditions like operating pressure, operating flow rate, and draw solute concentration. Low values of the relative error (RE = 0.09) and high values of the Willmott d-index (d_will = 0.981) reflected a high degree of accuracy and reliability of the software. This software is likely to be a very efficient tool for system design or simulation of an advanced membrane-integrated treatment plant for hazardous wastewater.
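
    For readers unfamiliar with the two agreement measures quoted above, the sketch below shows how they are commonly computed. It assumes the standard Willmott index of agreement and a mean absolute relative error; the abstract does not state which exact definitions the authors used, and the function names are hypothetical.

```python
def relative_error(observed, predicted):
    """Mean of |P - O| / |O| (one common definition; assumed here)."""
    terms = [abs(p - o) / abs(o) for o, p in zip(observed, predicted)]
    return sum(terms) / len(terms)


def willmott_d(observed, predicted):
    """Willmott index of agreement: 1 means perfect agreement."""
    o_mean = sum(observed) / len(observed)
    num = sum((p - o) ** 2 for o, p in zip(observed, predicted))
    den = sum((abs(p - o_mean) + abs(o - o_mean)) ** 2
              for o, p in zip(observed, predicted))
    return 1.0 - num / den
```

    Both functions take paired lists of measured and model-predicted values, for example `willmott_d(measured_flux, predicted_flux)` with hypothetical variable names.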

  9. Cost-effective GPU-grid for genome-wide epistasis calculations.

    Pütz, B; Kam-Thong, T; Karbalai, N; Altmann, A; Müller-Myhsok, B

    2013-01-01

    Until recently, genotype studies were limited to the investigation of single SNP effects due to the computational burden incurred when studying pairwise interactions of SNPs. However, some genetic effects as simple as coloring (in plants and animals) cannot be ascribed to a single locus but only understood when epistasis is taken into account [1]. It is expected that such effects are also found in complex diseases where many genes contribute to the clinical outcome of affected individuals. Only recently have such problems become feasible computationally. The inherently parallel structure of the problem makes it a perfect candidate for massive parallelization on either grid or cloud architectures. Since we are also dealing with confidential patient data, we were not able to consider a cloud-based solution but had to find a way to process the data in-house and aimed to build a local GPU-based grid structure. Sequential epistasis calculations were ported to GPU using CUDA at various levels. Parallelization on the CPU was compared to corresponding GPU counterparts with regard to performance and cost. A cost-effective solution was created by combining custom-built nodes equipped with relatively inexpensive consumer-level graphics cards with highly parallel GPUs in a local grid. The GPU method outperforms current cluster-based systems on a price/performance criterion, as a single GPU shows speed performance comparable to up to 200 CPU cores. The outlined approach will work for problems that easily lend themselves to massive parallelization. Code for various tasks has been made available and ongoing development of tools will further ease the transition from sequential to parallel algorithms.
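
    The computational burden mentioned above comes from the quadratic growth in the number of SNP pairs. The sketch below makes that arithmetic concrete; the function names and the SNP identifiers are hypothetical, and the panel size in the usage note is an assumed example rather than a figure from the paper.

```python
from itertools import combinations


def pair_count(n_snps):
    """Number of unordered SNP pairs: n * (n - 1) / 2."""
    return n_snps * (n_snps - 1) // 2


def snp_pairs(snp_ids):
    """Lazily enumerate every unordered pair; each pair is an independent task."""
    return combinations(snp_ids, 2)


pairs = list(snp_pairs(["rs1", "rs2", "rs3", "rs4"]))  # hypothetical IDs
```

    For an assumed panel of 500,000 SNPs, `pair_count(500_000)` gives 124,999,750,000 pairwise tests, and because no pair depends on another, the work splits cleanly across GPU threads or grid nodes.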

  10. Performance Analysis of Congestion Control Mechanism in Software Defined Network (SDN)

    Rahman M. Z. A.

    2017-01-01

    In the near future, traditional network architectures will become difficult to manage. Hence, Software Defined Network (SDN) will be an alternative in the future of programmable networks to replace the conventional network architecture. The main idea of the SDN architecture is to separate the forwarding plane and control plane of the network system, so that network operators can program packet forwarding behaviour to improve network performance. Congestion control is an important mechanism for network traffic to improve network capability and achieve high-end Quality of Service (QoS). In this paper, extensive simulation is conducted to analyse the performance of SDN by implementing the Link Layer Discovery Protocol (LLDP) under a congested network. The simulation was conducted on Mininet by creating four different fanouts, and the results were analysed based on differences in the performance metrics. As a result, packet loss and throughput reduction were observed when the number of fanouts in the topology was increased. By using LLDP, a large reduction in packet loss rate was achieved while maximizing the packet delivery ratio.
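
    The two metrics named at the end of the abstract are complementary. A minimal sketch, assuming the usual definitions based on counts of sent and received packets (the abstract does not spell them out, and the function names are hypothetical):

```python
def packet_loss_rate(sent, received):
    """Fraction of sent packets that never arrived."""
    return (sent - received) / sent


def packet_delivery_ratio(sent, received):
    """Fraction of sent packets that arrived; equals 1 - loss rate."""
    return received / sent
```

    With illustrative counts of 1000 packets sent and 950 received, the loss rate is 0.05 and the delivery ratio is 0.95.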

  11. Parallel GPU implementation of iterative PCA algorithms.

    Andrecut, M

    2009-11-01

    Principal component analysis (PCA) is a key statistical technique for multivariate data analysis. For large data sets, the common approach to PCA computation is based on the standard NIPALS-PCA algorithm, which unfortunately suffers from loss of orthogonality, and therefore its applicability is usually limited to the estimation of the first few components. Here we present an algorithm based on Gram-Schmidt orthogonalization (called GS-PCA), which eliminates this shortcoming of NIPALS-PCA. Also, we discuss the GPU (Graphics Processing Unit) parallel implementation of both NIPALS-PCA and GS-PCA algorithms. The numerical results show that the GPU parallel optimized versions, based on CUBLAS (NVIDIA), are substantially faster (up to 12 times) than the CPU optimized versions based on CBLAS (GNU Scientific Library).
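
    To show why explicit orthogonalization helps, the sketch below extracts principal directions one at a time by power iteration and re-orthogonalizes each new direction against the earlier ones with Gram-Schmidt. It is a simplified CPU illustration of the general idea, not the authors' GS-PCA or their CUBLAS implementation; all names are hypothetical and the input is assumed to be mean-centered.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def gs_pca(rows, n_components, n_iter=200):
    """Return unit-length loading vectors for mean-centered samples in `rows`."""
    n_features = len(rows[0])
    loadings = []
    for k in range(n_components):
        # Deterministic start vector that differs for each component.
        p = [1.0 if j == k % n_features else 0.5 for j in range(n_features)]
        for _ in range(n_iter):
            scores = [dot(r, p) for r in rows]  # t = X p
            p = [sum(t * r[j] for t, r in zip(scores, rows))
                 for j in range(n_features)]    # p = X^T t
            for q in loadings:                  # Gram-Schmidt step
                c = dot(p, q)
                p = [a - c * b for a, b in zip(p, q)]
            p = normalize(p)
        loadings.append(p)
    return loadings
```

    Because every new direction is projected away from the ones already found, the loadings stay orthogonal no matter how many components are requested; the two matrix-vector products per iteration are the part that a BLAS library on a GPU would accelerate.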

  12. GPU seeks new funding for TMI cleanup

    Utroska, D.

    1982-01-01

    General Public Utilities (GPU) wants approval for annual transfer of money from base rate increases in other accounts to pay for the cleanup at Three Mile Island (TMI) until TMI-1 returns to service or the public utility commission takes further action. This proposal confirms fears of a delay in TMI-1 startup and demonstrates that the January negotiated settlement will produce little funding for TMI-2 cleanup. A review of the settlement terms outlines the three-step process for base rate increases and revenue adjustments after the startup of TMI-1, and points out where controversy and delays due to psychological stress make a new source of money essential. GPU thinks customer funding will motivate other parties to a broad-based cost-sharing agreement

  13. GPU Linear Algebra Libraries and GPGPU Programming for Accelerating MOPAC Semiempirical Quantum Chemistry Calculations.

    Maia, Julio Daniel Carvalho; Urquiza Carvalho, Gabriel Aires; Mangueira, Carlos Peixoto; Santana, Sidney Ramos; Cabral, Lucidio Anjos Formiga; Rocha, Gerd B

    2012-09-11

    In this study, we present some modifications in the semiempirical quantum chemistry MOPAC2009 code that accelerate single-point energy calculations (1SCF) of medium-size (up to 2500 atoms) molecular systems using GPU coprocessors and multithreaded shared-memory CPUs. Our modifications consisted of using a combination of highly optimized linear algebra libraries for both CPU (LAPACK and BLAS from Intel MKL) and GPU (MAGMA and CUBLAS) to hasten time-consuming parts of MOPAC such as the pseudodiagonalization, full diagonalization, and density matrix assembling. We have shown that it is possible to obtain large speedups just by using CPU serial linear algebra libraries in the MOPAC code. As a special case, we show a speedup of up to 14 times for a methanol simulation box containing 2400 atoms and 4800 basis functions, with even greater gains in performance when using multithreaded CPUs (2.1 times in relation to the single-threaded CPU code using linear algebra libraries) and GPUs (3.8 times). This degree of acceleration opens new perspectives for modeling larger structures which appear in inorganic chemistry (such as zeolites and MOFs), biochemistry (such as polysaccharides, small proteins, and DNA fragments), and materials science (such as nanotubes and fullerenes). In addition, we believe that this parallel (GPU-GPU) MOPAC code will make it feasible to use semiempirical methods in lengthy molecular simulations using both hybrid QM/MM and QM/QM potentials.

  14. Accelerating image reconstruction in dual-head PET system by GPU and symmetry properties.

    Cheng-Ying Chou

    Positron emission tomography (PET) is an important imaging modality in both clinical usage and research studies. We have developed a compact high-sensitivity PET system that consists of two large-area panel PET detector heads, which produce more than 224 million lines of response and thus impose heavy computational demands. In this work, we employed a state-of-the-art graphics processing unit (GPU), the NVIDIA Tesla C2070, to yield an efficient reconstruction process. Our approaches integrate the symmetry properties of the imaging system with features of the GPU architecture, including block/warp/thread assignments and effective memory usage, to accelerate the computations for ordered subset expectation maximization (OSEM) image reconstruction. The OSEM reconstruction algorithms were implemented in both CPU-based and GPU-based codes, and their computational performance was quantitatively analyzed and compared. The results showed that the GPU-accelerated scheme can drastically reduce the reconstruction time and thus can largely expand the applicability of the dual-head PET system.
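
    For orientation, the following is a minimal CPU sketch of the textbook OSEM update on a tiny dense system matrix. It does not model the symmetry properties or the GPU thread mapping described in the abstract; the function name, the toy matrix and the subset layout are assumptions made for illustration.

```python
def osem(A, y, subsets, n_iter=50):
    """Textbook OSEM.

    A: system matrix as a list of rows, one per line of response.
    y: measured counts, one per row of A.
    subsets: list of lists of row indices processed one after another.
    """
    n_pixels = len(A[0])
    x = [1.0] * n_pixels  # uniform, strictly positive start image
    for _ in range(n_iter):
        for subset in subsets:
            # Forward-project the current image along this subset's rows.
            proj = {i: sum(a * xj for a, xj in zip(A[i], x)) for i in subset}
            for j in range(n_pixels):
                sens = sum(A[i][j] for i in subset)  # subset sensitivity
                if sens == 0:
                    continue
                back = sum(A[i][j] * y[i] / proj[i]
                           for i in subset if proj[i] > 0)
                x[j] *= back / sens  # multiplicative update
    return x
```

    With a single subset that holds every row, the loop reduces to ordinary MLEM; splitting the rows into several subsets updates the image more often per pass over the data, which is where the method gets its speed.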

  15. Fast Simulation of Large-Scale Floods Based on GPU Parallel Computing

    Qiang Liu

    2018-05-01

    Computing speed is a significant issue for large-scale flood simulations used in real-time response for disaster prevention and mitigation. Even today, most large-scale flood simulations are run on supercomputers due to the massive amounts of data and computation necessary. In this work, a two-dimensional shallow water model based on an unstructured Godunov-type finite volume scheme was proposed for flood simulation. To realize a fast simulation of large-scale floods on a personal computer, a Graphics Processing Unit (GPU)-based, high-performance computing method using OpenACC was adopted to parallelize the shallow water model. An unstructured data management method was presented to control the data transfer between the GPU and the CPU (Central Processing Unit) with minimum overhead, and then both computation and data were offloaded from the CPU to the GPU, which exploited the computational capability of the GPU as much as possible. The parallel model was validated using various benchmarks and real-world case studies. The results demonstrate that speed-ups of up to one order of magnitude can be achieved in comparison with the serial model. The proposed parallel model provides a fast and reliable tool with which to quickly assess flood hazards in large-scale areas and, thus, has a bright application prospect for dynamic inundation risk identification and disaster assessment.

  16. Large Scale Simulations of the Euler Equations on GPU Clusters

    Liebmann, Manfred

    2010-08-01

    The paper investigates the scalability of a parallel Euler solver, using the Vijayasundaram method, on a GPU cluster with 32 NVIDIA GeForce GTX 295 boards. The aim of this research is to enable large scale fluid dynamics simulations with up to one billion elements. We investigate communication protocols for the GPU cluster to compensate for the slow Gigabit Ethernet network between the GPU compute nodes and to maintain overall efficiency. A diesel engine intake-port and a nozzle, meshed in different resolutions, give good real world examples for the scalability tests on the GPU cluster. © 2010 IEEE.

  17. RUMD: A general purpose molecular dynamics package optimized to utilize GPU hardware down to a few thousand particles

    Nicholas P. Bailey, Trond S. Ingebrigtsen, Jesper Schmidt Hansen, Arno A. Veldhorst, Lasse Bøhling, Claire A. Lemarchand, Andreas E. Olsen, Andreas K. Bacher, Lorenzo Costigliola, Ulf R. Pedersen, Heine Larsen, Jeppe C. Dyre, Thomas B. Schrøder

    2017-12-01

    RUMD is a general purpose, high-performance molecular dynamics (MD) simulation package running on graphical processing units (GPUs). RUMD addresses the challenge of utilizing the many-core nature of modern GPU hardware when simulating small to medium system sizes (roughly from a few thousand up to a hundred thousand particles). It has a performance that is comparable to other GPU-MD codes at large system sizes and substantially better at smaller sizes. RUMD is open-source and consists of a library written in C++ and the CUDA extension to C, an easy-to-use Python interface, and a set of tools for set-up and post-simulation data analysis. The paper describes RUMD's main features, optimizations and performance benchmarks.

  18. Solving global optimization problems on GPU cluster

    Barkalov, Konstantin; Gergel, Victor; Lebedev, Ilya [Lobachevsky State University of Nizhni Novgorod, Gagarin Avenue 23, 603950 Nizhni Novgorod (Russian Federation)

    2016-06-08

    The paper contains the results of an investigation of a parallel global optimization algorithm combined with a dimension reduction scheme. This allows multidimensional problems to be solved by reducing them to data-independent subproblems of smaller dimension that are solved in parallel. The new element implemented in the research consists in using several graphics accelerators at different computing nodes. The paper also includes results of solving problems of the well-known multiextremal test class GKLS on the Lobachevsky supercomputer using tens of thousands of GPU cores.

  19. Bayer image parallel decoding based on GPU

    Hu, Rihui; Xu, Zhiyong; Wei, Yuxing; Sun, Shaohua

    2012-11-01

    In the photoelectrical tracking system, Bayer images are traditionally decompressed on the CPU. However, this is too slow when the images become large, for example 2K×2K×16bit. To accelerate Bayer image decoding, this paper introduces a parallel speedup method for NVIDIA Graphics Processing Units (GPUs) that support the CUDA architecture. The decoding procedure can be divided into three parts: a serial part, a task-parallel part, and a data-parallel part that includes inverse quantization, the inverse discrete wavelet transform (IDWT) and image post-processing. To reduce the execution time, the task-parallel part is optimized with OpenMP. The data-parallel part gains efficiency by executing on the GPU as a CUDA parallel program. The optimization techniques include instruction optimization, shared memory access optimization, coalesced memory access optimization and texture memory optimization. In particular, the IDWT can be sped up significantly by rewriting the 2D (two-dimensional) serial IDWT as a 1D parallel IDWT. In experiments with a 1K×1K×16bit Bayer image, the data-parallel part is more than 10 times faster than the CPU-based implementation. Finally, a CPU+GPU heterogeneous decompression system was designed. The experimental results show that it achieves a 3 to 5 times speed increase compared to the CPU serial method.
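
    The gap between a more than 10 times faster data-parallel part and a 3 to 5 times faster complete system is what Amdahl's law would predict when part of the pipeline stays serial. The sketch below is only that generic formula; the 80% fraction in the usage note is a hypothetical input, not a measurement from the paper, and the paper's OpenMP-optimized task-parallel part is not modelled.

```python
def overall_speedup(accelerated_fraction, part_speedup):
    """Amdahl's law: whole-program speedup when only a fraction is accelerated."""
    serial_fraction = 1.0 - accelerated_fraction
    return 1.0 / (serial_fraction + accelerated_fraction / part_speedup)
```

    For example, if a hypothetical 80% of the original runtime is accelerated 10 times, `overall_speedup(0.8, 10)` is about 3.57, so the untouched serial work quickly dominates.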

  20. On integrating modeling software for application to total-system performance assessment

    Lewis, L.C.; Wilson, M.L.

    1994-05-01

    We examine the processes and methods used to facilitate collaboration in software development between two organizations at separate locations: Lawrence Livermore National Laboratory (LLNL) in California and Sandia National Laboratories (SNL) in New Mexico. Our software development process integrated the efforts of these two laboratories. Software developed at LLNL to model corrosion and failure of waste packages and subsequent releases of radionuclides was incorporated as a source term into SNL's computer models for fluid flow and radionuclide transport through the geosphere.

  1. Systematic approach in optimizing numerical memory-bound kernels on GPU

    Abdelfattah, Ahmad

    2013-01-01

    The use of GPUs has been very beneficial in accelerating dense linear algebra (DLA) computational kernels. Many high performance numerical libraries like CUBLAS, MAGMA, and CULA provide BLAS and LAPACK implementations on GPUs as well as hybrid computations involving both CPUs and GPUs. GPUs usually achieve better performance than CPUs for compute-bound operations, especially those characterized by a regular data access pattern. This paper highlights a systematic approach for efficiently implementing memory-bound DLA kernels on GPUs, by taking advantage of the underlying device's architecture (e.g., high throughput). This methodology proved to outperform existing state-of-the-art GPU implementations for the symmetric matrix-vector multiplication (SYMV), characterized by an irregular data access pattern, in a recent work (Abdelfattah et al., VECPAR 2012). We propose to extend this methodology to the general matrix-vector multiplication (GEMV) kernel. The performance results show that our GEMV implementation achieves better performance for relatively small to medium matrix sizes, making it very influential in calculating the Hessenberg and bidiagonal reductions of general matrices (radar applications), which are the first step toward computing eigenvalues and singular values, respectively. Considering small and medium size matrices (≤4500), our GEMV kernel achieves an average 60% improvement in single precision (SP) and an average 25% in double precision (DP) over existing open-source and commercial software solutions. These results improve reduction algorithms for both small and large matrices. The improved GEMV performance yields an average improvement of 30% (SP) and 15% (DP) in the Hessenberg reduction and up to 25% (SP) and 14% (DP) in the bidiagonal reduction over the implementation provided by CUBLAS 5.0. © 2013 Springer-Verlag.
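
    To see why GEMV counts as memory-bound, it helps to compare the arithmetic it performs with the data it moves. The sketch below is a plain reference GEMV plus a rough arithmetic-intensity estimate; it assumes every matrix and vector element crosses the memory bus exactly once, and it is not the kernel described in the paper. The function names are hypothetical.

```python
def gemv(alpha, A, x, beta, y):
    """Reference GEMV: returns alpha * A @ x + beta * y for a row-major A."""
    return [alpha * sum(a * xj for a, xj in zip(row, x)) + beta * yi
            for row, yi in zip(A, y)]


def arithmetic_intensity(m, n, bytes_per_element):
    """Flops per byte, assuming each element is moved exactly once."""
    flops = 2 * m * n                                  # one multiply and one add per entry
    bytes_moved = (m * n + n + 2 * m) * bytes_per_element  # A, x, y read, y written
    return flops / bytes_moved
```

    Under these assumptions a 4500 by 4500 single-precision GEMV performs just under 0.5 flops per byte, so its speed is set mainly by memory bandwidth, which is why access patterns matter more than raw arithmetic throughput for this kernel.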

  2. PaRSEC: A Software Framework for Performance and Productivity on Hybrid, Manycore Platforms

    Dongarra, Jack [Univ. of Tennessee, Knoxville, TN (United States)

    2016-06-30

    As the era of computer architectures dominated by serial processors ends, the convergence of several unprecedented challenges suggests that closing the longstanding "application–architecture performance gap" will become more challenging than ever. To address this problem, the Parallel Runtime Scheduling and Execution Control (PaRSEC) project created a modular software framework that achieved two major objectives: first, it built a task-based runtime capable of delivering portable performance to a wide range of science and engineering applications at all levels of the platform pyramid, including the upcoming 100 Pflop/s systems and then exascale; and second, it supported and facilitated the work of developers in migrating their legacy codes and writing entirely new ones for the emerging hybrid and massively parallel manycore processor system designs. PaRSEC will support multiple domain-specific languages capable of increasing the developers' productivity while also providing the runtime with the constructs and flexibility necessary to exploit the maximal parallelism from parallel applications. Extensive preliminary research in dense linear algebra showed convincingly that a parameterized task graph representation that symbolically describes the algorithm content can achieve the project's twofold objective within that domain. The research also strongly suggested that this powerful method could be generalized to a far-wider variety of applications.

  3. Software Application Profile: PHESANT: a tool for performing automated phenome scans in UK Biobank.

    Millard, Louise A C; Davies, Neil M; Gaunt, Tom R; Davey Smith, George; Tilling, Kate

    2017-10-05

    Epidemiological cohorts typically contain a diverse set of phenotypes such that automation of phenome scans is non-trivial, because they require highly heterogeneous models. For this reason, phenome scans have to date tended to use a smaller homogeneous set of phenotypes that can be analysed in a consistent fashion. We present PHESANT (PHEnome Scan ANalysis Tool), a software package for performing comprehensive phenome scans in UK Biobank. PHESANT tests the association of a specified trait with all continuous, integer and categorical variables in UK Biobank, or a specified subset. PHESANT uses a novel rule-based algorithm to determine how to appropriately test each trait, then performs the analyses and produces plots and summary tables. The PHESANT phenome scan is implemented in R. PHESANT includes a novel Javascript D3.js visualization and accompanying Java code that converts the phenome scan results to the required JavaScript Object Notation (JSON) format. PHESANT is available on GitHub at [https://github.com/MRCIEU/PHESANT]. Git tag v0.5 corresponds to the version presented here. © The Author 2017. Published by Oxford University Press on behalf of the International Epidemiological Association.

  4. Performance assessment of the commercial CFD software for the prediction of the PWR internal flow - Corrected version

    Lee, Gong Hee; Bang, Young Seok; Woo, Sweng Woong; Cheong, Ae Ju; Kim, Do Hyeong; Kang, Min Ku

    2013-01-01

    As computer hardware technology develops, license applicants for nuclear power plants use commercial CFD software with the aim of reducing the excessive conservatism associated with using simplified and conservative analysis tools. Even if some CFD software developers and users think that state-of-the-art CFD software can be used to solve at least single-phase nuclear reactor safety problems reasonably well, there are still limitations and uncertainties in the calculation results. From a regulatory perspective, the Korea Institute of Nuclear Safety (KINS) has been conducting a performance assessment of commercial CFD software for nuclear reactor safety problems. In this study, in order to examine the prediction performance of commercial CFD software with the porous model in the analysis of the scaled-down APR+ (Advanced Power Reactor Plus) internal flow, simulation was conducted with the on-board numerical models in ANSYS CFX R.14 and FLUENT R.14. It was concluded that, depending on the CFD software, the internal flow distribution of the scaled-down APR+ was locally somewhat different. Although there was a limitation in estimating the prediction performance of the commercial CFD software due to the limited number of measured data, CFX R.14 showed more reasonable predicted results in comparison with FLUENT R.14. Meanwhile, due to the difference in discretization methodology, FLUENT R.14 required more computational memory than CFX R.14 for the same grid system. Therefore, the CFD software suited to the available computational resources should be selected for massively parallel computation. (authors)

  5. AFOSR BRI: Co-Design of Hardware/Software for Predicting MAV Aerodynamics

    2016-09-27

    fold was extracted when applying architecture-aware GPU optimizations, resulting in a 371-fold speed-up. By also leveraging algorithmic innovation...mind the strengths of the underlying hardware architecture. Some examples include a block-sparse linear solver. • Characterization of performance...GPU, NVIDIA GPU, and Intel Xeon Phi. • Creation of a prototypical runtime system called CoreTSAR, short for Core Task-Size Adapting Runtime, that

  6. Software Accelerates Computing Time for Complex Math

    2014-01-01

    Ames Research Center awarded Newark, Delaware-based EM Photonics Inc. SBIR funding to utilize graphics processing unit (GPU) technology, traditionally used for computer video games, to develop high-computing software called CULA. The software gives users the ability to run complex algorithms on personal computers with greater speed. As a result of the NASA collaboration, the number of employees at the company has increased 10 percent.

  7. GPU-based fast pencil beam algorithm for proton therapy

    Fujimoto, Rintaro; Nagamine, Yoshihiko; Kurihara, Tsuneya

    2011-01-01

    Performance of a treatment planning system is an essential factor in making sophisticated plans. The dose calculation is a major time-consuming process in planning operations. The standard algorithm for proton dose calculations is the pencil beam algorithm which produces relatively accurate results, but is time consuming. In order to shorten the computational time, we have developed a GPU (graphics processing unit)-based pencil beam algorithm. We have implemented this algorithm and calculated dose distributions in the case of a water phantom. The results were compared to those obtained by a traditional method with respect to the computational time and discrepancy between the two methods. The new algorithm shows 5-20 times faster performance using the NVIDIA GeForce GTX 480 card in comparison with the Intel Core-i7 920 processor. The maximum discrepancy of the dose distribution is within 0.2%. Our results show that GPUs are effective for proton dose calculations.

  8. Commercial Building Energy Baseline Modeling Software: Performance Metrics and Method Testing with Open Source Models and Implications for Proprietary Software Testing

    Price, Phillip N.; Granderson, Jessica; Sohn, Michael; Addy, Nathan; Jump, David

    2013-09-01

    The overarching goal of this work is to advance the capabilities of technology evaluators in evaluating the building-level baseline modeling capabilities of Energy Management and Information System (EMIS) software. Through their customer engagement platforms and products, EMIS software products have the potential to produce whole-building energy savings through multiple strategies: building system operation improvements, equipment efficiency upgrades and replacements, and inducement of behavioral change among the occupants and operations personnel. Some offerings may also automate the quantification of whole-building energy savings, relative to a baseline period, using empirical models that relate energy consumption to key influencing parameters, such as ambient weather conditions and building operation schedule. These automated baseline models can be used to streamline the whole-building measurement and verification (M&V) process, and therefore are of critical importance in the context of multi-measure whole-building focused utility efficiency programs. This report documents the findings of a study that was conducted to begin answering critical questions regarding quantification of savings at the whole-building level, and the use of automated and commercial software tools. To evaluate the modeling capabilities of EMIS software particular to the use case of whole-building savings estimation, four research questions were addressed: 1. What is a general methodology that can be used to evaluate baseline model performance, both in terms of a) overall robustness, and b) relative to other models? 2. How can that general methodology be applied to evaluate proprietary models that are embedded in commercial EMIS tools? How might one handle practical issues associated with data security, intellectual property, appropriate testing ‘blinds’, and large data sets? 3. How can buildings be pre-screened to identify those that are the most model-predictable, and therefore those

  9. Performance Test of Openflow Agent on Openflow Software-Based Mikrotik RB750 Switch

    Rikie Kartadie

    2016-11-01

    Full Text Available A network is usually built from several devices such as routers and switches. Every device forwards and manipulates data packets using complicated protocols embedded in its hardware. An operator is responsible for configuring the rules or applications applied in the network. Human error may occur when device configuration is done manually by an operator. Some well-known vendors, MikroTik among them, have also been implementing OpenFlow in their products. It provides an implementation of the SDN/OpenFlow architecture at an affordable cost. The second-phase research results showed that the software-based OpenFlow (OF) switch on MikroTik had higher latency than both Mininet and the software-based OF switch on OpenWRT. The average gap value of the MikroTik software-based OF switch is 2012 kbps lower than that of the OpenWRT software-based OF switch. The average UDP throughput of the MikroTik software-based OF switch is 3.6176 kBps lower than that of the OpenWRT software-based OF switch and 8.68 kBps lower than Mininet. The average UDP jitter of the MikroTik software-based OF switch is 0.0103 ms lower than that of the OpenWRT software-based OF switch and 0.0093 ms lower than Mininet.

  10. Performance of a Code Migration for the Simulation of Supersonic Ejector Flow to SMP, MIC, and GPU Using OpenMP, OpenMP+LEO, and OpenACC Directives

    C. Couder-Castañeda

    2015-01-01

    Full Text Available A serial source code for simulating a supersonic ejector flow is accelerated using parallelization based on OpenMP and OpenACC directives. The purpose is to reduce the development costs and to simplify the maintenance of the application due to the complexity of the FORTRAN source code. This research follows well-proven strategies in order to obtain the best performance in both OpenMP and OpenACC. OpenMP has become the programming standard for scientific multicore software and OpenACC is one true alternative for graphics accelerators without the need of programming low level kernels. The strategies using OpenMP are oriented towards reducing the creation of parallel regions, tasks creation to handle boundary conditions, and a nested control of the loop time for the programming in offload mode specifically for the Xeon Phi. In OpenACC, the strategy focuses on maintaining the data regions among the executions of the kernels. Experiments for performance and validation are conducted here on a 12-core Xeon CPU, Xeon Phi 5110p, and Tesla C2070, obtaining the best performance from the latter. The Tesla C2070 presented an acceleration factor of 9.86X, 1.6X, and 4.5X compared against the serial version on CPU, 12-core Xeon CPU, and Xeon Phi, respectively.

  11. Performance verification of network function virtualization in software defined optical transport networks

    Zhao, Yongli; Hu, Liyazhou; Wang, Wei; Li, Yajie; Zhang, Jie

    2017-01-01

    With the continuous opening of resource acquisition and application, a large variety of network hardware appliances are deployed as the communication infrastructure. Launching a new network application always implies replacing the obsolete devices and requires the related space and power to accommodate it, which increases the energy and capital investment. Network function virtualization (NFV) aims to address these problems by consolidating many types of network equipment onto industry standard elements such as servers, switches and storage. Many types of IT resources have been deployed to run Virtual Network Functions (vNFs), such as virtual switches and routers. How to deploy NFV in optical transport networks is therefore a problem of great importance. This paper focuses on this problem, and gives an implementation architecture of NFV-enabled optical transport networks based on Software Defined Optical Networking (SDON) with the procedure of vNFs call and return. In particular, an implementation solution of an NFV-enabled optical transport node is designed, and a parallel processing method for NFV-enabled OTN nodes is proposed. To verify the performance of NFV-enabled SDON, the protocol interaction procedures of control function virtualization and node function virtualization are demonstrated on an SDON testbed. Finally, the benefits and challenges of the parallel processing method for NFV-enabled OTN nodes are simulated and analyzed.

  12. Design of the Jet Performance Software for the ATLAS Experiment at LHC

    Doglioni, C; The ATLAS collaboration; Loch, P; Perez, K; Vitillo, RA

    2011-01-01

    This paper describes the design and implementation of the JetFramework, a software tool developed for the data analysis of the ATLAS experiment at CERN. JetFramework is based on Athena, an object oriented framework for data processing. The JetFramework Athena package implements a configurable data-flow graph (DFG) to represent an analysis. Each node of the graph can perform some computation on one or more input particle collections. A standard set of nodes to retrieve, filter, sort and plot collections is provided. Users can also implement their own computation units inheriting from a generic interface. The analysis graph can be declared and configured in an Athena options file. To provide the requested flexibility to configure nodes from a configuration file, a simple expression language allows selection and plotting criteria to be specified. Viewing an analysis as an explicit DFG permits end-users to avoid writing code for repetitive tasks and to reuse user-defined computation units in other analysis...

  13. Minerva: using a software program to improve resident performance during independent call

    Itri, Jason N.; Redfern, Regina O.; Cook, Tessa; Scanlon, Mary H.

    2010-03-01

    We have developed an application called Minerva that allows tracking of resident discrepancy rates and missed cases. Minerva mines the radiology information system (RIS) for preliminary interpretations provided by residents during independent call and copies both the preliminary and final interpretations to a database. Both versions are displayed for direct comparison by Minerva and classified as 'in agreement', 'minor discrepancy' or 'major discrepancy' by the resident program director. Minerva compiles statistics comparing minor, major and total discrepancy rates for individual residents relative to the overall group. Discrepant cases are categorized according to date, modality and body part and reviewed for trends in missed cases. The rate of minor, major and total discrepancies for residents on-call at our institution was similar to rates previously published, including a 2.4% major discrepancy rate for second year radiology residents in the DePICTORS study and a 2.6% major discrepancy rate for residents at a community hospital. Trend analysis of missed cases was used to generate a topic-specific resident missed case conference on acromioclavicular (AC) joint separation injuries, which resulted in a 75% decrease in the number of missed cases related to AC separation subsequent to the conference. Using a software program to track minor and major discrepancy rates for residents taking independent call using modified RadPeer scoring guidelines provides a competency-based metric to determine resident performance. Topic-specific conferences using the cases identified by Minerva can result in a decrease in missed cases.

  14. GPU-based simulation of optical propagation through turbulence for active and passive imaging

    Monnier, Goulven; Duval, François-Régis; Amram, Solène

    2014-10-01

    IMOTEP is a GPU-based (Graphical Processing Units) software relying on a fast parallel implementation of Fresnel diffraction through successive phase screens. Its applications include active imaging, laser telemetry and passive imaging through turbulence with anisoplanatic spatial and temporal fluctuations. Thanks to parallel implementation on GPU, speedups ranging from 40X to 70X are achieved. The present paper gives a brief overview of IMOTEP models, algorithms, implementation and user interface. It then focuses on major improvements recently brought to the anisoplanatic imaging simulation method. Previously, we took advantage of the computational power offered by the GPU to develop a simulation method based on large series of deterministic realisations of the PSF distorted by turbulence. The phase screen propagation algorithm, by reproducing higher moments of the incident wavefront distortion, provides realistic PSFs. However, we first used a coarse Gaussian model to fit the numerical PSFs and characterise their spatial statistics through only 3 parameters (two-dimensional displacements of centroid and width). As a result, this approach was unable to reproduce the effects related to the details of the PSF structure, especially the "speckles" leading to prominent high-frequency content in short-exposure images. To overcome this limitation, we recently implemented a new empirical model of the PSF, based on Principal Components Analysis (PCA), intended to capture most of the PSF complexity. The GPU implementation allows estimating and handling efficiently the numerous (up to several hundreds) principal components typically required under the strong turbulence regime. A first demanding computational step involves PCA, phase screen propagation and covariance estimates. In a second step, realistic instantaneous images, fully accounting for anisoplanatic effects, are quickly generated. Preliminary results are presented.

  15. GPU-Vote: A Framework for Accelerating Voting Algorithms on GPU.

    Braak, van den G.J.W.; Nugteren, C.; Mesman, B.; Corporaal, H.; Kaklamanis, C.; Papatheodorou, T.; Spirakis, P.G.

    2012-01-01

    Voting algorithms, such as histogram and Hough transforms, are frequently used algorithms in various domains, such as statistics and image processing. Algorithms in these domains may be accelerated using GPUs. Implementing voting algorithms efficiently on a GPU however is far from trivial due to

  16. KBLAS: An Optimized Library for Dense Matrix-Vector Multiplication on GPU Accelerators

    Abdelfattah, Ahmad

    2016-05-11

    KBLAS is an open-source, high-performance library that provides optimized kernels for a subset of Level 2 BLAS functionalities on CUDA-enabled GPUs. Since performance of dense matrix-vector multiplication is hindered by the overhead of memory accesses, a double-buffering optimization technique is employed to overlap data motion with computation. After identifying a proper set of tuning parameters, KBLAS efficiently runs on various GPU architectures while avoiding code rewriting and retaining compliance with the standard BLAS API. Another optimization technique allows ensuring coalesced memory access when dealing with submatrices, especially for high-level dense linear algebra algorithms. All KBLAS kernels have been leveraged to a multi-GPU environment, which requires the introduction of new APIs. Considering general matrices, KBLAS is very competitive with existing state-of-the-art kernels and provides a smoother performance across a wide range of matrix dimensions. Considering symmetric and Hermitian matrices, the KBLAS performance outperforms existing state-of-the-art implementations on all matrix sizes and achieves asymptotically up to 50% and 60% speedup against the best competitor on single GPU and multi-GPU systems, respectively. Performance results also validate our performance model. A subset of KBLAS high-performance kernels have been integrated into NVIDIA's standard BLAS implementation (cuBLAS) for larger dissemination, starting from version 6.0. © 2016 ACM.

  17. Contributions to large scale and performance tests of the ATLAS online software

    Badescu, E.; Caprini, M.

    2003-01-01

    One of the sub-systems of the Trigger/DAQ system of the future ATLAS experiment is the Online Software system. It encompasses the functionality needed to configure, control and monitor the DAQ. Its architecture is based on a component structure described in the ATLAS Trigger/DAQ technical proposal. Online Software is responsible for control, supervision and internal communication, excluding the event data flow. For the final ATLAS experiment in 2006 it is expected that it will have to control up to 1000 processors. The core components are the run control, process manager, configuration database, inter process communication, message reporting system and information exchange system. The auxiliary components, namely resource manager, online bookkeeper and the integrated graphical user interface were in use for tests. All the components are unit tested for functionality, fault tolerance, performance and scalability. Extended functionality tests are performed at CERN and remote institutes before each official release. The test objective was the verification of the scalability of the system to a configuration containing a large number of nodes. The aim was to study the interaction between the components, to identify critical areas and to investigate the variation and optimization of online system parameters. The timing of the data acquisition transition phases was recorded and analysed. The information on all processes and their relationships, the run control hierarchy in the online system as well as startup and shutdown dependencies are defined in the configuration database data file. Timing measurements were performed for the transitions shown in the paper and defined as follows: Setup: start online server infrastructure; Close: remove online infrastructure; Boot: start all supervised processes; Shutdown: stop all supervised processes; Cold start: start the supervised processes and go to the Running state; Cold stop: reverse of the cold start phase; Luke warm start

  18. Comparison of GPU-Based Numerous Particles Simulation and Experiment

    Park, Sang Wook; Jun, Chul Woong; Sohn, Jeong Hyun; Lee, Jae Wook

    2014-01-01

    The dynamic behavior of numerous grains interacting with each other can be easily observed. In this study, this dynamic behavior was analyzed based on the contact between numerous grains. The discrete element method was used for analyzing the dynamic behavior of each particle and the neighboring-cell algorithm was employed for detecting their contact. The Hertzian and tangential sliding friction contact models were used for calculating the contact force acting between the particles. A GPU-based parallel program was developed for conducting the computer simulation and calculating the numerous contacts. The dam break experiment was performed to verify the simulation results. The reliability of the program was verified by comparing the results of the simulation with those of the experiment
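
    The contact-detection and contact-force steps described above can be sketched on the CPU as follows. This is a minimal illustration, not the authors' GPU program: it assumes equal-radius spheres of one material, a cell edge equal to the particle diameter, and the textbook Hertz normal-force formula; the function names are hypothetical, and the tangential sliding friction model is omitted.

```python
import math
from collections import defaultdict
from itertools import product


def find_contacts(positions, radius):
    """Return sorted index pairs (i, j), i < j, of overlapping equal spheres."""
    # Assumption: cell edge = particle diameter, so any two touching
    # particles lie in the same cell or in directly adjacent cells.
    cell = 2.0 * radius
    grid = defaultdict(list)
    for i, p in enumerate(positions):
        grid[tuple(math.floor(c / cell) for c in p)].append(i)

    contacts = set()
    for key, members in grid.items():
        for offset in product((-1, 0, 1), repeat=3):
            neighbor = tuple(k + o for k, o in zip(key, offset))
            # .get() avoids creating empty cells while iterating the grid
            for i in members:
                for j in grid.get(neighbor, ()):
                    if i < j and math.dist(positions[i], positions[j]) < 2.0 * radius:
                        contacts.add((i, j))
    return sorted(contacts)


def hertz_normal_force(distance, radius, young, poisson):
    """Hertzian normal force magnitude between two identical elastic spheres."""
    overlap = 2.0 * radius - distance
    if overlap <= 0.0:
        return 0.0
    e_eff = young / (2.0 * (1.0 - poisson ** 2))  # effective modulus, same material
    r_eff = radius / 2.0                          # effective radius, equal spheres
    return (4.0 / 3.0) * e_eff * math.sqrt(r_eff) * overlap ** 1.5
```

    Restricting the search to the 27 surrounding cells is what keeps the cost roughly linear in the number of particles; each cell (or each particle) is an independent unit of work, which is why the scheme suits a GPU.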

  19. GPU-computing in econophysics and statistical physics

    Preis, T.

    2011-03-01

    A recent trend in computer science and related fields is general purpose computing on graphics processing units (GPUs), which can yield impressive performance. With multiple cores connected by high memory bandwidth, today's GPUs offer resources for non-graphics parallel processing. This article provides a brief introduction into the field of GPU computing and includes examples. In particular computationally expensive analyses employed in financial market context are coded on a graphics card architecture which leads to a significant reduction of computing time. In order to demonstrate the wide range of possible applications, a standard model in statistical physics - the Ising model - is ported to a graphics card architecture as well, resulting in large speedup values.
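
    One reason the Ising model ports well to a graphics card can be shown with a checkerboard Metropolis sweep. The abstract does not say which update scheme was used, so the checkerboard decomposition here is an assumption, and the sketch below is plain sequential Python (coupling J = 1, no external field, even lattice size, periodic boundaries) rather than GPU code.

```python
import math
import random


def checkerboard_sweep(spins, beta, rng):
    """One Metropolis sweep of an L x L periodic Ising lattice (L even)."""
    size = len(spins)
    for color in (0, 1):
        # Sites of one color only have neighbors of the other color, so all
        # updates within a color are independent; on a GPU each site of the
        # active color could be handled by its own thread.
        for i in range(size):
            for j in range(size):
                if (i + j) % 2 != color:
                    continue
                neighbors = (spins[(i + 1) % size][j] + spins[(i - 1) % size][j]
                             + spins[i][(j + 1) % size] + spins[i][(j - 1) % size])
                delta_e = 2.0 * spins[i][j] * neighbors  # energy change of a flip
                if delta_e <= 0.0 or rng.random() < math.exp(-beta * delta_e):
                    spins[i][j] = -spins[i][j]


def magnetization(spins):
    """Mean spin per site."""
    size = len(spins)
    return sum(map(sum, spins)) / (size * size)
```

    At very low temperature (large `beta`) an ordered lattice stays ordered, while at `beta = 0` every proposed flip is accepted, which gives two simple sanity checks for an implementation.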

  20. A GPU-based mipmapping method for water surface visualization

    Li, Hua; Quan, Wei; Xu, Chao; Wu, Yan

    2018-03-01

    Visualization of water surfaces is a hot topic in computer graphics. In this paper, we present a fast method to generate a wide range of water surface with good image quality both near and far from the viewpoint. This method uses a uniform mesh and fractal Perlin noise to model the water surface. Mipmapping is applied to the surface textures, which adjusts the resolution with respect to the distance from the viewpoint and reduces the computing cost. Lighting effects are computed based on shadow mapping, Snell's law and the Fresnel term. The render pipeline uses a CPU-GPU shared memory structure, which improves the rendering efficiency. Experimental results show that our approach visualizes the water surface with good image quality at real-time frame rates.
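
    The two optical ingredients named above, Snell's law and the Fresnel term, can be sketched as follows. The abstract does not give the formulas the authors used; this sketch assumes an air-to-water interface with a refractive index of 1.33 and swaps in Schlick's approximation for the full Fresnel equations, a common choice in real-time rendering. Function and constant names are hypothetical.

```python
import math

N_AIR = 1.0
N_WATER = 1.33  # assumed refractive index of water


def refraction_angle(theta_i, n1=N_AIR, n2=N_WATER):
    """Snell's law: angle of the refracted ray in radians, or None on total internal reflection."""
    s = n1 / n2 * math.sin(theta_i)
    if abs(s) > 1.0:
        return None
    return math.asin(s)


def schlick_reflectance(theta_i, n1=N_AIR, n2=N_WATER):
    """Schlick's approximation of the Fresnel reflectance for unpolarized light."""
    r0 = ((n1 - n2) / (n1 + n2)) ** 2  # reflectance at normal incidence
    return r0 + (1.0 - r0) * (1.0 - math.cos(theta_i)) ** 5
```

    Looking straight down, only about 2% of the light is reflected and the water appears transparent; at grazing angles the reflectance approaches 1, which is why distant water looks mirror-like.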

  1. Application of optimization methods for nuclear energy system performance assessment by the MESSAGE software

    Andrianov, A.A.; Kuptsov, I.S.; Utyanskaya, T.V.

    2016-01-01

    This paper defines the multi-objective optimization and uncertainty treatment modules for the IAEA energy planning software MESSAGE intended for multi-objective optimization and sustainability assessments of innovative nuclear energy systems with account of uncertainty [ru

  2. High energy electromagnetic particle transportation on the GPU

    Canal, P. [Fermilab]; Elvira, D. [Fermilab]; Jun, S. Y. [Fermilab]; Kowalkowski, J. [Fermilab]; Paterno, M. [Fermilab]; Apostolakis, J. [CERN]

    2014-01-01

    We present massively parallel high energy electromagnetic particle transportation through a finely segmented detector on a Graphics Processing Unit (GPU). Simulating events of energetic particle decay in a general-purpose high energy physics (HEP) detector requires intensive computing resources, due to the complexity of the geometry as well as physics processes applied to particles copiously produced by primary collisions and secondary interactions. The recent advent of hardware architectures of many-core or accelerated processors provides the variety of concurrent programming models applicable not only for the high performance parallel computing, but also for the conventional computing intensive application such as the HEP detector simulation. The components of our prototype are a transportation process under a non-uniform magnetic field, geometry navigation with a set of solid shapes and materials, electromagnetic physics processes for electrons and photons, and an interface to a framework that dispatches bundles of tracks in a highly vectorized manner optimizing for spatial locality and throughput. Core algorithms and methods are excerpted from the Geant4 toolkit, and are modified and optimized for the GPU application. Program kernels written in C/C++ are designed to be compatible with CUDA and OpenCL and with the aim to be generic enough for easy porting to future programming models and hardware architectures. To improve throughput by overlapping data transfers with kernel execution, multiple CUDA streams are used. Issues with floating point accuracy, random numbers generation, data structure, kernel divergences and register spills are also considered. Performance evaluation for the relative speedup compared to the corresponding sequential execution on CPU is presented as well.

  3. A cache-friendly sampling strategy for texture-based volume rendering on GPU

    Junpeng Wang

    2017-06-01

    Full Text Available The texture-based volume rendering is a memory-intensive algorithm. Its performance relies heavily on the performance of the texture cache. However, most existing texture-based volume rendering methods blindly map computational resources to texture memory and result in incoherent memory access patterns, causing low cache hit rates in certain cases. The distance between samples taken by threads of an atomic scheduling unit (e.g. a warp of 32 threads in CUDA) of the GPU is a crucial factor that affects the texture cache performance. Based on this fact, we present a new sampling strategy, called Warp Marching, for the ray-casting algorithm of texture-based volume rendering. The effects of different sample organizations and different thread-pixel mappings in the ray-casting algorithm are thoroughly analyzed. Also, a pipeline manner color blending approach is introduced and the power of warp-level GPU operations is leveraged to improve the efficiency of parallel executions on the GPU. In addition, the rendering performance of the Warp Marching is view-independent, and it outperforms existing empty space skipping techniques in scenarios that need to render large dynamic volumes in a low resolution image. Through a series of micro-benchmarking and real-life data experiments, we rigorously analyze our sampling strategies and demonstrate significant performance enhancements over existing sampling methods.

  4. MAGI: a Node.js web service for fast microRNA-Seq analysis in a GPU infrastructure

    Kim, Jihoon; Levy, Eric; Ferbrache, Alex; Stepanowsky, Petra; Farcas, Claudiu; Wang, Shuang; Brunner, Stefan; Bath, Tyler; Wu, Yuan; Ohno-Machado, Lucila

    2014-01-01

    Summary: MAGI is a web service for fast MicroRNA-Seq data analysis in a graphics processing unit (GPU) infrastructure. Using just a browser, users have access to results as web reports in just a few hours, a >600% end-to-end performance improvement over state of the art. MAGI’s salient features are (i) transfer of large input files in native FASTA with Qualities (FASTQ) format through drag-and-drop operations, (ii) rapid prediction of microRNA target genes leveraging parallel computing with GPU ...

  5. Implementation and Comparison of the Lifting 5/3 and 9/7 Algorithms in MatLab on GPU

    Randa Khemiri

    2016-06-01

    Full Text Available In order to accelerate the Discrete Wavelet Transform (DWT), we have implemented and compared the lifting "Le Gall 5/3" and "Cohen-Daubechies-Feauveau 9/7" (CDF 9/7) algorithms on a low-cost NVIDIA GPU. The suggested implementation is realized in MatLab using the in-house parallel computation toolbox (PCT). Our experimental results indicate that the speedup grows with the image size until it reaches a maximum at 2048² pixels; beyond this size the curve decreases. The GPU improves performance by a factor of 2 to 3 compared with the CPU.
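
    For readers unfamiliar with lifting, the Le Gall 5/3 transform reduces to one predict step and one update step per decomposition level. The sketch below is a one-dimensional, single-level, integer (reversible) version in plain Python, following the form used in JPEG 2000 with symmetric boundary extension; it is not the MatLab/GPU code of the paper, and the function names are hypothetical.

```python
def lift53_forward(x):
    """One level of the reversible Le Gall 5/3 lifting transform on a list of ints."""
    if len(x) < 2 or len(x) % 2:
        raise ValueError("expected an even-length signal with at least 2 samples")
    half = len(x) // 2
    even, odd = x[0::2], x[1::2]
    # Predict: each odd sample minus the floor-average of its even neighbors
    # (the right neighbor is mirrored at the end of the signal).
    detail = [odd[n] - ((even[n] + even[min(n + 1, half - 1)]) >> 1)
              for n in range(half)]
    # Update: each even sample plus a rounded quarter of the adjacent details
    # (the left detail is mirrored at the start of the signal).
    approx = [even[n] + ((detail[max(n - 1, 0)] + detail[n] + 2) >> 2)
              for n in range(half)]
    return approx, detail


def lift53_inverse(approx, detail):
    """Undo lift53_forward exactly by reversing the update and predict steps."""
    half = len(approx)
    even = [approx[n] - ((detail[max(n - 1, 0)] + detail[n] + 2) >> 2)
            for n in range(half)]
    odd = [detail[n] + ((even[n] + even[min(n + 1, half - 1)]) >> 1)
           for n in range(half)]
    x = [0] * (2 * half)
    x[0::2], x[1::2] = even, odd
    return x
```

    Every output sample within a step depends only on inputs of the previous step, so each of the two list comprehensions is a data-parallel loop, which is the property a GPU implementation exploits. A two-dimensional transform applies the same steps along rows and then along columns.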

  6. AGSuite: Software to conduct feature analysis of artificial grammar learning performance.

    Cook, Matthew T; Chubala, Chrissy M; Jamieson, Randall K

    2017-10-01

    To simplify the problem of studying how people learn natural language, researchers use the artificial grammar learning (AGL) task. In this task, participants study letter strings constructed according to the rules of an artificial grammar and subsequently attempt to discriminate grammatical from ungrammatical test strings. Although the data from these experiments are usually analyzed by comparing the mean discrimination performance between experimental conditions, this practice discards information about the individual items and participants that could otherwise help uncover the particular features of strings associated with grammaticality judgments. However, feature analysis is tedious to compute, often complicated, and ill-defined in the literature. Moreover, the data violate the assumption of independence underlying standard linear regression models, leading to Type I error inflation. To solve these problems, we present AGSuite, a free Shiny application for researchers studying AGL. The suite's intuitive Web-based user interface allows researchers to generate strings from a database of published grammars, compute feature measures (e.g., Levenshtein distance) for each letter string, and conduct a feature analysis on the strings using linear mixed effects (LME) analyses. The LME analysis solves the inflation of Type I errors that afflicts more common methods of repeated measures regression analysis. Finally, the software can generate a number of graphical representations of the data to support an accurate interpretation of results. We hope the ease and availability of these tools will encourage researchers to take full advantage of item-level variance in their datasets in the study of AGL. We moreover discuss the broader applicability of the tools for researchers looking to conduct feature analysis in any field.

  7. Better Faster Noise with the GPU

    Wyvill, Geoff; Frisvad, Jeppe Revall

    Filtered noise [Perlin 1985] has, for twenty years, been a fundamental tool for creating functional texture and it has many other applications; for example, animating water waves or the motion of grass waving in the wind. Perlin noise suffers from a number of defects and there have been many atte...... attempts to create better or faster noise but Perlin’s ‘Gradient Noise’ has consistently proved to be the best compromise between speed and quality. Our objective was to create a better noise cheaply by use of the GPU....
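
    As background, the gradient-noise idea referred to above can be shown in one dimension. This is a generic sketch, not the authors' GPU method: the integer hash is an arbitrary illustrative choice, and the quintic fade curve is the one from Perlin's later "improved noise" rather than the 1985 original.

```python
import math


def _gradient(ix, seed=0):
    """Hash an integer lattice coordinate to a pseudo-random slope in [-1, 1]."""
    h = (ix * 374761393 + seed * 668265263) & 0xFFFFFFFF
    h = ((h ^ (h >> 13)) * 1274126177) & 0xFFFFFFFF
    h ^= h >> 16
    return h / 0xFFFFFFFF * 2.0 - 1.0


def fade(t):
    """Quintic smoothing curve 6t^5 - 15t^4 + 10t^3 (zero slope at t = 0 and t = 1)."""
    return t * t * t * (t * (t * 6.0 - 15.0) + 10.0)


def gradient_noise_1d(x, seed=0):
    """Gradient noise: blend the linear ramps defined at the two nearest lattice points."""
    x0 = math.floor(x)
    t = x - x0
    n0 = _gradient(x0, seed) * t              # ramp through zero at x0
    n1 = _gradient(x0 + 1, seed) * (t - 1.0)  # ramp through zero at x0 + 1
    return n0 + (n1 - n0) * fade(t)
```

    The value at each sample depends only on `x` and the hash, with no stored state, so every pixel or vertex can be evaluated independently; summing several octaves at doubled frequency and halved amplitude gives the fractal variant commonly used for terrain and water.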

  8. A Monte Carlo neutron transport code for eigenvalue calculations on a dual-GPU system and CUDA environment

    Liu, T.; Ding, A.; Ji, W.; Xu, X. G. [Nuclear Engineering and Engineering Physics, Rensselaer Polytechnic Inst., Troy, NY 12180 (United States); Carothers, C. D. [Dept. of Computer Science, Rensselaer Polytechnic Inst. RPI (United States); Brown, F. B. [Los Alamos National Laboratory (LANL) (United States)

    2012-07-01

    Monte Carlo (MC) method is able to accurately calculate eigenvalues in reactor analysis. Its lengthy computation time can be reduced by general-purpose computing on Graphics Processing Units (GPU), one of the latest parallel computing techniques under development. The method of porting a regular transport code to GPU is usually very straightforward due to the 'embarrassingly parallel' nature of MC code. However, the situation becomes different for eigenvalue calculation in that it will be performed on a generation-by-generation basis and the thread coordination should be explicitly taken care of. This paper presents our effort to develop such a GPU-based MC code in Compute Unified Device Architecture (CUDA) environment. The code is able to perform eigenvalue calculation under simple geometries on a multi-GPU system. The specifics of algorithm design, including thread organization and memory management, were described in detail. The original CPU version of the code was tested on an Intel Xeon X5660 2.8 GHz CPU, and the adapted GPU version was tested on NVIDIA Tesla M2090 GPUs. Double-precision floating point format was used throughout the calculation. The results showed that speedups of 7.0 and 33.3 were obtained for a bare spherical core and a binary slab system, respectively. The speedup factor was further increased by a factor of approximately 2 on a dual GPU system. The upper limit of device-level parallelism was analyzed, and a possible method to enhance the thread-level parallelism was proposed. (authors)
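
    The generation-by-generation structure that complicates the GPU port can be seen in a toy model. The sketch below is not the authors' code: it assumes a one-speed neutron in a bare one-dimensional slab that moves left or right with equal probability, a medium with no scattering (every collision is a capture or a fission), and a fission bank resampled to a fixed size each generation. All names and parameters are hypothetical.

```python
import random


def k_eff_slab(width, sigma_t, p_fission, nu, n_per_gen, n_inactive, n_active, seed=0):
    """Estimate the multiplication factor of a bare 1D slab by power iteration."""
    rng = random.Random(seed)
    bank = [rng.uniform(0.0, width) for _ in range(n_per_gen)]  # initial source guess
    k_sum = 0.0
    for gen in range(n_inactive + n_active):
        next_bank = []
        for x in bank:  # histories within one generation are independent
            direction = 1.0 if rng.random() < 0.5 else -1.0
            x_new = x + direction * rng.expovariate(sigma_t)  # distance to collision
            if not 0.0 <= x_new <= width:
                continue  # neutron leaked out of the slab
            if rng.random() < p_fission:
                # integer number of fission neutrons with mean nu
                n_new = int(nu) + (rng.random() < nu - int(nu))
                next_bank.extend([x_new] * n_new)
        if not next_bank:
            raise RuntimeError("fission bank is empty; the system died out")
        if gen >= n_inactive:  # skip generations used to converge the source
            k_sum += len(next_bank) / n_per_gen
        # Synchronization point: the whole bank must exist before the next
        # generation can start, which is where GPU threads must coordinate.
        bank = [rng.choice(next_bank) for _ in range(n_per_gen)]
    return k_sum / n_active
```

    For a slab many mean free paths wide, leakage is negligible and the estimate approaches `nu * p_fission`; a thin slab leaks most of its neutrons and gives a much smaller value.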

  9. A Monte Carlo neutron transport code for eigenvalue calculations on a dual-GPU system and CUDA environment

    Liu, T.; Ding, A.; Ji, W.; Xu, X. G.; Carothers, C. D.; Brown, F. B.

    2012-01-01

    Monte Carlo (MC) method is able to accurately calculate eigenvalues in reactor analysis. Its lengthy computation time can be reduced by general-purpose computing on Graphics Processing Units (GPU), one of the latest parallel computing techniques under development. The method of porting a regular transport code to GPU is usually very straightforward due to the 'embarrassingly parallel' nature of MC code. However, the situation becomes different for eigenvalue calculation in that it will be performed on a generation-by-generation basis and the thread coordination should be explicitly taken care of. This paper presents our effort to develop such a GPU-based MC code in Compute Unified Device Architecture (CUDA) environment. The code is able to perform eigenvalue calculation under simple geometries on a multi-GPU system. The specifics of algorithm design, including thread organization and memory management, were described in detail. The original CPU version of the code was tested on an Intel Xeon X5660 2.8 GHz CPU, and the adapted GPU version was tested on NVIDIA Tesla M2090 GPUs. Double-precision floating point format was used throughout the calculation. The results showed that speedups of 7.0 and 33.3 were obtained for a bare spherical core and a binary slab system, respectively. The speedup factor was further increased by a factor of ∼2 on a dual GPU system. The upper limit of device-level parallelism was analyzed, and a possible method to enhance the thread-level parallelism was proposed. (authors)

  10. Noniterative Multireference Coupled Cluster Methods on Heterogeneous CPU-GPU Systems

    Bhaskaran-Nair, Kiran; Ma, Wenjing; Krishnamoorthy, Sriram; Villa, Oreste; van Dam, Hubertus JJ; Apra, Edoardo; Kowalski, Karol

    2013-04-09

    A novel parallel algorithm for non-iterative multireference coupled cluster (MRCC) theories, which merges recently introduced reference-level parallelism (RLP) [K. Bhaskaran-Nair, J.Brabec, E. Aprà, H.J.J. van Dam, J. Pittner, K. Kowalski, J. Chem. Phys. 137, 094112 (2012)] with the possibility of accelerating numerical calculations using graphics processing unit (GPU) is presented. We discuss the performance of this algorithm on the example of the MRCCSD(T) method (iterative singles and doubles and perturbative triples), where the corrections due to triples are added to the diagonal elements of the MRCCSD (iterative singles and doubles) effective Hamiltonian matrix. The performance of the combined RLP/GPU algorithm is illustrated on the example of the Brillouin-Wigner (BW) and Mukherjee (Mk) state-specific MRCCSD(T) formulations.

  11. Fast parallel image registration on CPU and GPU for diagnostic classification of Alzheimer's disease.

    Shamonin, Denis P; Bron, Esther E; Lelieveldt, Boudewijn P F; Smits, Marion; Klein, Stefan; Staring, Marius

    2013-01-01

    Nonrigid image registration is an important, but time-consuming task in medical image analysis. In typical neuroimaging studies, multiple image registrations are performed, i.e., for atlas-based segmentation or template construction. Faster image registration routines would therefore be beneficial. In this paper we explore acceleration of the image registration package elastix by a combination of several techniques: (i) parallelization on the CPU, to speed up the cost function derivative calculation; (ii) parallelization on the GPU building on and extending the OpenCL framework from ITKv4, to speed up the Gaussian pyramid computation and the image resampling step; (iii) exploitation of certain properties of the B-spline transformation model; (iv) further software optimizations. The accelerated registration tool is employed in a study on diagnostic classification of Alzheimer's disease and cognitively normal controls based on T1-weighted MRI. We selected 299 participants from the publicly available Alzheimer's Disease Neuroimaging Initiative database. Classification is performed with a support vector machine based on gray matter volumes as a marker for atrophy. We evaluated two types of strategies (voxel-wise and region-wise) that heavily rely on nonrigid image registration. Parallelization and optimization resulted in an acceleration factor of 4-5x on an 8-core machine. Using OpenCL a speedup factor of 2 was realized for computation of the Gaussian pyramids, and 15-60 for the resampling step, for larger images. The voxel-wise and the region-wise classification methods had an area under the receiver operator characteristic curve of 88 and 90%, respectively, both for standard and accelerated registration. We conclude that the image registration package elastix was substantially accelerated, with nearly identical results to the non-optimized version. The new functionality will become available in the next release of elastix as open source under the BSD license.

  12. Fast Parallel Image Registration on CPU and GPU for Diagnostic Classification of Alzheimer's Disease

    Denis P Shamonin

    2014-01-01

    Nonrigid image registration is an important, but time-consuming task in medical image analysis. In typical neuroimaging studies, multiple image registrations are performed, i.e. for atlas-based segmentation or template construction. Faster image registration routines would therefore be beneficial. In this paper we explore acceleration of the image registration package elastix by a combination of several techniques: (i) parallelization on the CPU, to speed up the cost function derivative calculation; (ii) parallelization on the GPU building on and extending the OpenCL framework from ITKv4, to speed up the Gaussian pyramid computation and the image resampling step; (iii) exploitation of certain properties of the B-spline transformation model; (iv) further software optimizations. The accelerated registration tool is employed in a study on diagnostic classification of Alzheimer's disease and cognitively normal controls based on T1-weighted MRI. We selected 299 participants from the publicly available Alzheimer's Disease Neuroimaging Initiative database. Classification is performed with a support vector machine based on gray matter volumes as a marker for atrophy. We evaluated two types of strategies (voxel-wise and region-wise) that heavily rely on nonrigid image registration. Parallelization and optimization resulted in an acceleration factor of 4-5x on an 8-core machine. Using OpenCL a speedup factor of ~2 was realized for computation of the Gaussian pyramids, and 15-60 for the resampling step, for larger images. The voxel-wise and the region-wise classification methods had an area under the receiver operator characteristic curve of 88% and 90%, respectively, both for standard and accelerated registration. We conclude that the image registration package elastix was substantially accelerated, with nearly identical results to the non-optimized version. The new functionality will become available in the next release of elastix as open source under the BSD license.

  13. GPU accelerated flow solver for direct numerical simulation of turbulent flows

    Salvadore, Francesco [CASPUR – via dei Tizii 6/b, 00185 Rome (Italy); Bernardini, Matteo, E-mail: matteo.bernardini@uniroma1.it [Department of Mechanical and Aerospace Engineering, University of Rome ‘La Sapienza’ – via Eudossiana 18, 00184 Rome (Italy); Botti, Michela [CASPUR – via dei Tizii 6/b, 00185 Rome (Italy)

    2013-02-15

    Graphical processing units (GPUs), characterized by significant computing performance, are nowadays very appealing for the solution of computationally demanding tasks in a wide variety of scientific applications. However, to run on GPUs, existing codes need to be ported and optimized, a procedure which is not yet standardized and may require non trivial efforts, even to high-performance computing specialists. In the present paper we accurately describe the porting to CUDA (Compute Unified Device Architecture) of a finite-difference compressible Navier–Stokes solver, suitable for direct numerical simulation (DNS) of turbulent flows. Porting and validation processes are illustrated in detail, with emphasis on computational strategies and techniques that can be applied to overcome typical bottlenecks arising from the porting of common computational fluid dynamics solvers. We demonstrate that a careful optimization work is crucial to get the highest performance from GPU accelerators. The results show that the overall speedup of one NVIDIA Tesla S2070 GPU is approximately 22 compared with one AMD Opteron 2352 Barcelona chip and 11 compared with one Intel Xeon X5650 Westmere core. The potential of GPU devices in the simulation of unsteady three-dimensional turbulent flows is proved by performing a DNS of a spatially evolving compressible mixing layer.

  14. A New GPU-Enabled MODTRAN Thermal Model for the PLUME TRACKER Volcanic Emission Analysis Toolkit

    Acharya, P. K.; Berk, A.; Guiang, C.; Kennett, R.; Perkins, T.; Realmuto, V. J.

    2013-12-01

    Real-time quantification of volcanic gaseous and particulate releases is important for (1) recognizing rapid increases in SO2 gaseous emissions which may signal an impending eruption; (2) characterizing ash clouds to enable safe and efficient commercial aviation; and (3) quantifying the impact of volcanic aerosols on climate forcing. The Jet Propulsion Laboratory (JPL) has developed state-of-the-art algorithms, embedded in their analyst-driven Plume Tracker toolkit, for performing SO2, NH3, and CH4 retrievals from remotely sensed multi-spectral Thermal InfraRed spectral imagery. While Plume Tracker provides accurate results, it typically requires extensive analyst time. A major bottleneck in this processing is the relatively slow but accurate FORTRAN-based MODTRAN atmospheric and plume radiance model, developed by Spectral Sciences, Inc. (SSI). To overcome this bottleneck, SSI in collaboration with JPL, is porting these slow thermal radiance algorithms onto massively parallel, relatively inexpensive and commercially-available GPUs. This paper discusses SSI's efforts to accelerate the MODTRAN thermal emission algorithms used by Plume Tracker. Specifically, we are developing a GPU implementation of the Curtis-Godson averaging and the Voigt in-band transmittances from near line center molecular absorption, which comprise the major computational bottleneck. The transmittance calculations were decomposed into separate functions, individually implemented as GPU kernels, and tested for accuracy and performance relative to the original CPU code. Speedup factors of 14 to 30× were realized for individual processing components on an NVIDIA GeForce GTX 295 graphics card with no loss of accuracy. Due to the separate host (CPU) and device (GPU) memory spaces, a redesign of the MODTRAN architecture was required to ensure efficient data transfer between host and device, and to facilitate high parallel throughput. Currently, we are incorporating the separate GPU kernels into a

  15. Performance Assessment of a Gnss-Based Troposphere Path Delay Estimation Software

    Mariotti, Gilles; Avanzi, Alessandro; Graziani, Alberto; Tortora, Paolo

    2013-04-01

    perform the differentiation. The code relies on several IGS products, like SP3 precise orbits and SINEX positions available for the master stations in order to remove several error components, while the phase ambiguities (both wide and narrow lane) are resolved using the modified LAMBDA (MLAMBDA) method. The double-differenced data are then processed by a Kalman Filter that estimates the contingent positioning error of the rover station, its Zenith Wet Delay (ZWD) and the residual phase ambiguities. On the other hand, the Zenith Hydrostatic Delay (ZHD) is preliminarily computed using a mathematical model, based on surface meteorological measurements. The final product of the developed code is an output file containing the estimated ZWD and ZHD time-series in a format compatible with the major orbit determination software, e.g. the CSP card format (TRK-2-23) used by NASA JPL's Orbit Determination Program.
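
    The abstract does not name the mathematical model used for the Zenith Hydrostatic Delay. As an assumption, the sketch below uses the widely cited Saastamoinen formula, which computes ZHD from surface pressure, latitude and station height; the function name and argument names are hypothetical.

```python
import math


def saastamoinen_zhd(pressure_hpa, latitude_deg, height_km):
    """Zenith Hydrostatic Delay in metres (Saastamoinen model, assumed here)."""
    # Correction for the variation of gravity with latitude and height.
    gravity_term = (1.0
                    - 0.00266 * math.cos(2.0 * math.radians(latitude_deg))
                    - 0.00028 * height_km)
    return 0.0022768 * pressure_hpa / gravity_term


if __name__ == "__main__":
    # Standard sea-level pressure at mid latitude gives roughly 2.3 m.
    print(round(saastamoinen_zhd(1013.25, 45.0, 0.0), 3))
```

    For standard sea-level pressure the hydrostatic delay is about 2.3 m, which is why it is modelled first and only the much smaller wet delay is left to the filter.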

  16. Infrastructure for Multiphysics Software Integration in High Performance Computing-Aided Science and Engineering

    Campbell, Michael T. [Illinois Rocstar LLC, Champaign, IL (United States); Safdari, Masoud [Illinois Rocstar LLC, Champaign, IL (United States); Kress, Jessica E. [Illinois Rocstar LLC, Champaign, IL (United States); Anderson, Michael J. [Illinois Rocstar LLC, Champaign, IL (United States); Horvath, Samantha [Illinois Rocstar LLC, Champaign, IL (United States); Brandyberry, Mark D. [Illinois Rocstar LLC, Champaign, IL (United States); Kim, Woohyun [Illinois Rocstar LLC, Champaign, IL (United States); Sarwal, Neil [Illinois Rocstar LLC, Champaign, IL (United States); Weisberg, Brian [Illinois Rocstar LLC, Champaign, IL (United States)

    2016-10-15

    The project described in this report constructed and exercised an innovative multiphysics coupling toolkit called the Illinois Rocstar MultiPhysics Application Coupling Toolkit (IMPACT). IMPACT is an open source, flexible, natively parallel infrastructure for coupling multiple uniphysics simulation codes into multiphysics computational systems. IMPACT works with codes written in several high-performance-computing (HPC) programming languages, and is designed from the beginning for HPC multiphysics code development. It is designed to be minimally invasive to the individual physics codes being integrated, and has few requirements on those physics codes for integration. The goal of IMPACT is to provide the support needed to enable coupling existing tools together in unique and innovative ways to produce powerful new multiphysics technologies without extensive modification and rewrite of the physics packages being integrated. There are three major outcomes from this project: 1) construction, testing, application, and open-source release of the IMPACT infrastructure, 2) production of example open-source multiphysics tools using IMPACT, and 3) identification and engagement of interested organizations in the tools and applications resulting from the project. This last outcome represents the incipient development of a user community and application ecosystem being built using IMPACT. Multiphysics coupling standardization can only come from organizations working together to define needs and processes that span the space of necessary multiphysics outcomes, which Illinois Rocstar plans to continue driving toward. The IMPACT system, including source code, documentation, and test problems, is now available through the public gitHUB.org system to anyone interested in multiphysics code coupling. Many of the basic documents explaining use and architecture of IMPACT are also attached as appendices to this document. Online HTML documentation is available through the gitHUB site

  17. Array abstractions for GPU programming

    Dybdal, Martin

    The shift towards massively parallel hardware platforms for high-performance computing tasks has introduced a need for improved programming models that facilitate ease of reasoning for both users and compiler optimization. A promising direction is the field of functional data-parallel programming......, for which functional invariants can be utilized by optimizing compilers to perform large program transformations automatically. However, the previous work in this area allows users only limited ability to reason about the performance of algorithms. For this reason, such languages have yet to see wide...... industrial adoption. We present two programming languages that attempt both to support industrial applications and to provide reasoning tools for hierarchical data-parallel architectures, such as GPUs. First, we present TAIL, an array based intermediate language and compiler framework for compiling a large...

  18. APEnet+: a 3D Torus network optimized for GPU-based HPC Systems

    Ammendola, R; Biagioni, A; Frezza, O; Lo Cicero, F; Lonardo, A; Paolucci, P S; Rossetti, D; Simula, F; Tosoratto, L; Vicini, P

    2012-01-01

    In the supercomputing arena, the strong rise of GPU-accelerated clusters is a matter of fact. Within INFN, we proposed an initiative - the QUonG project - whose aim is to deploy a high performance computing system dedicated to scientific computations leveraging on commodity multi-core processors coupled with latest generation GPUs. The inter-node interconnection system is based on a point-to-point, high performance, low latency 3D torus network which is built in the framework of the APEnet+ project. It takes the form of an FPGA-based PCIe network card exposing six full bidirectional links running at 34 Gbps each that implements the RDMA protocol. In order to enable significant access latency reduction for inter-node data transfer, a direct network-to-GPU interface was built. The specialized hardware blocks, integrated in the APEnet+ board, provide support for GPU-initiated communications using the so called PCIe peer-to-peer (P2P) transactions. This development is made in close collaboration with the GPU vendor NVIDIA. The final shape of a complete QUonG deployment is an assembly of standard 42U racks, each one capable of 80 TFLOPS/rack of peak performance, at a cost of 5 k€/TFLOPS and for an estimated power consumption of 25 kW/rack. In this paper we report on the status of final rack deployment and on the R&D activities for 2012 that will focus on performance enhancement of the APEnet+ hardware through the adoption of new generation 28 nm FPGAs allowing the implementation of PCIe Gen3 host interface and the addition of new fault tolerance-oriented capabilities.

  19. APEnet+: a 3D Torus network optimized for GPU-based HPC Systems

    Ammendola, R [INFN Tor Vergata (Italy); Biagioni, A; Frezza, O; Lo Cicero, F; Lonardo, A; Paolucci, P S; Rossetti, D; Simula, F; Tosoratto, L; Vicini, P [INFN Roma (Italy)

    2012-12-13

    In the supercomputing arena, the strong rise of GPU-accelerated clusters is a matter of fact. Within INFN, we proposed an initiative - the QUonG project - whose aim is to deploy a high performance computing system dedicated to scientific computations leveraging on commodity multi-core processors coupled with latest generation GPUs. The inter-node interconnection system is based on a point-to-point, high performance, low latency 3D torus network which is built in the framework of the APEnet+ project. It takes the form of an FPGA-based PCIe network card exposing six full bidirectional links running at 34 Gbps each that implements the RDMA protocol. In order to enable significant access latency reduction for inter-node data transfer, a direct network-to-GPU interface was built. The specialized hardware blocks, integrated in the APEnet+ board, provide support for GPU-initiated communications using the so called PCIe peer-to-peer (P2P) transactions. This development is made in close collaboration with the GPU vendor NVIDIA. The final shape of a complete QUonG deployment is an assembly of standard 42U racks, each one capable of 80 TFLOPS/rack of peak performance, at a cost of 5 k€/TFLOPS and for an estimated power consumption of 25 kW/rack. In this paper we report on the status of final rack deployment and on the R&D activities for 2012 that will focus on performance enhancement of the APEnet+ hardware through the adoption of new generation 28 nm FPGAs allowing the implementation of PCIe Gen3 host interface and the addition of new fault tolerance-oriented capabilities.

  20. Uso de tarjetas GPU para acelerar el procesado de señales [Use of GPU cards to accelerate signal processing]

    Amat Sanz, David

    2017-01-01

    This Bachelor's Degree Final Project aims to analyze and implement another way to process digital signals, improving their performance and speed of execution. DSP and FPGA are the most commonly used elements for any kind of signal processing. This project focuses on the use of graphics cards (GPU) to exploit to the maximum the parallelism that is available today. Current processors (CPUs) have a few cores and work sequentially which can be very time consuming if large amounts of data are bein...

  1. Multi-Kepler GPU vs. multi-Intel MIC for spin systems simulations

    Bernaschi, M.; Bisson, M.; Salvadore, F.

    2014-10-01

    We present and compare the performances of two many-core architectures: the Nvidia Kepler and the Intel MIC both in a single system and in cluster configuration for the simulation of spin systems. As a benchmark we consider the time required to update a single spin of the 3D Heisenberg spin glass model by using the Over-relaxation algorithm. We present data also for a traditional high-end multi-core architecture: the Intel Sandy Bridge. The results show that although on the two Intel architectures it is possible to use basically the same code, the performances of an Intel MIC change dramatically depending on (apparently) minor details. Another issue is that to obtain a reasonable scalability with the Intel Phi coprocessor (Phi is the coprocessor that implements the MIC architecture) in a cluster configuration it is necessary to use the so-called offload mode which reduces the performances of the single system. As to the GPU, the Kepler architecture offers a clear advantage with respect to the previous Fermi architecture maintaining exactly the same source code. Scalability of the multi-GPU implementation remains very good by using the CPU as a communication co-processor of the GPU. All source codes are provided for inspection and for double-checking the results.
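
    The benchmark kernel named in this abstract, a single-spin over-relaxation update, can be written in a few lines. The sketch below is an illustration rather than the authors' code: it uses the textbook form of the move, which reflects the spin about its local field, and it assumes the local field (the coupling-weighted sum of neighbouring spins) has already been computed. The function name is hypothetical.

```python
def overrelax(spin, field):
    """Reflect a 3-component spin about its local field.

    The move s' = 2 (s.h / h.h) h - s keeps both the spin length and
    the energy term s.h unchanged.
    """
    s_dot_h = sum(s * h for s, h in zip(spin, field))
    h_dot_h = sum(h * h for h in field)
    scale = 2.0 * s_dot_h / h_dot_h
    return tuple(scale * h - s for s, h in zip(spin, field))


if __name__ == "__main__":
    print(overrelax((1.0, 0.0, 0.0), (1.0, 1.0, 0.0)))  # (0.0, 1.0, 0.0)
```

    The update costs only a handful of floating-point operations per spin, so the time per update tends to be dominated by memory access, which may be part of why seemingly minor implementation details matter so much.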

  2. Algorithms of GPU-enabled reactive force field (ReaxFF) molecular dynamics.

    Zheng, Mo; Li, Xiaoxia; Guo, Li

    2013-04-01

    Reactive force field (ReaxFF), a recent and novel bond order potential, allows for reactive molecular dynamics (ReaxFF MD) simulations for modeling larger and more complex molecular systems involving chemical reactions when compared with computation intensive quantum mechanical methods. However, ReaxFF MD can be approximately 10-50 times slower than classical MD due to its explicit modeling of bond forming and breaking, the dynamic charge equilibration at each time-step, and its one order smaller time-step than the classical MD, all of which pose significant computational challenges in simulation capability to reach spatio-temporal scales of nanometers and nanoseconds. The very recent advances of graphics processing unit (GPU) provide not only highly favorable performance for GPU enabled MD programs compared with CPU implementations but also an opportunity to manage with the computing power and memory demanding nature imposed on computer hardware by ReaxFF MD. In this paper, we present the algorithms of GMD-Reax, the first GPU enabled ReaxFF MD program with significantly improved performance surpassing CPU implementations on desktop workstations. The performance of GMD-Reax has been benchmarked on a PC equipped with a NVIDIA C2050 GPU for coal pyrolysis simulation systems with atoms ranging from 1378 to 27,283. GMD-Reax achieved speedups as high as 12 times faster than Duin et al.'s FORTRAN codes in Lammps on 8 CPU cores and 6 times faster than the Lammps' C codes based on PuReMD in terms of the simulation time per time-step averaged over 100 steps. GMD-Reax could be used as a new and efficient computational tool for exploiting very complex molecular reactions via ReaxFF MD simulation on desktop workstations. Copyright © 2013 Elsevier Inc. All rights reserved.

  3. Supporting motivation, task performance and retention in video tutorials for software training

    van der Meij, Hans; van der Meij, Jan; Voerman, Tessa; Duipmans, Evert

    2017-01-01

    Video tutorials for software training are becoming more and more popular, but their construction and effectiveness is understudied. This paper presents a theoretical model that combines demonstration-based training (DBT) and multimedia learning theory as a framework for design. The study

  4. Cross Sectional Study of Agile Software Development Methods and Project Performance

    Lambert, Tracy

    2011-01-01

    Agile software development methods, characterized by delivering customer value via incremental and iterative time-boxed development processes, have moved into the mainstream of the Information Technology (IT) industry. However, despite a growing body of research which suggests that a predictive manufacturing approach, with big up-front…

  5. RUMD: A general purpose molecular dynamics package optimized to utilize GPU hardware down to a few thousand particles

    Bailey, Nicholas; Ingebrigtsen, Trond; Hansen, Jesper Schmidt

    2017-01-01

    RUMD is a general purpose, high-performance molecular dynamics (MD) simulation package running on graphical processing units (GPU’s). RUMD addresses the challenge of utilizing the many-core nature of modern GPU hardware when simulating small to medium system sizes (roughly from a few thousand up...

  6. Development of a software application to evaluate the performance and energy losses of grid-connected photovoltaic systems

    Trillo-Montero, D.; Santiago, I.; Luna-Rodriguez, J.J.; Real-Calvo, R.

    2014-01-01

    Highlights: • Software application to perform an automated analysis of grid-connected PV systems. • It integrates data from all devices registering data on typical PV installations. • Flexible to analyze installations with different configurations and components. • An analysis of two grid-connected PV systems located in Andalusia, was performed. • Temperature losses in summer months varying between 15% and 25% of energy production. - Abstract: The aim of this paper was to design and develop a software application that enables users to perform an automated analysis of data from the monitoring of grid-connected photovoltaic (PV) systems. This application integrates data from all devices already in operation such as environmental sensors, inverters and meters, which record information on typical PV installations. This required the development of a Relational Database Management System (RDBMS), consisting of a series of linked databases, enabling all PV system information to be stored; and a software, called S·lar, which enables all information from the monitoring to be automatically migrated to the database as well as determining some standard magnitudes related to performances and losses of PV installation components at different time scales. A visualization tool, which is both graphical and numerical, makes access to all of the information be a simple task. Moreover, the application enables relationships between parameters and/or magnitudes to be easily established. Furthermore, it can perform a preliminary analysis of the influence of PV installations on the distribution grids where the produced electricity is injected. The operation of such a software application was implemented by performing the analysis of two grid-connected PV installations located in Andalusia, Spain, via data monitoring therein. The monitoring took place from January 2011 to May 2012

  7. Free software for performing physical analysis of systems for digital radiography and mammography

    Donini, Bruno; Lanconelli, Nico, E-mail: nico.lanconelli@unibo.it [Alma Mater Studiorum, Department of Physics and Astronomy, University of Bologna, Bologna 40127 (Italy); Rivetti, Stefano [Fisica Medica, Ospedale di Sassuolo S.p.A., Sassuolo 41049 (Italy); Bertolini, Marco [Medical Physics Unit, Azienda Ospedaliera ASMN, Istituto di Ricovero e Cura a Carattere Scientifico, Reggio Emilia 42123 (Italy)

    2014-05-15

    Purpose: In this paper, the authors present a free software for assisting users in achieving the physical characterization of x-ray digital systems and image quality checks. Methods: The program was developed as a plugin of a well-known public-domain suite ImageJ. The software can assist users in calculating various physical parameters such as the response curve (also termed signal transfer property), modulation transfer function (MTF), noise power spectra (NPS), and detective quantum efficiency (DQE). It also includes the computation of some image quality checks: defective pixel analysis, uniformity, dark analysis, and lag. Results: The software was made available in 2009 and has been used during the last couple of years by many users who gave us valuable feedback for improving its usability. It was tested for achieving the physical characterization of several clinical systems for digital radiography and mammography. Various published papers made use of the outcomes of the plugin. Conclusions: This software is potentially beneficial to a variety of users: physicists working in hospitals, staff working in radiological departments, such as medical physicists, physicians, engineers. The plugin, together with a brief user manual, are freely available and can be found online ( http://www.medphys.it/downloads.htm ). With our plugin users can estimate all three most important parameters used for physical characterization (MTF, NPS, and also DQE). The plugin can run on any operating system equipped with ImageJ suite. The authors validated the software by comparing MTF and NPS curves on a common set of images with those obtained with other dedicated programs, achieving a very good agreement.

  8. Free software for performing physical analysis of systems for digital radiography and mammography.

    Donini, Bruno; Rivetti, Stefano; Lanconelli, Nico; Bertolini, Marco

    2014-05-01

    In this paper, the authors present a free software for assisting users in achieving the physical characterization of x-ray digital systems and image quality checks. The program was developed as a plugin of a well-known public-domain suite ImageJ. The software can assist users in calculating various physical parameters such as the response curve (also termed signal transfer property), modulation transfer function (MTF), noise power spectra (NPS), and detective quantum efficiency (DQE). It also includes the computation of some image quality checks: defective pixel analysis, uniformity, dark analysis, and lag. The software was made available in 2009 and has been used during the last couple of years by many users who gave us valuable feedback for improving its usability. It was tested for achieving the physical characterization of several clinical systems for digital radiography and mammography. Various published papers made use of the outcomes of the plugin. This software is potentially beneficial to a variety of users: physicists working in hospitals, staff working in radiological departments, such as medical physicists, physicians, engineers. The plugin, together with a brief user manual, are freely available and can be found online (www.medphys.it/downloads.htm). With our plugin users can estimate all three most important parameters used for physical characterization (MTF, NPS, and also DQE). The plugin can run on any operating system equipped with ImageJ suite. The authors validated the software by comparing MTF and NPS curves on a common set of images with those obtained with other dedicated programs, achieving a very good agreement.

  9. Free software for performing physical analysis of systems for digital radiography and mammography

    Donini, Bruno; Lanconelli, Nico; Rivetti, Stefano; Bertolini, Marco

    2014-01-01

    Purpose: In this paper, the authors present a free software for assisting users in achieving the physical characterization of x-ray digital systems and image quality checks. Methods: The program was developed as a plugin of a well-known public-domain suite ImageJ. The software can assist users in calculating various physical parameters such as the response curve (also termed signal transfer property), modulation transfer function (MTF), noise power spectra (NPS), and detective quantum efficiency (DQE). It also includes the computation of some image quality checks: defective pixel analysis, uniformity, dark analysis, and lag. Results: The software was made available in 2009 and has been used during the last couple of years by many users who gave us valuable feedback for improving its usability. It was tested for achieving the physical characterization of several clinical systems for digital radiography and mammography. Various published papers made use of the outcomes of the plugin. Conclusions: This software is potentially beneficial to a variety of users: physicists working in hospitals, staff working in radiological departments, such as medical physicists, physicians, engineers. The plugin, together with a brief user manual, are freely available and can be found online ( http://www.medphys.it/downloads.htm ). With our plugin users can estimate all three most important parameters used for physical characterization (MTF, NPS, and also DQE). The plugin can run on any operating system equipped with ImageJ suite. The authors validated the software by comparing MTF and NPS curves on a common set of images with those obtained with other dedicated programs, achieving a very good agreement

  10. AR2, a novel automatic muscle artifact reduction software method for ictal EEG interpretation: Validation and comparison of performance with commercially available software [version 2; referees: 2 approved]

    Shennan Aibel Weiss

    2017-04-01

    Objective: To develop a novel software method (AR2) for reducing muscle contamination of ictal scalp electroencephalogram (EEG), and validate this method on the basis of its performance in comparison to a commercially available software method (AR1) to accurately depict seizure-onset location. Methods: A blinded investigation used 23 EEG recordings of seizures from 8 patients. Each recording was uninterpretable with digital filtering because of muscle artifact and processed using AR1 and AR2 and reviewed by 26 EEG specialists. EEG readers assessed seizure-onset time, lateralization, and region, and specified confidence for each determination. The two methods were validated on the basis of the number of readers able to render assignments, confidence, the intra-class correlation (ICC), and agreement with other clinical findings. Results: Among the 23 seizures, two-thirds of the readers were able to delineate seizure-onset time in 10 of 23 using AR1, and 15 of 23 using AR2 (p<0.01). Fewer readers could lateralize seizure-onset (p<0.05). The confidence measures of the assignments were low (probable-unlikely), but increased using AR2 (p<0.05). The ICC for identifying the time of seizure-onset was 0.15 (95% confidence interval (CI), 0.11-0.18) using AR1 and 0.26 (95% CI 0.21-0.30) using AR2. The EEG interpretations were often consistent with behavioral, neurophysiological, and neuro-radiological findings, with left sided assignments correct in 95.9% (CI 85.7-98.9%, n=4) of cases using AR2, and 91.9% (77.0-97.5%, n=4) of cases using AR1. Conclusions: EEG artifact reduction methods for localizing seizure-onset do not result in high rates of interpretability, reader confidence, and inter-reader agreement. However, the assignments by groups of readers are often congruent with other clinical data. Utilization of the AR2 software method may improve the validity of ictal EEG artifact reduction.

  11. Work-Efficient Parallel Skyline Computation for the GPU

    Bøgh, Kenneth Sejdenfaden; Chester, Sean; Assent, Ira

    2015-01-01

    offers the potential for parallelizing skyline computation across thousands of cores. However, attempts to port skyline algorithms to the GPU have prioritized throughput and failed to outperform sequential algorithms. In this paper, we introduce a new skyline algorithm, designed for the GPU, that uses... a global, static partitioning scheme. With the partitioning, we can permit controlled branching to exploit transitive relationships and avoid most point-to-point comparisons. The result is a non-traditional GPU algorithm, SkyAlign, that prioritizes work-efficiency and respectable throughput, rather than...
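
    For readers unfamiliar with the term, a skyline is the set of points that no other point dominates. The sketch below is only the brute-force baseline built from the point-to-point comparisons the abstract says SkyAlign mostly avoids; it is not SkyAlign and does no partitioning. The names `dominates`, `skyline`, and `hotels`, and the "smaller is better" convention, are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every dimension and strictly
    better in at least one (here, smaller values are better)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def skyline(points: Sequence[Point]) -> List[Point]:
    """Naive O(n^2) skyline: keep each point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical data: (price, distance to beach); lower is better for both.
hotels = [(50.0, 8.0), (80.0, 2.0), (60.0, 9.0), (90.0, 1.0), (85.0, 3.0)]
print(skyline(hotels))  # [(50.0, 8.0), (80.0, 2.0), (90.0, 1.0)]
```

    Every point is compared against every other point, so the work grows quadratically with the input size, which is the cost a work-efficient algorithm tries to avoid.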

  12. Data assimilation using a GPU accelerated path integral Monte Carlo approach

    Quinn, John C.; Abarbanel, Henry D. I.

    2011-09-01

    The answers to data assimilation questions can be expressed as path integrals over all possible state and parameter histories. We show how these path integrals can be evaluated numerically using a Markov Chain Monte Carlo method designed to run in parallel on a graphics processing unit (GPU). We demonstrate the application of the method to an example with a transmembrane voltage time series of a simulated neuron as an input, and using a Hodgkin-Huxley neuron model. By taking advantage of GPU computing, we gain a parallel speedup factor of up to about 300, compared to an equivalent serial computation on a CPU, with performance increasing as the length of the observation time used for data assimilation increases.
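
    The idea of sampling whole state histories can be illustrated with a toy example. The sketch below is a serial, CPU-only Metropolis sampler over short paths, assuming a scalar linear model x[t+1] = a * x[t] in place of the Hodgkin-Huxley neuron and a quadratic "action" made of a measurement-error term and a model-error term. The function names, parameter values, and the linear model are assumptions for illustration and are not the authors' GPU method.

```python
import math
import random
from typing import List


def action(path: List[float], obs: List[float], a: float,
           sigma_m: float, sigma_f: float) -> float:
    """Negative log-probability of a path (up to a constant):
    measurement error plus model error for x[t+1] = a * x[t]."""
    meas = sum((x - y) ** 2 for x, y in zip(path, obs)) / (2 * sigma_m ** 2)
    model = sum((path[t + 1] - a * path[t]) ** 2
                for t in range(len(path) - 1)) / (2 * sigma_f ** 2)
    return meas + model


def mean_path(obs: List[float], a: float, sigma_m: float, sigma_f: float,
              sweeps: int = 5000, step: float = 0.1, seed: int = 0) -> List[float]:
    """Estimate the expected path with single-site Metropolis updates."""
    rng = random.Random(seed)
    path = [0.0] * len(obs)
    current = action(path, obs, a, sigma_m, sigma_f)
    totals = [0.0] * len(obs)
    burn_in, kept = sweeps // 5, 0
    for sweep in range(sweeps):
        for t in range(len(path)):
            old = path[t]
            path[t] = old + rng.uniform(-step, step)  # symmetric proposal
            proposed = action(path, obs, a, sigma_m, sigma_f)
            # Accept with probability min(1, exp(current - proposed)).
            if proposed <= current or rng.random() < math.exp(current - proposed):
                current = proposed
            else:
                path[t] = old  # reject: restore the previous value
        if sweep >= burn_in:
            kept += 1
            for t, x in enumerate(path):
                totals[t] += x
    return [s / kept for s in totals]


# Noise-free observations generated by the same linear model (a = 0.9).
observations = [0.9 ** t for t in range(5)]
estimate = mean_path(observations, a=0.9, sigma_m=0.1, sigma_f=0.1)
print([round(x, 2) for x in estimate])
```

    Because the observations here satisfy the model exactly, the action is minimized at the observed path, so the sampled mean should land close to it. Each site update is independent of most other sites, which hints at why this kind of computation can be spread across many GPU threads.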

  13. Convolution of large 3D images on GPU and its decomposition

    Karas, Pavel; Svoboda, David

    2011-12-01

    In this article, we propose a method for computing the convolution of large 3D images. The convolution is performed in the frequency domain using the convolution theorem. The algorithm is accelerated on a graphics card by means of the CUDA parallel computing model. Convolution is decomposed in the frequency domain using the decimation-in-frequency algorithm. We pay attention to keeping our approach efficient in terms of both time and memory consumption, and also in terms of memory transfers between CPU and GPU, which have a significant influence on overall computational time. We also study the implementation on multiple GPUs and compare the results between the multi-GPU and multi-CPU implementations.
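
    The convolution theorem states that convolution in the spatial domain equals pointwise multiplication in the frequency domain. The sketch below shows this on a one-dimensional signal with a small recursive radix-2 decimation-in-frequency FFT in pure Python. It is a CPU illustration under simplifying assumptions (1D data, power-of-two length, circular convolution), not the authors' CUDA implementation for 3D images, and all function names are hypothetical.

```python
import cmath
from typing import List, Sequence


def fft_dif(x: Sequence[complex]) -> List[complex]:
    """Recursive radix-2 decimation-in-frequency FFT (length must be 2^k)."""
    n = len(x)
    if n == 1:
        return list(x)
    half = n // 2
    # Butterfly first: sums yield even-indexed outputs,
    # twiddled differences yield odd-indexed outputs.
    sums = [x[k] + x[k + half] for k in range(half)]
    diffs = [(x[k] - x[k + half]) * cmath.exp(-2j * cmath.pi * k / n)
             for k in range(half)]
    out = [0j] * n
    out[0::2] = fft_dif(sums)
    out[1::2] = fft_dif(diffs)
    return out


def ifft_dif(spectrum: Sequence[complex]) -> List[complex]:
    """Inverse FFT via the conjugation identity."""
    n = len(spectrum)
    forward = fft_dif([v.conjugate() for v in spectrum])
    return [v.conjugate() / n for v in forward]


def circular_convolve_fft(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Circular convolution by pointwise multiplication of spectra."""
    fa = fft_dif([complex(v) for v in a])
    fb = fft_dif([complex(v) for v in b])
    return [v.real for v in ifft_dif([p * q for p, q in zip(fa, fb)])]


def circular_convolve_direct(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Reference O(n^2) circular convolution for comparison."""
    n = len(a)
    return [sum(a[j] * b[(i - j) % n] for j in range(n)) for i in range(n)]


signal = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]
kernel = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
via_fft = circular_convolve_fft(signal, kernel)
direct = circular_convolve_direct(signal, kernel)
print([round(v, 6) for v in via_fft])
```

    Both routes give the same values up to floating-point rounding. The signal is zero-padded so that the circular result matches the ordinary linear convolution of the non-zero parts.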

  14. A Simple GPU-Accelerated Two-Dimensional MUSCL-Hancock Solver for Ideal Magnetohydrodynamics

    Bard, Christopher; Dorelli, John C.

    2013-01-01

    We describe our experience using NVIDIA's CUDA (Compute Unified Device Architecture) C programming environment to implement a two-dimensional second-order MUSCL-Hancock ideal magnetohydrodynamics (MHD) solver on a GTX 480 Graphics Processing Unit (GPU). Taking a simple approach in which the MHD variables are stored exclusively in the global memory of the GTX 480 and accessed in a cache-friendly manner (without further optimizing memory access by, for example, staging data in the GPU's faster shared memory), we achieved a maximum speed-up of approximately 126 for a 1024 × 1024 grid relative to the sequential C code running on a single Intel Nehalem (2.8 GHz) core. This speed-up is consistent with simple estimates based on the known floating-point performance, memory throughput, and parallel processing capacity of the GTX 480.

  15. The Performance Evaluation of Multi-Image 3d Reconstruction Software with Different Sensors

    Mousavi, V.; Khosravi, M.; Ahmadi, M.; Noori, N.; Naveh, A. Hosseini; Varshosaz, M.

    2015-12-01

    Today, multi-image 3D reconstruction is an active research field, and generating three-dimensional models of objects is one of the most discussed issues in Photogrammetry and Computer Vision; it can be accomplished using range-based or image-based methods. The very accurate and dense point clouds generated by range-based methods such as structured light systems and laser scanners have established them as reliable tools in industry. Image-based 3D digitization methodologies offer the option of reconstructing an object from a set of unordered images that depict it from different viewpoints. Because their hardware requirements are limited to a digital camera and a computer system, they are an attractive 3D digitization approach. Consequently, although range-based methods are generally very accurate, image-based methods are low-cost and can be easily used by non-professional users. One of the factors affecting the accuracy of the model obtained with image-based methods is the software and algorithm used to generate the three-dimensional model. These algorithms are provided in the form of commercial software, open-source software, and web-based services. Another important factor in the accuracy of the obtained model is the type of sensor used. Given the availability of mobile sensors to the public, the popularity of professional sensors, and the advent of stereo sensors, a comparison of these three sensor types plays an effective role in evaluating and finding the optimal method for generating three-dimensional models. Much research has been done to identify suitable software and algorithms for achieving an accurate and complete model; however, little attention has been paid to the type of sensor used and its effect on the quality of the final model. The purpose of this paper is to examine and propose an appropriate combination of sensor and software that provides a complete model with the highest accuracy. To do this, different software, used in previous studies, were compared and...

  16. Spectral analysis software improves confidence in plant and soil water stable isotope analyses performed by isotope ratio infrared spectroscopy (IRIS).

    West, A G; Goldsmith, G R; Matimati, I; Dawson, T E

    2011-08-30

    Previous studies have demonstrated the potential for large errors to occur when analyzing waters containing organic contaminants using isotope ratio infrared spectroscopy (IRIS). In an attempt to address this problem, IRIS manufacturers now provide post-processing spectral analysis software capable of identifying samples with the types of spectral interference that compromise their stable isotope analysis. Here we report two independent tests of this post-processing spectral analysis software on two IRIS systems, OA-ICOS (Los Gatos Research Inc.) and WS-CRDS (Picarro Inc.). Following a similar methodology to a previous study, we cryogenically extracted plant leaf water and soil water and measured the δ²H and δ¹⁸O values of identical samples by isotope ratio mass spectrometry (IRMS) and IRIS. As an additional test, we analyzed plant stem waters and tap waters by IRMS and IRIS in an independent laboratory. For all tests we assumed that the IRMS value represented the "true" value against which we could compare the stable isotope results from the IRIS methods. Samples showing significant deviations from the IRMS value (>2σ) were considered to be contaminated and representative of spectral interference in the IRIS measurement. Over the two studies, 83% of plant species were considered contaminated on OA-ICOS and 58% on WS-CRDS. Post-analysis, spectra were analyzed using the manufacturer's spectral analysis software, in order to see if the software correctly identified contaminated samples. In our tests the software performed well, identifying all the samples with major errors. However, some false negatives indicate that user evaluation and testing of the software are necessary. Repeat sampling of plants showed considerable variation in the discrepancies between IRIS and IRMS. As such, we recommend that spectral analysis of IRIS data be incorporated into standard post-processing routines. Furthermore, we suggest that the results from spectral analysis be...

  17. GPU-accelerated ray-tracing for real-time treatment planning

    Heinrich, H; Ziegenhein, P; Kamerling, C P; Oelfke, U; Froening, H

    2014-01-01

    Dose calculation methods in radiotherapy treatment planning require the radiological depth information of the voxels that represent the patient volume to correct for tissue inhomogeneities. This information is acquired by time-consuming ray-tracing-based calculations. For treatment planning scenarios with changing geometries and real-time constraints, this is a severe bottleneck. We developed an algorithm for the graphics processing unit (GPU) that uses a ray-matrix approach to reduce the number of rays to trace. Furthermore, we investigated the impact of different strategies for accessing memory in kernel implementations, as well as strategies for rapid data transfers between main memory and the memory of the graphics device. Our study included the overlapping of computations and memory transfers to reduce the overall runtime using Hyper-Q. We tested our approach on a prostate case (9 beams, coplanar). The measured execution times for a complete ray-tracing range from 28 ms for the computations on the GPU to 99 ms when considering data transfers to and from the graphics device. Our GPU-based algorithm performed the ray-tracing in real time. The strategies efficiently reduce the time consumption of memory accesses and data transfer overhead. The achieved runtimes demonstrate the viability of this approach and allow improved real-time performance for dose calculation methods in clinical routine.

  18. Efficient parallel implementation of active appearance model fitting algorithm on GPU.

    Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou

    2014-01-01

    The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods and has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine-grained parallelism, in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experimental results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures.

  19. Efficient Parallel Implementation of Active Appearance Model Fitting Algorithm on GPU

    Jinwei Wang

    2014-01-01

    The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods and has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine-grained parallelism, in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experimental results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures.

  20. Accelerating Computation of DCM for ERP in MATLAB by External Function Calls to the GPU

    Wang, Wei-Jen; Hsieh, I-Fan; Chen, Chun-Chuan

    2013-01-01

    This study aims to improve the performance of Dynamic Causal Modelling for Event Related Potentials (DCM for ERP) in MATLAB by using external function calls to a graphics processing unit (GPU). DCM for ERP is an advanced method for studying neuronal effective connectivity. DCM utilizes an iterative procedure, the expectation maximization (EM) algorithm, to find the optimal parameters given a set of observations and the underlying probability model. As the EM algorithm is computationally demanding and the analysis faces a possible combinatorial explosion of models to be tested, we propose a parallel computing scheme using the GPU to achieve a fast estimation of DCM for ERP. The computation of DCM for ERP is dynamically partitioned and distributed to threads for parallel processing, according to the DCM model complexity and the hardware constraints. The performance efficiency of this hardware-dependent thread arrangement strategy was evaluated using synthetic data. Experimental data were used to validate the accuracy of the proposed computing scheme and quantify the time saving in practice. The simulation results show that the proposed scheme can accelerate the computation by a factor of 155 for the parallel part. For experimental data, the speedup factor is about 7 per model on average, depending on the model complexity and the data. This GPU-based implementation of DCM for ERP gives qualitatively the same results as the original MATLAB implementation at the group-level analysis. In conclusion, we believe that the proposed GPU-based implementation is very useful for users as a fast screening tool to select the most likely model and may provide implementation guidance for possible future clinical applications such as online diagnosis. PMID:23840507
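
    A gap between a large speedup for the parallel part and a much smaller overall speedup is what Amdahl's law predicts whenever part of the run stays serial. The arithmetic sketch below back-computes the parallel fraction that would reconcile the two figures quoted above. It assumes, which the abstract does not state, that both figures describe the same workload, so the result is illustrative only. Both function names are hypothetical.

```python
def amdahl_speedup(parallel_fraction: float, parallel_speedup: float) -> float:
    """Overall speedup when only a fraction of the runtime is accelerated."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / parallel_speedup)


def implied_parallel_fraction(overall_speedup: float, parallel_speedup: float) -> float:
    """Invert Amdahl's law: the fraction p that yields the overall speedup.

    From 1/S = (1 - p) + p/s it follows that p = (1 - 1/S) / (1 - 1/s).
    """
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / parallel_speedup)


# Figures quoted in the abstract: 155x for the parallel part, about 7x overall.
p = implied_parallel_fraction(overall_speedup=7.0, parallel_speedup=155.0)
print(f"implied parallel fraction: {p:.3f}")
print(f"check, overall speedup:    {amdahl_speedup(p, 155.0):.2f}")
```

    Under that assumption, roughly 86% of the original runtime would sit in the accelerated part, and the remaining serial portion would cap the overall gain.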