tesla gpu cluster: Topics by WorldWideScience.org

Sample records for tesla gpu cluster

Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA Tesla GPU Cluster

Energy Technology Data Exchange (ETDEWEB)

Allada, Veerendra; Benjegerdes, Troy; Bode, Brett

2009-08-31

Commodity clusters augmented with application accelerators are evolving as competitive high performance computing systems. The Graphical Processing Unit (GPU), with a very high arithmetic density and performance per price ratio, is a good platform for scientific application acceleration. In addition to the interconnect bottlenecks among the cluster compute nodes, the cost of memory copies between the host and the GPU device has to be carefully amortized to improve the overall efficiency of the application. Scientific applications also rely on efficient implementation of the Basic Linear Algebra Subroutines (BLAS), among which the General Matrix Multiply (GEMM) is considered the workhorse subroutine. In this paper, the authors study the performance of the memory copies and GEMM subroutines that are critical to porting computational chemistry algorithms to GPU clusters. To that end, a benchmark based on the NetPIPE framework is developed to evaluate the latency and bandwidth of the memory copies between the host and the GPU device. The performance of the single and double precision GEMM subroutines from the NVIDIA CUBLAS 2.0 library is studied. The results have been compared with those of the BLAS routines from the Intel Math Kernel Library (MKL) to understand the computational trade-offs. The test bed is an Intel Xeon cluster equipped with NVIDIA Tesla GPUs.
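The amortization argument in this abstract can be illustrated with a minimal Python sketch. It assumes a simple latency-plus-bandwidth model for each host-device copy and the standard 2·m·n·k operation count for GEMM. The function names and the latency, bandwidth and GFLOP/s figures are illustrative placeholders, not measurements from the paper.

```python
def copy_time(nbytes, latency_s, bandwidth_bytes_per_s):
    """Estimated time for one host<->device copy: fixed latency plus size / bandwidth."""
    return latency_s + nbytes / bandwidth_bytes_per_s


def gemm_flops(m, n, k):
    """Floating-point operations in C = A * B with A (m x k) and B (k x n)."""
    return 2 * m * n * k


def gemm_offload_time(n, elem_bytes, latency_s, bandwidth, gpu_flops_per_s):
    """Return (total time, copy share) for a square n x n GEMM run on the device.

    Three copies are counted: A and B to the device, C back to the host.
    """
    matrix_bytes = n * n * elem_bytes
    transfers = 3 * copy_time(matrix_bytes, latency_s, bandwidth)
    compute = gemm_flops(n, n, n) / gpu_flops_per_s
    total = transfers + compute
    return total, transfers / total


if __name__ == "__main__":
    # Hypothetical figures: 10 us latency, 3 GB/s copies, 300 GFLOP/s single precision.
    for n in (256, 1024, 4096):
        total, copy_share = gemm_offload_time(n, 4, 10e-6, 3e9, 300e9)
        print(f"n={n:5d}  total={total * 1e3:9.3f} ms  copy share={copy_share:.0%}")
```

Because the copy cost grows as n² while the arithmetic grows as n³, the copy share shrinks as the matrices get larger, which is the sense in which the transfers are amortized.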
Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA Tesla GPU Cluster

International Nuclear Information System (INIS)

Allada, Veerendra; Benjegerdes, Troy; Bode, Brett

2009-01-01

Commodity clusters augmented with application accelerators are evolving as competitive high performance computing systems. The Graphical Processing Unit (GPU), with a very high arithmetic density and performance per price ratio, is a good platform for scientific application acceleration. In addition to the interconnect bottlenecks among the cluster compute nodes, the cost of memory copies between the host and the GPU device has to be carefully amortized to improve the overall efficiency of the application. Scientific applications also rely on efficient implementation of the Basic Linear Algebra Subroutines (BLAS), among which the General Matrix Multiply (GEMM) is considered the workhorse subroutine. In this paper, the authors study the performance of the memory copies and GEMM subroutines that are critical to porting computational chemistry algorithms to GPU clusters. To that end, a benchmark based on the NetPIPE framework is developed to evaluate the latency and bandwidth of the memory copies between the host and the GPU device. The performance of the single and double precision GEMM subroutines from the NVIDIA CUBLAS 2.0 library is studied. The results have been compared with those of the BLAS routines from the Intel Math Kernel Library (MKL) to understand the computational trade-offs. The test bed is an Intel Xeon cluster equipped with NVIDIA Tesla GPUs.
Accelerating three-dimensional FDTD calculations on GPU clusters for electromagnetic field simulation.

Science.gov (United States)

Nagaoka, Tomoaki; Watanabe, Soichi

2012-01-01

Electromagnetic simulation with an anatomically realistic computational human model using the finite-difference time domain (FDTD) method has recently been performed in a number of fields in biomedical engineering. To improve the method's calculation speed and realize large-scale computing with the computational human model, we adapt three-dimensional FDTD code to a multi-GPU cluster environment with Compute Unified Device Architecture and Message Passing Interface. Our multi-GPU cluster system consists of three nodes. Seven GPU boards (NVIDIA Tesla C2070) are mounted on each node. We examined the performance of the FDTD calculation in the multi-GPU cluster environment. We confirmed that the FDTD calculation on the multi-GPU cluster is faster than that on a multi-GPU single workstation, and we also found that the GPU cluster system calculates faster than a vector supercomputer. In addition, our GPU cluster system allowed us to perform the large-scale FDTD calculation because we were able to use GPU memory of over 100 GB.
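The abstract does not say how the FDTD grid is divided among GPUs. As a hedged illustration only, the following Python sketch shows a generic one-dimensional slab decomposition with halo exchange, the pattern commonly used when a leapfrog FDTD update is spread over several devices with MPI. All names (`step_single`, `split_cells`, `make_slabs`, `step_slabs`, `gather_E`) are hypothetical, the units are normalized, and plain lists stand in for device memory.

```python
import math


def step_single(E, H, c):
    """One leapfrog step on the whole grid; E[0] and E[-1] stay fixed."""
    n = len(H)
    for i in range(n):
        H[i] += c * (E[i + 1] - E[i])
    for i in range(1, n):
        E[i] += c * (H[i] - H[i - 1])


def split_cells(n_cells, parts):
    """Contiguous (start, stop) cell ranges, one per simulated device (parts <= n_cells)."""
    base, extra = divmod(n_cells, parts)
    bounds, start = [], 0
    for p in range(parts):
        stop = start + base + (1 if p < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def make_slabs(E, H, bounds):
    """Local copies per slab: E covers cells a..b (last entry is a halo),
    H covers cells a-1..b-1 (first entry is a halo)."""
    return [{"a": a, "E": E[a:b + 1], "H": [0.0] + H[a:b]} for a, b in bounds]


def step_slabs(slabs, c):
    """Same leapfrog step, but each slab only touches its own data plus halos."""
    # Halo exchange 1: every slab needs the first E value of its right neighbour.
    for left, right in zip(slabs, slabs[1:]):
        left["E"][-1] = right["E"][0]
    for s in slabs:
        E, H = s["E"], s["H"]
        for j in range(len(E) - 1):
            H[j + 1] += c * (E[j + 1] - E[j])
    # Halo exchange 2: every slab needs the last H value of its left neighbour.
    for left, right in zip(slabs, slabs[1:]):
        right["H"][0] = left["H"][-1]
    for s in slabs:
        E, H = s["E"], s["H"]
        for j in range(len(E) - 1):
            if s["a"] + j == 0:
                continue  # global boundary cell is never updated
            E[j] += c * (H[j + 1] - H[j])


def gather_E(slabs):
    """Reassemble the global E field from the owned (non-halo) entries."""
    out = []
    for s in slabs:
        out.extend(s["E"][:-1])
    out.append(slabs[-1]["E"][-1])
    return out


if __name__ == "__main__":
    n_cells, c = 100, 0.5
    E_ref = [math.exp(-((i - n_cells / 2) / 5.0) ** 2) for i in range(n_cells + 1)]
    E_ref[0] = E_ref[-1] = 0.0
    H_ref = [0.0] * n_cells
    slabs = make_slabs(E_ref, H_ref, split_cells(n_cells, 3))
    for _ in range(50):
        step_single(E_ref, H_ref, c)
        step_slabs(slabs, c)
    print(max(abs(x - y) for x, y in zip(E_ref, gather_E(slabs))))
```

In a real multi-GPU code the two halo exchanges would be MPI messages or device-to-device copies, and only one layer of boundary values per neighbour crosses the interconnect at each step.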
Synergia CUDA: GPU-accelerated accelerator modeling package

International Nuclear Information System (INIS)

Lu, Q; Amundson, J

2014-01-01

Synergia is a parallel, 3-dimensional space-charge particle-in-cell accelerator modeling code. We present our work porting the purely MPI-based version of the code to a hybrid of CPU and GPU computing kernels. The hybrid code uses the CUDA platform in the same framework as the pure MPI solution. We have implemented a lock-free collaborative charge-deposition algorithm for the GPU, as well as other optimizations, including local communication avoidance for GPUs, a customized FFT, and fine-tuned memory access patterns. On a small GPU cluster (up to 4 Tesla C1070 GPUs), our benchmarks exhibit both superior peak performance and better scaling than a CPU cluster with 16 nodes and 128 cores. We also compare the code performance on different GPU architectures, including C1070 Tesla and K20 Kepler.
Parallel hyperbolic PDE simulation on clusters: Cell versus GPU

Science.gov (United States)

Rostrup, Scott; De Sterck, Hans

2010-12-01

Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications. Program summary. Program title: SWsolver. Catalogue identifier: AEGY_v1_0. Program summary URL
Large Scale Simulations of the Euler Equations on GPU Clusters

KAUST Repository

Liebmann, Manfred

2010-08-01

The paper investigates the scalability of a parallel Euler solver, using the Vijayasundaram method, on a GPU cluster with 32 Nvidia Geforce GTX 295 boards. The aim of this research is to enable large scale fluid dynamics simulations with up to one billion elements. We investigate communication protocols for the GPU cluster to compensate for the slow Gigabit Ethernet network between the GPU compute nodes and to maintain overall efficiency. A diesel engine intake-port and a nozzle, meshed in different resolutions, give good real world examples for the scalability tests on the GPU cluster. © 2010 IEEE.
PIConGPU - A highly-scalable particle-in-cell implementation for GPU clusters

Energy Technology Data Exchange (ETDEWEB)

Bussmann, Michael; Burau, Heiko; Debus, Alexander; Huebl, Axel; Kluge, Thomas; Pausch, Richard; Schmeisser, Nils; Steiniger, Klaus; Widera, Rene; Wyderka, Nikolai; Schramm, Ulrich; Cowan, Thomas [HZDR, Dresden (Germany)]; Schneider, Benjamin [HZDR, Dresden (Germany); TU Dresden (Germany)]; Schmitt, Felix [NVIDIA, Austin, TX (United States)]; Grottel, Sebastian; Gumhold, Stefan [TU Dresden (Germany)]; Juckeland, Guido; Nagel, Wolfgang [TU Dresden (Germany); ZIH, Dresden (Germany)]

2013-07-01

PIConGPU can handle large-scale simulations of laser plasma and astrophysical plasma dynamics on GPU clusters with thousands of GPUs. High data throughput makes it possible to conduct large parameter surveys but also makes it necessary to rethink data analysis and look for new ways of analyzing large simulation data sets. The speedup seen on GPUs enables scientists to add physical effects to their code that until recently were too computationally demanding. We present recent results obtained with PIConGPU and discuss scaling behaviour, the most important building blocks of the code, and new physics modules recently added. In addition, we give an outlook on data analysis, resilience and load balancing with PIConGPU.
Large-Scale Multi-Dimensional Document Clustering on GPU Clusters

Energy Technology Data Exchange (ETDEWEB)

Cui, Xiaohui [ORNL]; Mueller, Frank [North Carolina State University]; Zhang, Yongpeng [ORNL]; Potok, Thomas E [ORNL]

2010-01-01

Document clustering plays an important role in data mining systems. Recently, a flocking-based document clustering algorithm has been proposed to solve the problem through simulation resembling the flocking behavior of birds in nature. This method is superior to other clustering algorithms, including k-means, in the sense that the outcome is not sensitive to the initial state. One limitation of this approach is that the algorithmic complexity is inherently quadratic in the number of documents. As a result, execution time becomes a bottleneck with a large number of documents. In this paper, we assess the benefits of exploiting the computational power of Beowulf-like clusters equipped with contemporary Graphics Processing Units (GPUs) as a means to significantly reduce the runtime of flocking-based document clustering. Our framework scales up to over one million documents processed simultaneously in a sixteen-node GPU cluster. Results are also compared to a four-node cluster with higher-end GPUs. On these clusters, we observe 30X-50X speedups, which demonstrates the potential of GPU clusters to efficiently solve massive data mining problems. Such speedups combined with the scalability potential and accelerator-based parallelization are unique in the domain of document-based data mining, to the best of our knowledge.
Graph coarsening and clustering on the GPU

NARCIS (Netherlands)

Fagginger Auer, B.O.; Bisseling, R.H.

2013-01-01

Agglomerative clustering is an effective greedy way to quickly generate graph clusterings of high modularity. In an effort to use the power offered by multi-core CPU and GPU hardware to solve the clustering problem, we introduce a fine-grained shared-memory parallel graph
High-order finite-element seismic wave propagation modeling with MPI on a large GPU cluster

International Nuclear Information System (INIS)

Komatitsch, Dimitri; Erlebacher, Gordon; Goeddeke, Dominik; Michea, David

2010-01-01

We implement a high-order finite-element application, which performs the numerical simulation of seismic wave propagation resulting for instance from earthquakes at the scale of a continent or from active seismic acquisition experiments in the oil industry, on a large cluster of NVIDIA Tesla graphics cards using the CUDA programming environment and non-blocking message passing based on MPI. Contrary to many finite-element implementations, ours is implemented successfully in single precision, maximizing the performance of current generation GPUs. We discuss the implementation and optimization of the code and compare it to an existing highly optimized implementation in C and MPI on a classical cluster of CPU nodes. We use mesh coloring to efficiently handle summation operations over degrees of freedom on an unstructured mesh, and non-blocking MPI messages in order to overlap the communications across the network and the data transfer to and from the device via PCIe with calculations on the GPU. We perform a number of numerical tests to validate the single-precision CUDA and MPI implementation and assess its accuracy. We then analyze performance measurements and, depending on how the problem is mapped to the reference CPU cluster, we obtain a speedup of 20x or 12x.
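The mesh-coloring idea mentioned in this abstract can be sketched in Python as follows. This is a generic greedy coloring, offered as an assumption about the general technique rather than the authors' implementation. Elements that share a mesh node receive different colors, so all elements of one color can add their contributions to the shared nodes at the same time without write conflicts. The function names and the tiny chain mesh are hypothetical.

```python
from collections import defaultdict


def color_elements(elements):
    """Greedy coloring: two elements that share a node never get the same color."""
    node_to_elems = defaultdict(list)
    for e, nodes in enumerate(elements):
        for n in nodes:
            node_to_elems[n].append(e)

    colors = [-1] * len(elements)
    for e, nodes in enumerate(elements):
        # Colors already taken by neighbouring elements (those sharing a node).
        used = {colors[o] for n in nodes for o in node_to_elems[n] if colors[o] >= 0}
        c = 0
        while c in used:
            c += 1
        colors[e] = c
    return colors


def assemble_by_color(elements, colors, elem_values, n_nodes):
    """Sum element contributions into nodes, one color at a time."""
    total = [0.0] * n_nodes
    for c in range(max(colors, default=-1) + 1):
        # Within one color no node is written twice, so on a GPU this inner
        # loop could be executed by concurrent threads without atomics.
        for e, nodes in enumerate(elements):
            if colors[e] == c:
                for n in nodes:
                    total[n] += elem_values[e]
    return total


if __name__ == "__main__":
    # Hypothetical mesh: a chain of four two-node elements.
    mesh = [(0, 1), (1, 2), (2, 3), (3, 4)]
    cols = color_elements(mesh)
    print(cols)
    print(assemble_by_color(mesh, cols, [1.0, 1.0, 1.0, 1.0], 5))
```

The price of this scheme is one kernel launch per color, which is usually acceptable because the number of colors stays small compared with the number of elements.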
Non-parametric co-clustering of large scale sparse bipartite networks on the GPU

DEFF Research Database (Denmark)

Hansen, Toke Jansen; Mørup, Morten; Hansen, Lars Kai

2011-01-01

of row and column clusters from a hypothesis space of an infinite number of clusters. To reach large scale applications of co-clustering we exploit that parameter inference for co-clustering is well suited for parallel computing. We develop a generic GPU framework for efficient inference on large scale...... sparse bipartite networks and achieve a speedup of two orders of magnitude compared to estimation based on conventional CPUs. In terms of scalability we find for networks with more than 100 million links that reliable inference can be achieved in less than an hour on a single GPU. To efficiently manage...
Heterogeneous Gpu&Cpu Cluster For High Performance Computing In Cryptography

Directory of Open Access Journals (Sweden)

Michał Marks

2012-01-01

This paper addresses issues associated with distributed computing systems and the application of mixed GPU&CPU technology to data encryption and decryption algorithms. We describe a heterogeneous cluster HGCC formed by two types of nodes: Intel processor with NVIDIA graphics processing unit and AMD processor with AMD graphics processing unit (formerly ATI), and a novel software framework that hides the heterogeneity of our cluster and provides tools for solving complex scientific and engineering problems. Finally, we present the results of numerical experiments. The considered case study is concerned with parallel implementations of selected cryptanalysis algorithms. The main goal of the paper is to show the wide applicability of the GPU&CPU technology to large scale computation and data processing.
Parallel computing in cluster of GPU applied to a problem of nuclear engineering

International Nuclear Information System (INIS)

Moraes, Sergio Ricardo S.; Heimlich, Adino; Resende, Pedro

2013-01-01

Cluster computing has been widely used as a low-cost alternative for parallel processing in scientific applications. With the Message-Passing Interface (MPI) protocol, development became even more accessible and widespread in the scientific community. A more recent trend is the use of the Graphics Processing Unit (GPU), a powerful co-processor able to perform hundreds of instructions in parallel, reaching hundreds of times the processing capacity of a CPU. However, a standard PC does not, in general, allow more than two GPUs. Hence, this work proposes the development and evaluation of a hybrid low-cost parallel approach to the solution of a typical nuclear engineering problem. The idea is to use cluster parallelism technology (MPI) together with GPU programming techniques (CUDA - Compute Unified Device Architecture) to simulate neutron transport through a slab using the Monte Carlo method. Programs using MPI and CUDA technologies were developed on a cluster comprising four quad-core computers with 2 GPUs each. Experiments applying different configurations, from 1 to 8 GPUs, were performed, and the results were compared with the sequential (non-parallel) version. A speed-up of about 2.000 times has been observed when comparing the 8-GPU version with the sequential version. The results presented here are discussed and analyzed with the objective of outlining gains and possible limitations of the proposed approach. (author)
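A minimal Python sketch of the kind of calculation described here is given below. It is a textbook one-speed slab model with isotropic scattering, chosen as an assumption for illustration; the abstract does not give the authors' physics model or code. Each neutron history is independent, which is why the work splits naturally across MPI ranks and GPU threads. In this sketch each hypothetical worker simply gets its own seed and an equal share of the histories.

```python
import math
import random


def simulate_histories(n, thickness, sigma_t, absorb_prob, seed):
    """Follow n neutrons entering a slab at x = 0, travelling in the +x direction.

    sigma_t is the total macroscopic cross-section (1/length) and absorb_prob
    the probability that a collision absorbs the neutron; otherwise it scatters
    isotropically. Returns (transmitted, reflected, absorbed) counts.
    """
    rng = random.Random(seed)
    transmitted = reflected = absorbed = 0
    for _ in range(n):
        x, mu = 0.0, 1.0  # position and direction cosine
        while True:
            x += mu * rng.expovariate(sigma_t)  # exponential free-flight distance
            if x >= thickness:
                transmitted += 1
                break
            if x < 0.0:
                reflected += 1
                break
            if rng.random() < absorb_prob:
                absorbed += 1
                break
            mu = rng.uniform(-1.0, 1.0)  # isotropic scattering in slab geometry
    return transmitted, reflected, absorbed


def run_partitioned(total, workers, thickness, sigma_t, absorb_prob):
    """Split the histories evenly over independent workers and sum their tallies."""
    per_worker = total // workers
    tallies = [
        simulate_histories(per_worker, thickness, sigma_t, absorb_prob, seed=1000 + w)
        for w in range(workers)
    ]
    return tuple(sum(column) for column in zip(*tallies))


if __name__ == "__main__":
    t, r, a = run_partitioned(80_000, 8, thickness=1.0, sigma_t=1.0, absorb_prob=0.3)
    print(f"transmitted={t} reflected={r} absorbed={a}")
    # Sanity check for a purely absorbing slab, where transmission is exp(-sigma_t * thickness).
    t, _, _ = run_partitioned(80_000, 8, thickness=1.0, sigma_t=1.0, absorb_prob=1.0)
    print(t / 80_000, math.exp(-1.0))
```

The only communication needed is the final reduction of the tallies, so the achievable speed-up is governed mainly by how fast each device can process its own histories.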
Large Scale Simulations of the Euler Equations on GPU Clusters

KAUST Repository

Liebmann, Manfred; Douglas, Craig C.; Haase, Gundolf; Horváth, Zoltán

2010-01-01

The paper investigates the scalability of a parallel Euler solver, using the Vijayasundaram method, on a GPU cluster with 32 Nvidia Geforce GTX 295 boards. The aim of this research is to enable large scale fluid dynamics simulations with up to one
A high performance image processing platform based on CPU-GPU heterogeneous cluster with parallel image reconstructions for micro-CT

International Nuclear Information System (INIS)

Ding Yu; Qi Yujin; Zhang Xuezhu; Zhao Cuilan

2011-01-01

In this paper, we report the development of a high-performance image processing platform based on a CPU-GPU heterogeneous cluster. Currently, it consists of a Dell Precision T7500 and HP XW8600 workstations with a parallel programming and runtime environment that uses the message-passing interface (MPI) and CUDA (Compute Unified Device Architecture). We succeeded in developing parallel image processing techniques for 3D image reconstruction in X-ray micro-CT imaging. The results show that a GPU is about 194 times faster than a single CPU, and that the CPU-GPU cluster is about 46 times faster than the CPU cluster. These results meet the requirements of rapid 3D image reconstruction and real-time image display. In conclusion, a CPU-GPU heterogeneous cluster is an effective way to build a high-performance image processing platform. (authors)
GPU Computing For Particle Tracking

International Nuclear Information System (INIS)

Nishimura, Hiroshi; Song, Kai; Muriki, Krishna; Sun, Changchun; James, Susan; Qin, Yong

2011-01-01

This is a feasibility study of using a modern Graphics Processing Unit (GPU) to parallelize the accelerator particle tracking code. To demonstrate the massive parallelization features provided by GPU computing, a simplified TracyGPU program is developed for dynamic aperture calculation. Performance, issues, and challenges from introducing the GPU are also discussed. General-purpose computation on Graphics Processing Units (GPGPU) brings massively parallel computing capabilities to numerical calculation. However, the unique architecture of the GPU requires a comprehensive understanding of the hardware and programming model in order to optimize existing applications well. In the field of accelerator physics, the dynamic aperture calculation of a storage ring, which is often the most time-consuming part of accelerator modeling and simulation, can benefit from the GPU because it is embarrassingly parallel, which fits well with the GPU programming model. In this paper, we use the Tesla C2050 GPU, which consists of 14 multi-processors (MP) with 32 cores on each MP, for a total of 448 cores, to host thousands of threads dynamically. A thread is a logical execution unit of the program on the GPU. In the GPU programming model, threads are grouped into a collection of blocks. Within each block, multiple threads share the same code and up to 48 KB of shared memory. Multiple thread blocks form a grid, which is executed as a GPU kernel. A simplified code that is a subset of Tracy++ (2) is developed to demonstrate the possibility of using the GPU to speed up the dynamic aperture calculation by having each thread track a particle.
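The grid, block and thread hierarchy described in this abstract can be emulated in plain Python to show how one thread is mapped to one particle. The sketch below is illustrative only: the block size of 256 threads and the function names are assumptions, and the index formula mirrors the usual CUDA expression `blockIdx.x * blockDim.x + threadIdx.x`.

```python
def total_cores(multiprocessors, cores_per_mp):
    """Core count of the device, e.g. 14 multi-processors with 32 cores each."""
    return multiprocessors * cores_per_mp


def launch_config(n_particles, threads_per_block=256):
    """Number of blocks needed so that every particle gets its own thread."""
    blocks = (n_particles + threads_per_block - 1) // threads_per_block  # ceiling division
    return blocks, threads_per_block


def thread_to_particle(blocks, threads_per_block, n_particles):
    """Yield (block, thread, particle) triples for every thread that has work to do."""
    for block in range(blocks):
        for thread in range(threads_per_block):
            particle = block * threads_per_block + thread
            if particle < n_particles:  # threads past the end of the data stay idle
                yield block, thread, particle


if __name__ == "__main__":
    print(total_cores(14, 32))
    blocks, tpb = launch_config(1000)
    print(blocks, tpb)
    print(list(thread_to_particle(blocks, tpb, 1000))[-1])
```

The bounds check matters because the grid is rounded up to whole blocks, so the last block usually contains threads with no particle assigned.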
Aspects of GPU performance in algorithms with random memory access

Science.gov (United States)

Kashkovsky, Alexander V.; Shershnev, Anton A.; Vashchenkov, Pavel V.

2017-10-01

The numerical code for solving the Boltzmann equation on the hybrid computational cluster using the Direct Simulation Monte Carlo (DSMC) method showed that on Tesla K40 accelerators computational performance drops dramatically as the percentage of occupied GPU memory increases. Testing revealed that memory access time increases by a factor of tens after a certain critical percentage of memory is occupied. Moreover, it seems to be a common problem of all NVidia GPUs, arising from their architecture. A few modifications of the numerical algorithm were suggested to overcome this problem. One of them, based on splitting the memory into "virtual" blocks, resulted in a 2.5 times speed-up.
The Performance Improvement of the Lagrangian Particle Dispersion Model (LPDM) Using Graphics Processing Unit (GPU) Computing

Science.gov (United States)

2017-08-01

used for its GPU computing capability during the experiment. It has Nvidia Tesla K40 GPU accelerators containing 32 GPU nodes consisting of 1024...cores. CUDA is a parallel computing platform and application programming interface (API) model that was created and designed by Nvidia to give direct...Agricultural and Forest Meteorology. 1995:76:277–291, ISSN 0168-1923. 3. GPU vs. CPU? What is GPU computing? Santa Clara (CA): Nvidia Corporation; 2017
Noniterative Multireference Coupled Cluster Methods on Heterogeneous CPU-GPU Systems

Energy Technology Data Exchange (ETDEWEB)

Bhaskaran-Nair, Kiran; Ma, Wenjing; Krishnamoorthy, Sriram; Villa, Oreste; van Dam, Hubertus JJ; Apra, Edoardo; Kowalski, Karol

2013-04-09

A novel parallel algorithm for non-iterative multireference coupled cluster (MRCC) theories, which merges recently introduced reference-level parallelism (RLP) [K. Bhaskaran-Nair, J.Brabec, E. Aprà, H.J.J. van Dam, J. Pittner, K. Kowalski, J. Chem. Phys. 137, 094112 (2012)] with the possibility of accelerating numerical calculations using graphics processing unit (GPU) is presented. We discuss the performance of this algorithm on the example of the MRCCSD(T) method (iterative singles and doubles and perturbative triples), where the corrections due to triples are added to the diagonal elements of the MRCCSD (iterative singles and doubles) effective Hamiltonian matrix. The performance of the combined RLP/GPU algorithm is illustrated on the example of the Brillouin-Wigner (BW) and Mukherjee (Mk) state-specific MRCCSD(T) formulations.
NMF-mGPU: non-negative matrix factorization on multi-GPU systems.

Science.gov (United States)

Mejía-Roa, Edgardo; Tabas-Madrid, Daniel; Setoain, Javier; García, Carlos; Tirado, Francisco; Pascual-Montano, Alberto

2015-02-13

In the last few years, the Non-negative Matrix Factorization (NMF) technique has gained great interest in the Bioinformatics community, since it is able to extract interpretable parts from high-dimensional datasets. However, the computing time required to process large data matrices may become impractical, even for a parallel application running on a multiprocessor cluster. In this paper, we present NMF-mGPU, an efficient and easy-to-use implementation of the NMF algorithm that takes advantage of the high computing performance delivered by Graphics Processing Units (GPUs). Driven by the ever-growing demands of the video-games industry, graphics cards usually provided in PCs and laptops have evolved from simple graphics-drawing platforms into high-performance programmable systems that can be used as coprocessors for linear-algebra operations. However, these devices may have a limited amount of on-board memory, which is not considered by other NMF implementations on GPU. NMF-mGPU is based on CUDA (Compute Unified Device Architecture), NVIDIA's framework for GPU computing. On devices with low memory available, large input matrices are blockwise transferred from the system's main memory to the GPU's memory, and processed accordingly. In addition, NMF-mGPU has been explicitly optimized for the different CUDA architectures. Finally, platforms with multiple GPUs can be synchronized through MPI (Message Passing Interface). In a four-GPU system, this implementation is about 120 times faster than a single conventional processor, and more than four times faster than a single GPU device (i.e., a super-linear speedup). Applications of GPUs in Bioinformatics are getting more and more attention due to their outstanding performance when compared to traditional processors. In addition, their relatively low price represents a highly cost-effective alternative to conventional clusters. In life sciences, this results in an excellent opportunity to facilitate the
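The blockwise transfer described in this abstract can be illustrated with a short Python sketch. It is not NMF-mGPU's actual scheme: it only shows the general pattern of cutting a matrix into row blocks that fit a fixed memory budget and accumulating a result block by block. The function names, the 4-byte element size and the column-sum reduction are assumptions chosen to keep the example small.

```python
def row_blocks(n_rows, n_cols, bytes_per_elem, budget_bytes):
    """Yield (start, stop) row ranges whose data fits within budget_bytes."""
    rows_per_block = budget_bytes // (n_cols * bytes_per_elem)
    if rows_per_block < 1:
        raise ValueError("memory budget is too small for even one row")
    for start in range(0, n_rows, rows_per_block):
        yield start, min(start + rows_per_block, n_rows)


def blockwise_column_sums(matrix, budget_bytes, bytes_per_elem=4):
    """Column sums computed one row block at a time."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    sums = [0.0] * n_cols
    for start, stop in row_blocks(n_rows, n_cols, bytes_per_elem, budget_bytes):
        block = matrix[start:stop]  # stands in for one host-to-device copy
        for row in block:
            for j, value in enumerate(row):
                sums[j] += value
    return sums


if __name__ == "__main__":
    data = [[float(i + j) for j in range(3)] for i in range(5)]
    print(list(row_blocks(5, 3, 4, 24)))
    print(blockwise_column_sums(data, budget_bytes=24))
```

Any operation that can be written as an accumulation over row blocks, such as the matrix products inside an NMF update, can be streamed through limited device memory in the same way.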

Fully 3-D list-mode positron emission tomography image reconstruction on a multi-GPU cluster

Energy Technology Data Exchange (ETDEWEB)

Cui, Jingyu [Stanford Univ., CA (United States). Dept. of Electrical Engineering]; Prevrhal, Sven; Shao, Lingxiong [Philips Healthcare, San Jose, CA (United States)]; Pratx, Guillem [Stanford Univ., CA (United States). Dept. of Radiation Oncology]; Levin, Craig S. [Stanford Univ., CA (United States). Dept. of Radiology, Electrical Engineering, and Physics; Stanford Univ., CA (United States). Molecular Imaging Program at Stanford (MIPS); Stanford Univ., CA (United States). School of Medicine]

2011-07-01

List-mode processing is an efficient way of dealing with the sparse nature of PET data sets, and is the processing method of choice for time-of-flight (ToF) PET. We present a novel method of computing line projection operations required for list-mode ordered subsets expectation maximization (OSEM) for fully 3-D PET image reconstruction on a graphics processing unit (GPU) using the compute unified device architecture (CUDA) framework. Our method overcomes challenges such as compute thread divergence, and exploits GPU capabilities such as shared memory and atomic operations. When applied to line projection operations for list-mode time-of-flight PET, this new GPU-CUDA reformulation is 188X faster than a single-threaded reference CPU implementation. When embedded in a multi-process environment on a GPU-equipped small cluster, a speedup of 4X was observed over the same configuration but without GPU support. Image quality is preserved with root mean squared (RMS) deviation of 0.05% between CPU and GPU-generated images, which has negligible effect in typical clinical applications. (orig.)
High-Performance Matrix-Vector Multiplication on the GPU

DEFF Research Database (Denmark)

Sørensen, Hans Henrik Brandenborg

2012-01-01

In this paper, we develop a high-performance GPU kernel for one of the most popular dense linear algebra operations, the matrix-vector multiplication. The target hardware is the most recent Nvidia Tesla 20-series (Fermi architecture), which is designed from the ground up for scientific computing...
GPU-Accelerated Parallel FDTD on Distributed Heterogeneous Platform

Directory of Open Access Journals (Sweden)

Ronglin Jiang

2014-01-01

This paper introduces a finite-difference time-domain (FDTD) code written in Fortran and CUDA for realistic electromagnetic calculations with parallelization methods of Message Passing Interface (MPI) and Open Multiprocessing (OpenMP). Since both Central Processing Unit (CPU) and Graphics Processing Unit (GPU) resources are utilized, a faster execution speed can be reached compared to a traditional pure GPU code. In our experiments, 64 NVIDIA TESLA K20m GPUs and 64 INTEL XEON E5-2670 CPUs are used to carry out the pure CPU, pure GPU, and CPU + GPU tests. Relative to the pure CPU calculations for the same problems, the speedup ratio achieved by CPU + GPU calculations is around 14. Compared to the pure GPU calculations for the same problems, the CPU + GPU calculations have 7.6%–13.2% performance improvement. Because of the small memory size of GPUs, the FDTD problem size is usually very small. However, this code can enlarge the maximum problem size by 25% without reducing the performance of traditional pure GPU code. Finally, using this code, a microstrip antenna array with 16×18 elements is calculated and the radiation patterns are compared with the ones of MoM. Results show that there is good agreement between them.
Comparison of 2 accelerators of Monte Carlo radiation transport calculations, NVIDIA tesla M2090 GPU and Intel Xeon Phi 5110p coprocessor: a case study for X-ray CT Imaging Dose calculation

International Nuclear Information System (INIS)

Liu, T.; Xu, X.G.; Carothers, C.D.

2013-01-01

Hardware accelerators are currently becoming increasingly important in boosting high performance computing systems. In this study, we tested the performance of two accelerator models, NVIDIA Tesla M2090 GPU and Intel Xeon Phi 5110p coprocessor, using a new Monte Carlo photon transport package called ARCHER-CT we have developed for fast CT imaging dose calculation. The package contains three code variants, ARCHER-CT(CPU), ARCHER-CT(GPU) and ARCHER-CT(COP) to run in parallel on the multi-core CPU, GPU and coprocessor architectures respectively. A detailed GE LightSpeed Multi-Detector Computed Tomography (MDCT) scanner model and a family of voxel patient phantoms were included in the code to calculate absorbed dose to radiosensitive organs under specified scan protocols. The results from ARCHER agreed well with those from the production code Monte Carlo N-Particle eXtended (MCNPX). It was found that all the code variants were significantly faster than the parallel MCNPX running on 12 MPI processes, and that the GPU and coprocessor performed equally well, being 2.89-4.49 and 3.01-3.23 times faster than the parallel ARCHER-CT(CPU) running with 12 hyper-threads. (authors)
Implementation of the Lucas-Kanade image registration algorithm on a GPU for 3D computational platform stabilisation

CSIR Research Space (South Africa)

Duvenhage, B

2010-06-01

rate of 15 fps at an image and ROI size of 640 × 480 pixels. This result was measured on an NVidia Tesla C870 GPU with about half as many processor cores as the GeForce GTX285 GPU. Marzat, et al. however estimate that their execution times would...
Multi-GPU accelerated three-dimensional FDTD method for electromagnetic simulation.

Science.gov (United States)

Nagaoka, Tomoaki; Watanabe, Soichi

2011-01-01

Numerical simulation with a numerical human model using the finite-difference time domain (FDTD) method has recently been performed in a number of fields in biomedical engineering. To improve the method's calculation speed and realize large-scale computing with the numerical human model, we adapt three-dimensional FDTD code to a multi-GPU environment using Compute Unified Device Architecture (CUDA). In this study, we used NVIDIA Tesla C2070 as GPGPU boards. The performance of multi-GPU is evaluated in comparison with that of a single GPU and vector supercomputer. The calculation speed with four GPUs was approximately 3.5 times faster than with a single GPU, and was slightly (approx. 1.3 times) slower than with the supercomputer. Calculation speed of the three-dimensional FDTD method using GPUs can significantly improve with an expanding number of GPUs.
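The scaling figure quoted in this abstract (about 3.5 times faster on four GPUs than on one) can be turned into a parallel efficiency with a few lines of Python. The sketch below also computes the Karp-Flatt experimentally determined serial fraction, which is an added diagnostic not used in the abstract itself. The function names are illustrative.

```python
def parallel_efficiency(speedup, n_devices):
    """Fraction of ideal linear scaling that was achieved."""
    return speedup / n_devices


def karp_flatt_serial_fraction(speedup, n_devices):
    """Karp-Flatt metric: the serial fraction implied by a measured speedup."""
    return (1.0 / speedup - 1.0 / n_devices) / (1.0 - 1.0 / n_devices)


if __name__ == "__main__":
    speedup, gpus = 3.5, 4  # values reported in the abstract
    print(f"efficiency      = {parallel_efficiency(speedup, gpus):.3f}")
    print(f"serial fraction = {karp_flatt_serial_fraction(speedup, gpus):.4f}")
```

For these numbers the efficiency is 87.5%, and the implied non-parallel share, which in a multi-GPU FDTD code would include boundary exchange between devices, is a little under 5%.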
PIConGPU - How to build one of the fastest GPU particle-in-cell codes in the world

Energy Technology Data Exchange (ETDEWEB)

Burau, Heiko; Debus, Alexander; Helm, Anton; Huebl, Axel; Kluge, Thomas; Widera, Rene; Bussmann, Michael; Schramm, Ulrich; Cowan, Thomas [HZDR, Dresden (Germany)]; Juckeland, Guido; Nagel, Wolfgang [TU Dresden (Germany); ZIH, Dresden (Germany)]; Schmitt, Felix [NVIDIA (United States)]

2013-07-01

We present the algorithmic building blocks of PIConGPU, one of the fastest implementations of the particle-in-cell algorithm on GPU clusters. PIConGPU is a highly-scalable, 3D3V electromagnetic PIC code that is used in laser plasma and astrophysical plasma simulations.
Validation of GPU based TomoTherapy dose calculation engine.

Science.gov (United States)

Chen, Quan; Lu, Weiguo; Chen, Yu; Chen, Mingli; Henderson, Douglas; Sterpin, Edmond

2012-04-01

The graphic processing unit (GPU) based TomoTherapy convolution/superposition (C/S) dose engine (GPU dose engine) achieves a dramatic performance improvement over the traditional CPU-cluster based TomoTherapy dose engine (CPU dose engine). Besides the architecture difference between the GPU and CPU, there are several algorithm changes from the CPU dose engine to the GPU dose engine. These changes made the GPU dose slightly different from the CPU-cluster dose. Before the GPU dose engine can be released commercially, its accuracy has to be validated. Thirty-eight TomoTherapy phantom plans and 19 patient plans were calculated with both dose engines to evaluate the equivalency between the two dose engines. Gamma indices (Γ) were used for the equivalency evaluation. The GPU dose was further verified against absolute point dose measurements with an ion chamber and against film measurements for phantom plans. Monte Carlo calculation was used as a reference for both dose engines in the accuracy evaluation in heterogeneous phantoms and actual patients. The GPU dose engine showed excellent agreement with the current CPU dose engine. The majority of cases had over 99.99% of voxels with Γ(1%, 1 mm) [...]. The GPU dose engine also showed a similar degree of accuracy in heterogeneous media as the current TomoTherapy dose engine. It is verified and validated that the ultrafast TomoTherapy GPU dose engine can safely replace the existing TomoTherapy cluster based dose engine without degradation in dose accuracy.
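The gamma index used for the equivalency evaluation can be illustrated with a small sketch. The abstract does not say how Γ was computed, so the code below assumes the commonly used definition (the minimum, over all evaluated points, of the combined normalised distance and dose difference), a 1D profile on a uniform grid, a brute-force search, and a pass criterion of Γ ≤ 1. The function name and the sample profiles are hypothetical.

```python
import math


def gamma_index_1d(ref_dose, eval_dose, spacing_mm, dose_tol, dta_mm):
    """Return the gamma value at every reference point of a 1D dose profile.

    ref_dose, eval_dose: dose samples on the same uniform grid
    spacing_mm: grid spacing in millimetres
    dose_tol: absolute dose tolerance (e.g. 1% of the maximum reference dose)
    dta_mm: distance-to-agreement tolerance in millimetres
    """
    gammas = []
    for i, d_ref in enumerate(ref_dose):
        best = math.inf
        # Brute-force search over every evaluated point.
        for j, d_eval in enumerate(eval_dose):
            dist = (j - i) * spacing_mm
            dose_diff = d_eval - d_ref
            value = math.sqrt((dist / dta_mm) ** 2 + (dose_diff / dose_tol) ** 2)
            best = min(best, value)
        gammas.append(best)
    return gammas


# Hypothetical profiles: the second is the first shifted by one 1 mm grid step.
ref = [0.0, 0.5, 1.0, 0.5, 0.0]
shifted = [0.0, 0.0, 0.5, 1.0, 0.5]
tol = 0.01 * max(ref)  # 1% of the maximum reference dose

same = gamma_index_1d(ref, ref, 1.0, tol, 1.0)
moved = gamma_index_1d(ref, shifted, 1.0, tol, 1.0)
pass_rate = sum(g <= 1.0 for g in moved) / len(moved)
```

A Γ(1%, 1 mm) evaluation of a full 3D dose grid follows the same idea with a 3D distance term; production tools usually restrict the search to a neighbourhood of each voxel rather than scanning the whole grid.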
Comparison of two accelerators for Monte Carlo radiation transport calculations, Nvidia Tesla M2090 GPU and Intel Xeon Phi 5110p coprocessor: A case study for X-ray CT imaging dose calculation

International Nuclear Information System (INIS)

Liu, T.; Xu, X.G.; Carothers, C.D.

2015-01-01

Highlights: • A new Monte Carlo photon transport code ARCHER-CT for CT dose calculations is developed to execute on the GPU and coprocessor. • ARCHER-CT is verified against MCNP. • The GPU code on an Nvidia M2090 GPU is 5.15–5.81 times faster than the parallel CPU code on an Intel X5650 6-core CPU. • The coprocessor code on an Intel Xeon Phi 5110p coprocessor is 3.30–3.38 times faster than the CPU code. Abstract: Hardware accelerators are currently becoming increasingly important in boosting high performance computing systems. In this study, we tested the performance of two accelerator models, Nvidia Tesla M2090 GPU and Intel Xeon Phi 5110p coprocessor, using a new Monte Carlo photon transport package called ARCHER-CT we have developed for fast CT imaging dose calculation. The package contains three components, ARCHER-CT_CPU, ARCHER-CT_GPU and ARCHER-CT_COP, designed to be run on the multi-core CPU, GPU and coprocessor architectures respectively. A detailed GE LightSpeed Multi-Detector Computed Tomography (MDCT) scanner model and a family of voxel patient phantoms are included in the code to calculate absorbed dose to radiosensitive organs under user-specified scan protocols. The results from ARCHER agree well with those from the production code Monte Carlo N-Particle eXtended (MCNPX). It is found that all the code components are significantly faster than the parallel MCNPX run on 12 MPI processes, and that the GPU and coprocessor codes are 5.15–5.81 and 3.30–3.38 times faster than the parallel ARCHER-CT_CPU, respectively. The M2090 GPU performs better than the 5110p coprocessor in our specific test. Besides, the heterogeneous computation mode in which the CPU and the hardware accelerator work concurrently can increase the overall performance by 13–18%.
A GPU-based calculation using the three-dimensional FDTD method for electromagnetic field analysis.

Science.gov (United States)

Nagaoka, Tomoaki; Watanabe, Soichi

2010-01-01

Numerical simulations with the numerical human model using the finite-difference time domain (FDTD) method have recently been performed frequently in a number of fields in biomedical engineering. However, the FDTD calculation runs too slowly. We focus, therefore, on general purpose programming on the graphics processing unit (GPGPU). The three-dimensional FDTD method was implemented on the GPU using Compute Unified Device Architecture (CUDA). In this study, we used the NVIDIA Tesla C1060 as a GPGPU board. The performance of the GPU is evaluated in comparison with the performance of a conventional CPU and a vector supercomputer. The results indicate that three-dimensional FDTD calculations using a GPU can significantly reduce run time in comparison with that using a conventional CPU, even a native GPU implementation of the three-dimensional FDTD method, while the GPU/CPU speed ratio varies with the calculation domain and thread block size.
Comparison of Two Accelerators for Monte Carlo Radiation Transport Calculations, NVIDIA Tesla M2090 GPU and Intel Xeon Phi 5110p Coprocessor: A Case Study for X-ray CT Imaging Dose Calculation

Science.gov (United States)

Liu, Tianyu; Xu, X. George; Carothers, Christopher D.

2014-06-01

Hardware accelerators are currently becoming increasingly important in boosting high performance computing systems. In this study, we tested the performance of two accelerator models, NVIDIA Tesla M2090 GPU and Intel Xeon Phi 5110p coprocessor, using a new Monte Carlo photon transport package called ARCHER-CT we have developed for fast CT imaging dose calculation. The package contains three code variants, ARCHER-CT_CPU, ARCHER-CT_GPU and ARCHER-CT_COP, to run in parallel on the multi-core CPU, GPU and coprocessor architectures respectively. A detailed GE LightSpeed Multi-Detector Computed Tomography (MDCT) scanner model and a family of voxel patient phantoms were included in the code to calculate absorbed dose to radiosensitive organs under specified scan protocols. The results from ARCHER agreed well with those from the production code Monte Carlo N-Particle eXtended (MCNPX). It was found that all the code variants were significantly faster than the parallel MCNPX running on 12 MPI processes, and that the GPU and coprocessor performed equally well, being 2.89~4.49 and 3.01~3.23 times faster than the parallel ARCHER-CT_CPU running with 12 hyperthreads.
CAMPAIGN: an open-source library of GPU-accelerated data clustering algorithms.

Science.gov (United States)

Kohlhoff, Kai J; Sosnick, Marc H; Hsu, William T; Pande, Vijay S; Altman, Russ B

2011-08-15

Data clustering techniques are an essential component of a good data analysis toolbox. Many current bioinformatics applications are inherently compute-intense and work with very large datasets. Sequential algorithms are inadequate for providing the necessary performance. For this reason, we have created Clustering Algorithms for Massively Parallel Architectures, Including GPU Nodes (CAMPAIGN), a central resource for data clustering algorithms and tools that are implemented specifically for execution on massively parallel processing architectures. CAMPAIGN is a library of data clustering algorithms and tools, written in 'C for CUDA' for Nvidia GPUs. The library provides up to two orders of magnitude speed-up over respective CPU-based clustering algorithms and is intended as an open-source resource. New modules from the community will be accepted into the library, and its layout is such that it can easily be extended to promising future platforms such as OpenCL. Releases of the CAMPAIGN library are freely available for download under the LGPL from https://simtk.org/home/campaign. Source code can also be obtained through anonymous subversion access as described on https://simtk.org/scm/?group_id=453. Contact: kjk33@cantab.net.
A Parallel Algebraic Multigrid Solver on Graphics Processing Units

KAUST Repository

Haase, Gundolf; Liebmann, Manfred; Douglas, Craig C.; Plank, Gernot

2010-01-01

-vector multiplication scheme underlying the PCG-AMG algorithm is presented for the many-core GPU architecture. A performance comparison of the parallel solver shows that a single Nvidia Tesla C1060 GPU board delivers the performance of a sixteen node Infiniband cluster
Accelerating image reconstruction in dual-head PET system by GPU and symmetry properties.

Directory of Open Access Journals (Sweden)

Cheng-Ying Chou

Positron emission tomography (PET) is an important imaging modality in both clinical usage and research studies. We have developed a compact high-sensitivity PET system that consists of two large-area panel PET detector heads, which produce more than 224 million lines of response and thus impose heavy computational demands. In this work, we employed a state-of-the-art graphics processing unit (GPU), the NVIDIA Tesla C2070, to yield an efficient reconstruction process. Our approaches integrate the symmetry properties of the imaging system with features of the GPU architecture, including block/warp/thread assignments and effective memory usage, to accelerate the computations for ordered subset expectation maximization (OSEM) image reconstruction. The OSEM reconstruction algorithms were implemented in both CPU-based and GPU-based codes, and their computational performance was quantitatively analyzed and compared. The results showed that the GPU-accelerated scheme can drastically reduce the reconstruction time and thus can largely expand the applicability of the dual-head PET system.
Design Patterns for Sparse-Matrix Computations on Hybrid CPU/GPU Platforms

Directory of Open Access Journals (Sweden)

Valeria Cardellini

2014-01-01

We apply object-oriented software design patterns to develop code for scientific software involving sparse matrices. Design patterns arise when multiple independent developments produce similar designs which converge onto a generic solution. We demonstrate how to use design patterns to implement an interface for sparse matrix computations on NVIDIA GPUs starting from PSBLAS, an existing sparse matrix library, and from existing sets of GPU kernels for sparse matrices. We also compare the throughput of the PSBLAS sparse matrix–vector multiplication on two platforms exploiting the GPU with that obtained by a CPU-only PSBLAS implementation. Our experiments exhibit encouraging results regarding the comparison between CPU and GPU executions in double precision, obtaining a speedup of up to 35.35 on NVIDIA GTX 285 with respect to AMD Athlon 7750, and up to 10.15 on NVIDIA Tesla C2050 with respect to Intel Xeon X5650.
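The sparse matrix-vector multiplication whose throughput is compared above can be sketched in a few lines. This is not the PSBLAS interface; it is a plain-Python illustration that assumes the compressed sparse row (CSR) storage format, and the function name `csr_matvec` and the sample matrix are hypothetical.

```python
def csr_matvec(values, col_indices, row_ptr, x):
    """Multiply a CSR matrix by a dense vector x and return the result as a list."""
    y = []
    for row in range(len(row_ptr) - 1):
        total = 0.0
        # Non-zeros of this row occupy positions row_ptr[row] .. row_ptr[row + 1] - 1.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            total += values[k] * x[col_indices[k]]
        y.append(total)
    return y


# 3x3 example matrix:
# [[4, 0, 1],
#  [0, 3, 0],
#  [2, 0, 5]]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
col_indices = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]

y = csr_matvec(values, col_indices, row_ptr, [1.0, 2.0, 3.0])
print(y)  # [7.0, 6.0, 17.0]
```

Each output row is independent of the others, which is why the operation maps naturally onto many GPU threads; the irregular access to `x` through `col_indices` is the part that typically limits throughput.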
GPU-accelerated denoising of 3D magnetic resonance images

Energy Technology Data Exchange (ETDEWEB)

Howison, Mark; Wes Bethel, E.

2014-05-29

The raw computational power of GPU accelerators enables fast denoising of 3D MR images using bilateral filtering, anisotropic diffusion, and non-local means. In practice, applying these filtering operations requires setting multiple parameters. This study was designed to provide better guidance to practitioners for choosing the most appropriate parameters by answering two questions: what parameters yield the best denoising results in practice? And what tuning is necessary to achieve optimal performance on a modern GPU? To answer the first question, we use two different metrics, mean squared error (MSE) and mean structural similarity (MSSIM), to compare denoising quality against a reference image. Surprisingly, the best improvement in structural similarity with the bilateral filter is achieved with a small stencil size that lies within the range of real-time execution on an NVIDIA Tesla M2050 GPU. Moreover, inappropriate choices for parameters, especially scaling parameters, can yield very poor denoising performance. To answer the second question, we perform an autotuning study to empirically determine optimal memory tiling on the GPU. The variation in these results suggests that such tuning is an essential step in achieving real-time performance. These results have important implications for the real-time application of denoising to MR images in clinical settings that require fast turn-around times.
GPU accelerated simulations of 3D deterministic particle transport using discrete ordinates method

International Nuclear Information System (INIS)

Gong Chunye; Liu Jie; Chi Lihua; Huang Haowei; Fang Jingyue; Gong Zhenghu

2011-01-01

The Graphics Processing Unit (GPU), originally developed for real-time, high-definition 3D graphics in computer games, now provides great capability for solving scientific applications. The basis of particle transport simulation is the time-dependent, multi-group, inhomogeneous Boltzmann transport equation. The numerical solution to the Boltzmann equation involves the discrete ordinates (S_n) method and the procedure of source iteration. In this paper, we present a GPU accelerated simulation of one energy group time-independent deterministic discrete ordinates particle transport in 3D Cartesian geometry (Sweep3D). The performance of the GPU simulations is reported for simulations with vacuum boundary conditions. The relative advantages and disadvantages of the GPU implementation, simulation on multiple GPUs, the programming effort and code portability are also discussed. The results show that the overall performance speedup of one NVIDIA Tesla M2050 GPU ranges from 2.56 compared with one Intel Xeon X5670 chip to 8.14 compared with one Intel Core Q6600 chip for no flux fixup. The simulation with flux fixup on one M2050 is 1.23 times faster than on one X5670.
GPU accelerated simulations of 3D deterministic particle transport using discrete ordinates method

Science.gov (United States)

Gong, Chunye; Liu, Jie; Chi, Lihua; Huang, Haowei; Fang, Jingyue; Gong, Zhenghu

2011-07-01

The Graphics Processing Unit (GPU), originally developed for real-time, high-definition 3D graphics in computer games, now provides great capability for solving scientific applications. The basis of particle transport simulation is the time-dependent, multi-group, inhomogeneous Boltzmann transport equation. The numerical solution to the Boltzmann equation involves the discrete ordinates (S_n) method and the procedure of source iteration. In this paper, we present a GPU accelerated simulation of one energy group time-independent deterministic discrete ordinates particle transport in 3D Cartesian geometry (Sweep3D). The performance of the GPU simulations is reported for simulations with vacuum boundary conditions. The relative advantages and disadvantages of the GPU implementation, simulation on multiple GPUs, the programming effort and code portability are also discussed. The results show that the overall performance speedup of one NVIDIA Tesla M2050 GPU ranges from 2.56 compared with one Intel Xeon X5670 chip to 8.14 compared with one Intel Core Q6600 chip for no flux fixup. The simulation with flux fixup on one M2050 is 1.23 times faster than on one X5670.
A GPU-accelerated semi-implicit fractional-step method for numerical solutions of incompressible Navier-Stokes equations

Science.gov (United States)

Ha, Sanghyun; Park, Junshin; You, Donghyun

2018-01-01

Utility of the computational power of Graphics Processing Units (GPUs) is elaborated for solutions of incompressible Navier-Stokes equations which are integrated using a semi-implicit fractional-step method. The Alternating Direction Implicit (ADI) and the Fourier-transform-based direct solution methods used in the semi-implicit fractional-step method take advantage of multiple tridiagonal matrices whose inversion is known as the major bottleneck for acceleration on a typical multi-core machine. A novel implementation of the semi-implicit fractional-step method designed for GPU acceleration of the incompressible Navier-Stokes equations is presented. Aspects of the programing model of Compute Unified Device Architecture (CUDA), which are critical to the bandwidth-bound nature of the present method are discussed in detail. A data layout for efficient use of CUDA libraries is proposed for acceleration of tridiagonal matrix inversion and fast Fourier transform. OpenMP is employed for concurrent collection of turbulence statistics on a CPU while the Navier-Stokes equations are computed on a GPU. Performance of the present method using CUDA is assessed by comparing the speed of solving three tridiagonal matrices using ADI with the speed of solving one heptadiagonal matrix using a conjugate gradient method. An overall speedup of 20 times is achieved using a Tesla K40 GPU in comparison with a single-core Xeon E5-2660 v3 CPU in simulations of turbulent boundary-layer flow over a flat plate conducted on over 134 million grids. Enhanced performance of 48 times speedup is reached for the same problem using a Tesla P100 GPU.
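The tridiagonal systems mentioned above can be solved sequentially with the Thomas algorithm, sketched here for illustration. This is a textbook serial method, not the CUDA library routine used in the paper; the function name and the sample system are hypothetical, and no pivoting is performed, so a diagonally dominant matrix is assumed.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm.

    lower[i] multiplies x[i-1] in row i (lower[0] is unused),
    diag[i] multiplies x[i], upper[i] multiplies x[i+1] (upper[-1] is unused).
    No pivoting is done, so the matrix should be diagonally dominant.
    """
    n = len(diag)
    c_prime = [0.0] * n
    d_prime = [0.0] * n

    # Forward elimination: remove the sub-diagonal row by row.
    c_prime[0] = upper[0] / diag[0]
    d_prime[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c_prime[i - 1]
        c_prime[i] = upper[i] / denom
        d_prime[i] = (rhs[i] - lower[i] * d_prime[i - 1]) / denom

    # Back substitution from the last unknown upwards.
    x = [0.0] * n
    x[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d_prime[i] - c_prime[i] * x[i + 1]
    return x


# System: 2*x0 - x1 = 1, -x0 + 2*x1 - x2 = 0, -x1 + 2*x2 = 1; the solution is [1, 1, 1].
x = solve_tridiagonal([0.0, -1.0, -1.0], [2.0, 2.0, 2.0], [-1.0, -1.0, 0.0], [1.0, 0.0, 1.0])
```

Both loops carry a dependency from one row to the next, which illustrates why a single tridiagonal solve is hard to parallelise and why ADI codes instead rely on the many independent systems available along each grid line.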
Efficient computation of k-Nearest Neighbour Graphs for large high-dimensional data sets on GPU clusters.

Directory of Open Access Journals (Sweden)

Ali Dashti

This paper presents an implementation of the brute-force exact k-Nearest Neighbor Graph (k-NNG) construction for ultra-large high-dimensional data clouds. The proposed method uses Graphics Processing Units (GPUs) and is scalable with multiple levels of parallelism (between nodes of a cluster, between different GPUs on a single node, and within a GPU). The method is applicable to homogeneous computing clusters with a varying number of nodes and GPUs per node. We achieve a 6-fold speedup in data processing as compared with an optimized method running on a cluster of CPUs and bring a hitherto impossible k-NNG generation for a dataset of twenty million images with 15 k dimensionality into the realm of practical possibility.
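The brute-force exact k-NNG construction described above amounts to comparing every point with every other point. The sketch below is a single-threaded illustration only; it assumes squared Euclidean distance and breaks ties by lower index, and the function name `knn_graph` and the sample points are hypothetical rather than taken from the paper.

```python
import heapq


def knn_graph(points, k):
    """Return, for each point, the indices of its k nearest other points.

    Squared Euclidean distance is assumed; ties are broken by lower index.
    """
    graph = []
    for i, p in enumerate(points):
        candidates = []
        for j, q in enumerate(points):
            if i == j:
                continue  # a point is not its own neighbour
            dist2 = sum((a - b) ** 2 for a, b in zip(p, q))
            candidates.append((dist2, j))
        graph.append([j for _, j in heapq.nsmallest(k, candidates)])
    return graph


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
graph = knn_graph(points, 2)
print(graph)  # [[1, 2], [0, 2], [0, 1], [2, 1]]
```

The outer loop over query points is independent per point, which is what allows the work to be split across cluster nodes, across GPUs within a node, and across threads within a GPU.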

Fully accelerating quantum Monte Carlo simulations of real materials on GPU clusters

Science.gov (United States)

Esler, Kenneth

2011-03-01

Quantum Monte Carlo (QMC) has proved to be an invaluable tool for predicting the properties of matter from fundamental principles, combining very high accuracy with extreme parallel scalability. By solving the many-body Schrödinger equation through a stochastic projection, it achieves greater accuracy than mean-field methods and better scaling with system size than quantum chemical methods, enabling scientific discovery across a broad spectrum of disciplines. In recent years, graphics processing units (GPUs) have provided a high-performance and low-cost new approach to scientific computing, and GPU-based supercomputers are now among the fastest in the world. The multiple forms of parallelism afforded by QMC algorithms make the method an ideal candidate for acceleration in the many-core paradigm. We present the results of porting the QMCPACK code to run on GPU clusters using the NVIDIA CUDA platform. Using mixed precision on GPUs and MPI for intercommunication, we observe typical full-application speedups of approximately 10x to 15x relative to quad-core CPUs alone, while reproducing the double-precision CPU results within statistical error. We discuss the algorithm modifications necessary to achieve good performance on this heterogeneous architecture and present the results of applying our code to molecules and bulk materials. Supported by the U.S. DOE under Contract No. DOE-DE-FG05-08OR23336 and by the NSF under No. 0904572.
Embedded-Based Graphics Processing Unit Cluster Platform for Multiple Sequence Alignments

Directory of Open Access Journals (Sweden)

Jyh-Da Wei

2017-08-01

High-end graphics processing units (GPUs), such as NVIDIA Tesla/Fermi/Kepler series cards with thousands of cores per chip, are widely applied to high-performance computing fields in a decade. These desktop GPU cards should be installed in personal computers/servers with desktop CPUs, and the cost and power consumption of constructing a GPU cluster platform are very high. In recent years, NVIDIA releases an embedded board, called Jetson Tegra K1 (TK1), which contains 4 ARM Cortex-A15 CPUs and 192 Compute Unified Device Architecture cores (belong to Kepler GPUs). Jetson Tegra K1 has several advantages, such as the low cost, low power consumption, and high applicability, and it has been applied into several specific applications. In our previous work, a bioinformatics platform with a single TK1 (STK platform) was constructed, and this previous work is also used to prove that the Web and mobile services can be implemented in the STK platform with a good cost-performance ratio by comparing a STK platform with the desktop CPU and GPU. In this work, an embedded-based GPU cluster platform will be constructed with multiple TK1s (MTK platform). Complex system installation and setup are necessary procedures at first. Then, 2 job assignment modes are designed for the MTK platform to provide services for users. Finally, ClustalW v2.0.11 and ClustalWtk will be ported to the MTK platform. The experimental results showed that the speedup ratios achieved 5.5 and 4.8 times for ClustalW v2.0.11 and ClustalWtk, respectively, by comparing 6 TK1s with a single TK1. The MTK platform is proven to be useful for multiple sequence alignments.
Embedded-Based Graphics Processing Unit Cluster Platform for Multiple Sequence Alignments.

Science.gov (United States)

Wei, Jyh-Da; Cheng, Hui-Jun; Lin, Chun-Yuan; Ye, Jin; Yeh, Kuan-Yu

2017-01-01

High-end graphics processing units (GPUs), such as NVIDIA Tesla/Fermi/Kepler series cards with thousands of cores per chip, are widely applied to high-performance computing fields in a decade. These desktop GPU cards should be installed in personal computers/servers with desktop CPUs, and the cost and power consumption of constructing a GPU cluster platform are very high. In recent years, NVIDIA releases an embedded board, called Jetson Tegra K1 (TK1), which contains 4 ARM Cortex-A15 CPUs and 192 Compute Unified Device Architecture cores (belong to Kepler GPUs). Jetson Tegra K1 has several advantages, such as the low cost, low power consumption, and high applicability, and it has been applied into several specific applications. In our previous work, a bioinformatics platform with a single TK1 (STK platform) was constructed, and this previous work is also used to prove that the Web and mobile services can be implemented in the STK platform with a good cost-performance ratio by comparing a STK platform with the desktop CPU and GPU. In this work, an embedded-based GPU cluster platform will be constructed with multiple TK1s (MTK platform). Complex system installation and setup are necessary procedures at first. Then, 2 job assignment modes are designed for the MTK platform to provide services for users. Finally, ClustalW v2.0.11 and ClustalWtk will be ported to the MTK platform. The experimental results showed that the speedup ratios achieved 5.5 and 4.8 times for ClustalW v2.0.11 and ClustalWtk, respectively, by comparing 6 TK1s with a single TK1. The MTK platform is proven to be useful for multiple sequence alignments.
A GPU Parallelization of the Absolute Nodal Coordinate Formulation for Applications in Flexible Multibody Dynamics

Science.gov (United States)

2012-02-17

to be solved. Disclaimer: Reference herein to any specific commercial company , product, process, or service by trade name, trademark...data processing rather than data caching and control flow. To make use of this computational power, NVIDIA introduced a general purpose parallel...GPU implementations were run on an Intel Nehalem Xeon E5520 2.26GHz processor with an NVIDIA Tesla C2070 graphics card for varying numbers of
GASPRNG: GPU accelerated scalable parallel random number generator library

Science.gov (United States)

Gao, Shuang; Peterson, Gregory D.

2013-04-01

workstation with NVIDIA GPU (Tested on Fermi GTX480, Tesla C1060, Tesla M2070). Operating system: Linux with CUDA version 4.0 or later. Should also run on MacOS, Windows, or UNIX. Has the code been vectorized or parallelized?: Yes. Parallelized using MPI directives. RAM: 512 MB ~ 732 MB (main memory on host CPU, depending on the data type of random numbers) / 512 MB (GPU global memory). Classification: 4.13, 6.5. Nature of problem: Many computational science applications are able to consume large numbers of random numbers. For example, Monte Carlo simulations are able to consume limitless random numbers for the computation as long as resources for the computing are supported. Moreover, parallel computational science applications require independent streams of random numbers to attain statistically significant results. The SPRNG library provides this capability, but at a significant computational cost. The GASPRNG library presented here accelerates the generators of independent streams of random numbers using graphical processing units (GPUs). Solution method: Multiple copies of random number generators in GPUs allow a computational science application to consume large numbers of random numbers from independent, parallel streams. GASPRNG is a random number generators library to allow a computational science application to employ multiple copies of random number generators to boost performance. Users can interface GASPRNG with software code executing on microprocessors and/or GPUs. Running time: The tests provided take a few minutes to run.
CPU/GPU Computing for an Implicit Multi-Block Compressible Navier-Stokes Solver on Heterogeneous Platform

Science.gov (United States)

Deng, Liang; Bai, Hanli; Wang, Fang; Xu, Qingxin

2016-06-01

CPU/GPU computing allows scientists to tremendously accelerate their numerical codes. In this paper, we port and optimize a double precision alternating direction implicit (ADI) solver for three-dimensional compressible Navier-Stokes equations from our in-house Computational Fluid Dynamics (CFD) software on heterogeneous platform. First, we implement a full GPU version of the ADI solver to remove a lot of redundant data transfers between CPU and GPU, and then design two fine-grain schemes, namely “one-thread-one-point” and “one-thread-one-line”, to maximize the performance. Second, we present a dual-level parallelization scheme using the CPU/GPU collaborative model to exploit the computational resources of both multi-core CPUs and many-core GPUs within the heterogeneous platform. Finally, considering the fact that memory on a single node becomes inadequate when the simulation size grows, we present a tri-level hybrid programming pattern MPI-OpenMP-CUDA that merges fine-grain parallelism using OpenMP and CUDA threads with coarse-grain parallelism using MPI for inter-node communication. We also propose a strategy to overlap the computation with communication using the advanced features of CUDA and MPI programming. We obtain speedups of 6.0 for the ADI solver on one Tesla M2050 GPU in contrast to two Xeon X5670 CPUs. Scalability tests show that our implementation can offer significant performance improvement on heterogeneous platform.
GPU accelerated flow solver for direct numerical simulation of turbulent flows

Energy Technology Data Exchange (ETDEWEB)

Salvadore, Francesco [CASPUR – via dei Tizii 6/b, 00185 Rome (Italy); Bernardini, Matteo, E-mail: matteo.bernardini@uniroma1.it [Department of Mechanical and Aerospace Engineering, University of Rome ‘La Sapienza’ – via Eudossiana 18, 00184 Rome (Italy); Botti, Michela [CASPUR – via dei Tizii 6/b, 00185 Rome (Italy)

2013-02-15

Graphical processing units (GPUs), characterized by significant computing performance, are nowadays very appealing for the solution of computationally demanding tasks in a wide variety of scientific applications. However, to run on GPUs, existing codes need to be ported and optimized, a procedure which is not yet standardized and may require non trivial efforts, even to high-performance computing specialists. In the present paper we accurately describe the porting to CUDA (Compute Unified Device Architecture) of a finite-difference compressible Navier–Stokes solver, suitable for direct numerical simulation (DNS) of turbulent flows. Porting and validation processes are illustrated in detail, with emphasis on computational strategies and techniques that can be applied to overcome typical bottlenecks arising from the porting of common computational fluid dynamics solvers. We demonstrate that a careful optimization work is crucial to get the highest performance from GPU accelerators. The results show that the overall speedup of one NVIDIA Tesla S2070 GPU is approximately 22 compared with one AMD Opteron 2352 Barcelona chip and 11 compared with one Intel Xeon X5650 Westmere core. The potential of GPU devices in the simulation of unsteady three-dimensional turbulent flows is proved by performing a DNS of a spatially evolving compressible mixing layer.
A Parallel Algebraic Multigrid Solver on Graphics Processing Units

KAUST Repository

Haase, Gundolf

2010-01-01

The paper presents a multi-GPU implementation of the preconditioned conjugate gradient algorithm with an algebraic multigrid preconditioner (PCG-AMG) for an elliptic model problem on a 3D unstructured grid. An efficient parallel sparse matrix-vector multiplication scheme underlying the PCG-AMG algorithm is presented for the many-core GPU architecture. A performance comparison of the parallel solver shows that a single Nvidia Tesla C1060 GPU board delivers the performance of a sixteen node Infiniband cluster and a multi-GPU configuration with eight GPUs is about 100 times faster than a typical server CPU core. © 2010 Springer-Verlag.
Employing multi-GPU power for molecular dynamics simulation: an extension of GALAMOST

Science.gov (United States)

Zhu, You-Liang; Pan, Deng; Li, Zhan-Wei; Liu, Hong; Qian, Hu-Jun; Zhao, Yang; Lu, Zhong-Yuan; Sun, Zhao-Yan

2018-04-01

We describe the algorithm of employing multi-GPU power on the basis of Message Passing Interface (MPI) domain decomposition in a molecular dynamics code, GALAMOST, which is designed for the coarse-grained simulation of soft matters. The code of multi-GPU version is developed based on our previous single-GPU version. In multi-GPU runs, one GPU takes charge of one domain and runs single-GPU code path. The communication between neighbouring domains takes a similar algorithm of CPU-based code of LAMMPS, but is optimised specifically for GPUs. We employ a memory-saving design which can enlarge maximum system size at the same device condition. An optimisation algorithm is employed to prolong the update period of neighbour list. We demonstrate good performance of multi-GPU runs on the simulation of Lennard-Jones liquid, dissipative particle dynamics liquid, polymer and nanoparticle composite, and two-patch particles on workstation. A good scaling of many nodes on cluster for two-patch particles is presented.
A multi-GPU implementation of a D2Q37 lattice Boltzmann code

NARCIS (Netherlands)

Biferale, L.; Mantovani, F.; Pivanti, M.; Pozzati, F.; Sbragaglia, M.; Scagliarini, Andrea; Schifano, S.F.; Toschi, F.; Tripiccione, R.; Wyrzykowski, R.; Dongarra, J.; Karczewski, K.; Wasniewski, J.

2012-01-01

We describe a parallel implementation of a compressible Lattice Boltzmann code on a multi-GPU cluster based on Nvidia Fermi processors. We analyze how to optimize the algorithm for GP-GPU architectures, describe the implementation choices that we have adopted and compare our performance results with
Distributed GPU Computing in GIScience

Science.gov (United States)

Jiang, Y.; Yang, C.; Huang, Q.; Li, J.; Sun, M.

2013-12-01

Geoscientists strive to discover potential principles and patterns hidden inside ever-growing Big Data for scientific discoveries. To better achieve this objective, more capable computing resources are required to process, analyze and visualize Big Data (Ferreira et al., 2003; Li et al., 2013). Current CPU-based computing techniques cannot promptly meet the computing challenges caused by the increasing amount of datasets from different domains, such as social media, earth observation, and environmental sensing (Li et al., 2013). Meanwhile, CPU-based computing resources structured as a cluster or supercomputer are costly. In the past several years, as GPU-based technology has matured in both capability and performance, GPU-based computing has emerged as a new computing paradigm. Compared to the traditional microprocessor, the modern GPU, as a compelling alternative microprocessor, has outstanding parallel processing capability with cost-effectiveness and efficiency (Owens et al., 2008), although it was initially designed for graphical rendering in the visualization pipeline. This presentation reports a distributed GPU computing framework for integrating GPU-based computing within a distributed environment. Within this framework, 1) for each single computer, both GPU-based and CPU-based computing resources can be fully utilized to improve the performance of visualizing and processing Big Data; 2) within a network environment, a variety of computers can be used to build up a virtual supercomputer to support CPU-based and GPU-based computing in a distributed computing environment; 3) GPUs, as a specific graphic targeted device, are used to greatly improve the rendering efficiency in distributed geo-visualization, especially for 3D/4D visualization. Key words: Geovisualization, GIScience, Spatiotemporal Studies. Reference: 1. Ferreira de Oliveira, M. C., & Levkowitz, H. (2003). From visual data exploration to visual data mining: A survey. Visualization and Computer Graphics, IEEE
Solving global optimization problems on GPU cluster

Energy Technology Data Exchange (ETDEWEB)

Barkalov, Konstantin; Gergel, Victor; Lebedev, Ilya [Lobachevsky State University of Nizhni Novgorod, Gagarin Avenue 23, 603950 Nizhni Novgorod (Russian Federation)]

2016-06-08

The paper presents the results of an investigation of a parallel global optimization algorithm combined with a dimension reduction scheme. The scheme allows multidimensional problems to be solved by reducing them to data-independent subproblems of smaller dimension that are solved in parallel. The new element implemented in the research is the use of several graphics accelerators at different computing nodes. The paper also includes results of solving problems from the well-known multiextremal GKLS test class on the Lobachevsky supercomputer using tens of thousands of GPU cores.
RGCA: A Reliable GPU Cluster Architecture for Large-Scale Internet of Things Computing Based on Effective Performance-Energy Optimization.

Science.gov (United States)

Fang, Yuling; Chen, Qingkui; Xiong, Neal N; Zhao, Deyu; Wang, Jingjuan

2017-08-04

This paper aims to develop a low-cost, high-performance and high-reliability computing system to process large-scale data using common data mining algorithms in the Internet of Things (IoT) computing environment. Considering the characteristics of IoT data processing, similar to mainstream high performance computing, we use a GPU (Graphics Processing Unit) cluster to achieve better IoT services. Firstly, we present an energy consumption calculation method (ECCM) based on WSNs. Then, using the CUDA (Compute Unified Device Architecture) programming model, we propose a Two-level Parallel Optimization Model (TLPOM) which exploits reasonable resource planning and common compiler optimization techniques to obtain the best blocks and threads configuration considering the resource constraints of each node. The key to this part is dynamically coupling Thread-Level Parallelism (TLP) and Instruction-Level Parallelism (ILP) to improve the performance of the algorithms without additional energy consumption. Finally, combining the ECCM and the TLPOM, we use the Reliable GPU Cluster Architecture (RGCA) to obtain a high-reliability computing system considering the nodes' diversity, algorithm characteristics, etc. The results show that, with TLPOM, the performance of the algorithms increased on average by 34.1%, 33.96% and 24.07% for Fermi, Kepler and Maxwell respectively, and that the RGCA ensures that our IoT computing system provides low-cost and high-reliability services.
Development of efficient GPU parallelization of WRF Yonsei University planetary boundary layer scheme

Directory of Open Access Journals (Sweden)

M. Huang

2015-09-01

The planetary boundary layer (PBL) is the lowest part of the atmosphere, and its character is directly affected by its contact with the underlying planetary surface. The PBL is responsible for vertical sub-grid-scale fluxes due to eddy transport in the whole atmospheric column. It determines the flux profiles within the well-mixed boundary layer and the more stable layer above. It thus provides an evolutionary model of atmospheric temperature, moisture (including clouds), and horizontal momentum in the entire atmospheric column. For such purposes, several PBL models have been proposed and employed in the weather research and forecasting (WRF) model, of which the Yonsei University (YSU) scheme is one. To expedite weather research and prediction, we have put tremendous effort into developing an accelerated implementation of the entire WRF model using graphics processing unit (GPU) massive parallel computing architecture whilst maintaining its accuracy as compared to its central processing unit (CPU)-based implementation. This paper presents our efficient GPU-based design of the WRF YSU PBL scheme. Using one NVIDIA Tesla K40 GPU, the GPU-based YSU PBL scheme achieves a speedup of 193× with respect to its CPU counterpart running on one CPU core, whereas the speedup for one CPU socket (4 cores) with respect to 1 CPU core is only 3.5×. The speedup rises to 360× with respect to 1 CPU core when two K40 GPUs are used.
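The speedup figures quoted in this abstract imply scaling efficiencies that a short calculation makes explicit. The following is illustrative arithmetic only, not code from the paper; the helper name `parallel_efficiency` is hypothetical, and the only inputs are the three speedups stated above.

```python
def parallel_efficiency(speedup: float, units: int) -> float:
    """Fraction of ideal linear scaling achieved by `units` processing units."""
    return speedup / units


# Speedups quoted in the abstract, all relative to one CPU core.
cpu_socket_speedup = 3.5   # one CPU socket (4 cores)
one_gpu_speedup = 193.0    # one Tesla K40
two_gpu_speedup = 360.0    # two Tesla K40s

# How close the 4-core CPU run comes to a perfect 4x.
cpu_efficiency = parallel_efficiency(cpu_socket_speedup, 4)

# Gain from adding the second GPU, and how close that is to a perfect 2x.
gpu_scaling = two_gpu_speedup / one_gpu_speedup
gpu_efficiency = parallel_efficiency(gpu_scaling, 2)

# One GPU compared with the whole 4-core socket rather than a single core.
gpu_vs_socket = one_gpu_speedup / cpu_socket_speedup

print(f"4-core CPU efficiency:       {cpu_efficiency:.1%}")
print(f"2-GPU scaling efficiency:    {gpu_efficiency:.1%}")
print(f"1 GPU vs. one 4-core socket: {gpu_vs_socket:.1f}x")
```

Comparing against a full socket instead of a single core is often the fairer baseline, which is why the last ratio is included.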
Moving-Target Position Estimation Using GPU-Based Particle Filter for IoT Sensing Applications

Directory of Open Access Journals (Sweden)

Seongseop Kim

2017-11-01

-scaled IoT sensing applications, we use NVIDIA Tesla K40c as target GPU. The execution time of the proposed multi-state-space model-based algorithm is similar to the one-state-space model algorithm because of GPU-based parallel computing. Experimental results show that the proposed architecture is a feasible solution in terms of high-performance and area-efficient architecture.
A Monte Carlo neutron transport code for eigenvalue calculations on a dual-GPU system and CUDA environment

Energy Technology Data Exchange (ETDEWEB)

Liu, T.; Ding, A.; Ji, W.; Xu, X. G. [Nuclear Engineering and Engineering Physics, Rensselaer Polytechnic Inst., Troy, NY 12180 (United States)]; Carothers, C. D. [Dept. of Computer Science, Rensselaer Polytechnic Inst. RPI (United States)]; Brown, F. B. [Los Alamos National Laboratory (LANL) (United States)]

2012-07-01

The Monte Carlo (MC) method is able to accurately calculate eigenvalues in reactor analysis. Its lengthy computation time can be reduced by general-purpose computing on Graphics Processing Units (GPU), one of the latest parallel computing techniques under development. Porting a regular transport code to the GPU is usually very straightforward due to the 'embarrassingly parallel' nature of MC code. However, the situation is different for eigenvalue calculations, which are performed on a generation-by-generation basis, so that thread coordination must be handled explicitly. This paper presents our effort to develop such a GPU-based MC code in the Compute Unified Device Architecture (CUDA) environment. The code is able to perform eigenvalue calculations under simple geometries on a multi-GPU system. The specifics of the algorithm design, including thread organization and memory management, are described in detail. The original CPU version of the code was tested on an Intel Xeon X5660 2.8 GHz CPU, and the adapted GPU version was tested on NVIDIA Tesla M2090 GPUs. Double-precision floating point format was used throughout the calculation. The results showed that speedups of 7.0 and 33.3 were obtained for a bare spherical core and a binary slab system respectively. The speedup factor was further increased by a factor of approximately 2 on a dual GPU system. The upper limit of device-level parallelism was analyzed, and a possible method to enhance the thread-level parallelism was proposed. (authors)
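To illustrate why eigenvalue calculations proceed generation by generation, here is a minimal CPU-only sketch for an infinite homogeneous medium. It is not the authors' code: the function name, the one-group infinite-medium model, and all parameter values are assumptions chosen so that the analytic answer (k = p_fission × nu) is known. The end of each generation is the point where a GPU implementation would have to synchronise its threads.

```python
import random


def k_eigenvalue_mc(p_fission, nu, neutrons_per_generation,
                    generations, inactive, seed=0):
    """Estimate k for an infinite homogeneous medium, generation by generation."""
    rng = random.Random(seed)
    whole, frac = int(nu), nu - int(nu)
    estimates = []
    for generation in range(generations):
        produced = 0
        for _ in range(neutrons_per_generation):
            # With no leakage every source neutron is eventually absorbed;
            # an absorption causes fission with probability p_fission.
            if rng.random() < p_fission:
                # Sample an integer neutron yield whose mean is nu.
                produced += whole + (1 if rng.random() < frac else 0)
        # All histories of this generation must finish before k is tallied.
        k_generation = produced / neutrons_per_generation
        if generation >= inactive:
            estimates.append(k_generation)
        # The next generation restarts with the same number of source
        # neutrons (population renormalisation).
    return sum(estimates) / len(estimates)


# Hypothetical parameters: analytic k = 0.4 * 2.5 = 1.0
k_estimate = k_eigenvalue_mc(p_fission=0.4, nu=2.5,
                             neutrons_per_generation=10_000,
                             generations=25, inactive=5)
print(f"estimated k = {k_estimate:.4f}")
```

In a spatially dependent problem the fission sites produced in one generation become the source bank of the next, which is what forces the explicit thread coordination mentioned in the abstract.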
A Monte Carlo neutron transport code for eigenvalue calculations on a dual-GPU system and CUDA environment

International Nuclear Information System (INIS)

Liu, T.; Ding, A.; Ji, W.; Xu, X. G.; Carothers, C. D.; Brown, F. B.

2012-01-01

The Monte Carlo (MC) method is able to accurately calculate eigenvalues in reactor analysis. Its lengthy computation time can be reduced by general-purpose computing on Graphics Processing Units (GPU), one of the latest parallel computing techniques under development. Porting a regular transport code to the GPU is usually very straightforward due to the 'embarrassingly parallel' nature of MC code. However, the situation is different for eigenvalue calculations, which are performed on a generation-by-generation basis, so that thread coordination must be handled explicitly. This paper presents our effort to develop such a GPU-based MC code in the Compute Unified Device Architecture (CUDA) environment. The code is able to perform eigenvalue calculations under simple geometries on a multi-GPU system. The specifics of the algorithm design, including thread organization and memory management, are described in detail. The original CPU version of the code was tested on an Intel Xeon X5660 2.8 GHz CPU, and the adapted GPU version was tested on NVIDIA Tesla M2090 GPUs. Double-precision floating point format was used throughout the calculation. The results showed that speedups of 7.0 and 33.3 were obtained for a bare spherical core and a binary slab system respectively. The speedup factor was further increased by a factor of ∼2 on a dual GPU system. The upper limit of device-level parallelism was analyzed, and a possible method to enhance the thread-level parallelism was proposed. (authors)
R-GPU : A reconfigurable GPU architecture

NARCIS (Netherlands)

van den Braak, G.J.; Corporaal, H.

2016-01-01

Over the last decade, Graphics Processing Unit (GPU) architectures have evolved from a fixed-function graphics pipeline to a programmable, energy-efficient compute accelerator for massively parallel applications. The compute power arises from the GPU's Single Instruction/Multiple Threads
High-speed, multi-input, multi-output control using GPU processing in the HBT-EP tokamak

Energy Technology Data Exchange (ETDEWEB)

Rath, N., E-mail: Nikolaus@rath.org [Columbia University, Rm 200 Mudd, 500 W 120th St, New York, NY - 10027 (United States); Bialek, J.; Byrne, P.J.; DeBono, B.; Levesque, J.P.; Li, B.; Mauel, M.E.; Maurer, D.A.; Navratil, G.A.; Shiraki, D. [Columbia University, Rm 200 Mudd, 500 W 120th St, New York, NY - 10027 (United States)

2012-12-15

Highlights: ► We present a GPU based system for magnetic control of perturbed equilibria. ► Cycle times are below 5 μs and I/O latencies below 10 μs for 96 inputs and 64 outputs. ► A new architecture removes host RAM and CPU from the control cycle. ► GPU and DA/AD modules operate independently and communicate via PCIe peer-to-peer connections. ► The Linux host system does not require real-time extensions. - Abstract: We report on the design of a new plasma control system for the HBT-EP tokamak that utilizes a graphical processing unit (GPU) to magnetically control the 3D perturbed equilibrium state [1] of the plasma. The control system achieves cycle times of 5 μs and I/O latencies below 10 μs for up to 96 inputs and 64 outputs. The number of state variables is in the same order. To handle the resulting computational complexity under the given time constraints, the control algorithms are designed for massively parallel processing. The necessary hardware resources are provided by an NVIDIA Tesla M2050 GPU, offering a total of 448 computing cores running at 1.3 GHz each. A new control architecture allows control input from magnetic diagnostics to be pushed directly into GPU memory by a D-TACQ ACQ196 digitizer, and control output to be pulled directly from GPU memory by two D-TACQ AO32 analog output modules. By using peer-to-peer PCI express connections, this technique completely eliminates the use of host RAM and central processing unit (CPU) from the control cycle, permitting single-digit microsecond latencies on a standard Linux host system without any real-time extensions.
Sub-second pencil beam dose calculation on GPU for adaptive proton therapy.

Science.gov (United States)

da Silva, Joakim; Ansorge, Richard; Jena, Rajesh

2015-06-21

Although proton therapy delivered using scanned pencil beams has the potential to produce better dose conformity than conventional radiotherapy, the created dose distributions are more sensitive to anatomical changes and patient motion. Therefore, the introduction of adaptive treatment techniques where the dose can be monitored as it is being delivered is highly desirable. We present a GPU-based dose calculation engine relying on the widely used pencil beam algorithm, developed for on-line dose calculation. The calculation engine was implemented from scratch, with each step of the algorithm parallelized and adapted to run efficiently on the GPU architecture. To ensure fast calculation, it employs several application-specific modifications and simplifications, and a fast scatter-based implementation of the computationally expensive kernel superposition step. The calculation time for a skull base treatment plan using two beam directions was 0.22 s on an Nvidia Tesla K40 GPU, whereas a test case of a cubic target in water from the literature took 0.14 s to calculate. The accuracy of the patient dose distributions was assessed by calculating the γ-index with respect to a gold standard Monte Carlo simulation. The passing rates were 99.2% and 96.7%, respectively, for the 3%/3 mm and 2%/2 mm criteria, matching those produced by a clinical treatment planning system.
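The γ-index passing rates quoted above combine a dose-difference criterion with a distance-to-agreement criterion. The sketch below shows the idea on a one-dimensional profile; it is a simplified illustration, not the evaluation code used in the paper. The function names, the global normalisation to the maximum reference dose, and the sample profiles are all assumptions.

```python
import math


def gamma_index_1d(reference, evaluated, spacing_mm, dose_tol_fraction, dta_mm):
    """Gamma value at every reference point (global dose normalisation)."""
    dose_tol = dose_tol_fraction * max(reference)
    gammas = []
    for i, d_ref in enumerate(reference):
        best = math.inf
        for j, d_eval in enumerate(evaluated):
            distance = (j - i) * spacing_mm
            value = math.sqrt((distance / dta_mm) ** 2
                              + ((d_eval - d_ref) / dose_tol) ** 2)
            best = min(best, value)
        gammas.append(best)
    return gammas


def passing_rate(gammas):
    """Fraction of points with gamma <= 1."""
    return sum(g <= 1.0 for g in gammas) / len(gammas)


# Hypothetical profiles sampled every 1 mm; the evaluated dose is 2% too high.
reference = [0.0, 10.0, 50.0, 100.0, 100.0, 50.0, 10.0, 0.0]
evaluated = [d * 1.02 for d in reference]

rate_3_3 = passing_rate(gamma_index_1d(reference, evaluated, 1.0, 0.03, 3.0))
print(f"3%/3 mm passing rate: {rate_3_3:.1%}")
```

A clinical evaluation would work on 3D dose grids and usually interpolates between voxels; the exhaustive double loop here is only meant to show the definition.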

Spectral-element simulation of two-dimensional elastic wave propagation in fully heterogeneous media on a GPU cluster

Science.gov (United States)

Rudianto, Indra; Sudarmaji

2018-04-01

We present an implementation of the spectral-element method for simulation of two-dimensional elastic wave propagation in fully heterogeneous media. We have incorporated most realistic geological features in the model, including surface topography, curved layer interfaces, and 2-D wave-speed heterogeneity. To accommodate such complexity, we use an unstructured quadrilateral meshing technique. The simulation was performed on a GPU cluster consisting of 24 Intel Xeon CPU cores and 4 NVIDIA Quadro graphics cards, using a CUDA and MPI implementation. We speed up the computation by a factor of about 5 compared to the MPI-only implementation, and by a factor of about 40 compared to the serial implementation.
High-throughput protein crystallization on the World Community Grid and the GPU

International Nuclear Information System (INIS)

Kotseruba, Yulia; Cumbaa, Christian A; Jurisica, Igor

2012-01-01

We have developed CPU and GPU versions of an automated image analysis and classification system for protein crystallization trial images from the Hauptman Woodward Institute's High-Throughput Screening lab. The analysis step computes 12,375 numerical features per image. Using these features, we have trained a classifier that distinguishes 11 different crystallization outcomes, recognizing 80% of all crystals, 94% of clear drops, 94% of precipitates. The computing requirements for this analysis system are large. The complete HWI archive of 120 million images is being processed by the donated CPU cycles on World Community Grid, with a GPU phase launching in early 2012. The main computational burden of the analysis is the measure of textural (GLCM) features within the image at multiple neighbourhoods, distances, and at multiple greyscale intensity resolutions. CPU runtime averages 4,092 seconds (single threaded) on an Intel Xeon, but only 65 seconds on an NVIDIA Tesla C2050. We report on the process of adapting the C++ code to OpenCL, optimized for multiple platforms.
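The grey-level co-occurrence matrix (GLCM) features named above as the main computational burden can be sketched in a few lines. This is a minimal illustration of one offset and one feature (contrast), not the authors' C++/OpenCL code; the function names and the small 4-level test image are assumptions. The real system repeats this over many neighbourhoods, distances and intensity resolutions, which is where the cost comes from.

```python
def glcm(image, levels, dx, dy):
    """Normalised grey-level co-occurrence matrix for pixel offset (dx, dy)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    total = 0
    for y in range(rows):
        for x in range(cols):
            ny, nx = y + dy, x + dx
            if 0 <= ny < rows and 0 <= nx < cols:
                # Count the pair (value here, value at the offset pixel).
                counts[image[y][x]][image[ny][nx]] += 1
                total += 1
    return [[c / total for c in row] for row in counts]


def contrast(p):
    """Contrast texture feature: sum of p[i][j] * (i - j)^2."""
    n = len(p)
    return sum(p[i][j] * (i - j) ** 2 for i in range(n) for j in range(n))


# Hypothetical 4x4 image already quantised to 4 grey levels.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
p = glcm(image, levels=4, dx=1, dy=0)   # horizontal neighbour, distance 1
feature = contrast(p)
print(f"contrast = {feature:.4f}")
```

Each (offset, resolution) combination is independent of the others, which makes this kind of feature extraction a natural fit for data-parallel hardware.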
Personal Supercomputing for Monte Carlo Simulation Using a GPU

Energy Technology Data Exchange (ETDEWEB)

Oh, Jae-Yong; Koo, Yang-Hyun; Lee, Byung-Ho [Korea Atomic Energy Research Institute, Daejeon (Korea, Republic of)]

2008-05-15

Since the usability, accessibility, and maintenance of a personal computer (PC) are very good, a PC is a useful computer simulation tool for researchers. With the improved performance of a PC's CPU, it has enough calculation power to simulate a small-scale system. However, if a system is large or has a long time scale, a cluster computer or supercomputer is needed. Recently great changes have occurred in the PC calculation environment. A graphics processing unit (GPU) on a graphics card, previously used only to calculate display data, has a superior calculation capability to a PC's CPU. This GPU calculation performance is a match for a supercomputer of the year 2000. Although it has such great calculation potential, it is not easy to program a simulation code for the GPU because of the difficult programming techniques needed to convert a calculation matrix to a 3D rendering image using graphics APIs. In 2006, NVIDIA provided the Software Development Kit (SDK) for the programming environment for NVIDIA's graphics cards, which is called the Compute Unified Device Architecture (CUDA). It makes programming on the GPU easy without knowledge of the graphics APIs. This paper describes the basic architectures of NVIDIA's GPU and CUDA, and carries out a performance benchmark for the Monte Carlo simulation.
Personal Supercomputing for Monte Carlo Simulation Using a GPU

International Nuclear Information System (INIS)

Oh, Jae-Yong; Koo, Yang-Hyun; Lee, Byung-Ho

2008-01-01

Since the usability, accessibility, and maintenance of a personal computer (PC) are very good, a PC is a useful computer simulation tool for researchers. With the improved performance of a PC's CPU, it has enough calculation power to simulate a small-scale system. However, if a system is large or has a long time scale, a cluster computer or supercomputer is needed. Recently great changes have occurred in the PC calculation environment. A graphics processing unit (GPU) on a graphics card, previously used only to calculate display data, has a superior calculation capability to a PC's CPU. This GPU calculation performance is a match for a supercomputer of the year 2000. Although it has such great calculation potential, it is not easy to program a simulation code for the GPU because of the difficult programming techniques needed to convert a calculation matrix to a 3D rendering image using graphics APIs. In 2006, NVIDIA provided the Software Development Kit (SDK) for the programming environment for NVIDIA's graphics cards, which is called the Compute Unified Device Architecture (CUDA). It makes programming on the GPU easy without knowledge of the graphics APIs. This paper describes the basic architectures of NVIDIA's GPU and CUDA, and carries out a performance benchmark for the Monte Carlo simulation.
THEWASP library. Thermodynamic water and steam properties library in GPU

International Nuclear Information System (INIS)

Waintraub, M.; Lapa, C.M.F.; Mol, A.C.A.; Heimlich, A.

2011-01-01

In this paper we present a new library for thermodynamic evaluation of water properties, THEWASP. This library consists of C++ and CUDA based programs used to accelerate function evaluation using GPUs and GPU clusters. Global optimization problems need thousands of evaluations of the objective function to find the global optimum, implying several days of expensive processing. This problem motivated us to seek a way to speed up our code, as an alternative to using MPI on Beowulf clusters, which increases the cost in terms of electricity, air conditioning and other overheads. GPU based programming can accelerate the implementation up to 100 times and help increase the number of evaluations in global optimization problems using, for example, the PSO or DE algorithms. THEWASP is based on the water-steam formulations published by the International Association for the Properties of Water and Steam, Lucerne - Switzerland, and provides several temperature and pressure function evaluations, such as specific heat, specific enthalpy, specific entropy and also some inverse maps. In this study we evaluated the gain in speed and performance and compared it with a CPU based processing library. (author)
Parallel fuzzy connected image segmentation on GPU.

Science.gov (United States)

Zhuge, Ying; Cao, Yong; Udupa, Jayaram K; Miller, Robert W

2011-07-01

Image segmentation techniques using fuzzy connectedness (FC) principles have shown their effectiveness in segmenting a variety of objects in several large applications. However, one challenge in these algorithms has been their excessive computational requirements when processing large image datasets. Nowadays, commodity graphics hardware provides a highly parallel computing environment. In this paper, the authors present a parallel fuzzy connected image segmentation algorithm implementation on NVIDIA's Compute Unified Device Architecture (CUDA) platform for segmenting medical image data sets. In the FC algorithm, there are two major computational tasks: (i) computing the fuzzy affinity relations and (ii) computing the fuzzy connectedness relations. These two tasks are implemented as CUDA kernels and executed on the GPU. A dramatic improvement in speed for both tasks is achieved as a result. Our experiments based on three data sets of small, medium, and large data size demonstrate the efficiency of the parallel algorithm, which achieves speed-up factors of 24.4x, 18.1x, and 10.3x, respectively, for the three data sets on the NVIDIA Tesla C1060 over the implementation of the algorithm on the CPU, and takes 0.25, 0.72, and 15.04 s, respectively, for the three data sets. The authors developed a parallel algorithm of the widely used fuzzy connected image segmentation method on NVIDIA GPUs, which are far more cost- and speed-effective than both clusters of workstations and multiprocessing systems. A near-interactive speed of segmentation has been achieved, even for the large data set.
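The two computational tasks named in this abstract, fuzzy affinity and fuzzy connectedness, can be sketched sequentially as follows. This is not the authors' CUDA implementation: the Gaussian affinity on intensity differences, the 4-neighbourhood, the Dijkstra-style propagation and the toy image are all assumptions made for illustration. The connectedness of a pixel to the seed is the strength of the best path, where a path is only as strong as its weakest link.

```python
import heapq
import math


def affinity(a, b, sigma):
    """Task (i): fuzzy affinity of two adjacent pixels (assumed Gaussian form)."""
    return math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2))


def fuzzy_connectedness(image, seed, sigma):
    """Task (ii): max-min path strength from the seed to every pixel."""
    rows, cols = len(image), len(image[0])
    strength = [[0.0] * cols for _ in range(rows)]
    strength[seed[0]][seed[1]] = 1.0
    heap = [(-1.0, seed[0], seed[1])]          # max-heap via negated strength
    while heap:
        negated, r, c = heapq.heappop(heap)
        current = -negated
        if current < strength[r][c]:
            continue                            # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                # A path is as strong as its weakest affinity link.
                candidate = min(current,
                                affinity(image[r][c], image[nr][nc], sigma))
                if candidate > strength[nr][nc]:
                    strength[nr][nc] = candidate
                    heapq.heappush(heap, (-candidate, nr, nc))
    return strength


# Hypothetical image: a dark 2x2 object in the corner of a bright background.
image = [
    [10, 10, 50],
    [10, 10, 50],
    [50, 50, 50],
]
connectedness = fuzzy_connectedness(image, seed=(0, 0), sigma=10.0)
segmented = [[value >= 0.5 for value in row] for row in connectedness]
print(segmented)
```

Thresholding the connectedness map yields the segmented object; a GPU version would typically replace the priority queue with iterative relaxation over all pixels in parallel.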
Multi-GPU Accelerated Admittance Method for High-Resolution Human Exposure Evaluation.

Science.gov (United States)

Xiong, Zubiao; Feng, Shi; Kautz, Richard; Chandra, Sandeep; Altunyurt, Nevin; Chen, Ji

2015-12-01

A multi-graphics processing unit (GPU) accelerated admittance method solver is presented for solving the induced electric field in high-resolution anatomical models of human body when exposed to external low-frequency magnetic fields. In the solver, the anatomical model is discretized as a three-dimensional network of admittances. The conjugate orthogonal conjugate gradient (COCG) iterative algorithm is employed to take advantage of the symmetric property of the complex-valued linear system of equations. Compared against the widely used biconjugate gradient stabilized method, the COCG algorithm can reduce the solving time by 3.5 times and reduce the storage requirement by about 40%. The iterative algorithm is then accelerated further by using multiple NVIDIA GPUs. The computations and data transfers between GPUs are overlapped in time by using asynchronous concurrent execution design. The communication overhead is well hidden so that the acceleration is nearly linear with the number of GPU cards. Numerical examples show that our GPU implementation running on four NVIDIA Tesla K20c cards can reach 90 times faster than the CPU implementation running on eight CPU cores (two Intel Xeon E5-2603 processors). The implemented solver is able to solve large dimensional problems efficiently. A whole adult body discretized in 1-mm resolution can be solved in just several minutes. The high efficiency achieved makes it practical to investigate human exposure involving a large number of cases with a high resolution that meets the requirements of international dosimetry guidelines.
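The COCG iteration mentioned above differs from ordinary conjugate gradients in using the unconjugated product u^T v, which is what lets it exploit complex symmetric (not Hermitian) matrices. The following is a small dense, pure-Python sketch of the textbook iteration, not the multi-GPU solver from the paper; the function names and the 3×3 test system are assumptions.

```python
def matvec(a, v):
    """Dense matrix-vector product."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]


def dot_unconjugated(u, v):
    """Bilinear form u^T v with no complex conjugation, as used by COCG."""
    return sum(x * y for x, y in zip(u, v))


def norm(v):
    return sum(abs(x) ** 2 for x in v) ** 0.5


def cocg(a, b, tol=1e-12, max_iter=100):
    """Solve a x = b for a complex symmetric matrix a (a equals its transpose)."""
    x = [0j] * len(b)
    r = list(b)                       # residual for the zero initial guess
    p = list(r)
    rho = dot_unconjugated(r, r)
    for _ in range(max_iter):
        q = matvec(a, p)
        alpha = rho / dot_unconjugated(p, q)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        if norm(r) <= tol * norm(b):
            break
        rho_new = dot_unconjugated(r, r)
        beta = rho_new / rho
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rho = rho_new
    return x


# Hypothetical complex symmetric test system.
a = [
    [4 + 1j, 1 + 0j, 0j],
    [1 + 0j, 3 + 2j, 1j],
    [0j, 1j, 5 - 1j],
]
b = [1 + 0j, 2j, 3 + 0j]
x = cocg(a, b)
residual = norm([ax_i - b_i for ax_i, b_i in zip(matvec(a, x), b)])
print(f"residual norm = {residual:.2e}")
```

Each iteration needs only one matrix-vector product and three vectors of working storage, which is consistent with the time and memory savings over BiCGSTAB reported in the abstract.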
Seismic Shot Processing on GPU

OpenAIRE

Johansen, Owe

2009-01-01

Today's petroleum industry demands an ever increasing amount of computational resources. Seismic processing applications in use by these types of companies have generally been using large clusters of compute nodes, whose only computing resource has been the CPU. However, using Graphics Processing Units (GPU) for general purpose programming is these days becoming increasingly popular in the high performance computing area. In 2007, NVIDIA corporation launched their framework for develo...
GPU - Accelerated Monte Carlo electron transport methods: development and application for radiation dose calculations using 6 GPU cards

International Nuclear Information System (INIS)

Su, L.; Du, X.; Liu, T.; Xu, X. G.

2013-01-01

An electron-photon coupled Monte Carlo code ARCHER - Accelerated Radiation-transport Computations in Heterogeneous EnviRonments - is being developed at Rensselaer Polytechnic Institute as a software test-bed for emerging heterogeneous high performance computers that utilize accelerators such as GPUs (Graphics Processing Units). This paper presents the preliminary code development and the testing involving radiation dose related problems. In particular, the paper discusses the electron transport simulations using the class-II condensed history method. The considered electron energy ranges from a few hundred keV to 30 MeV. As for the photon part, the photoelectric effect, Compton scattering and pair production were simulated. Voxelized geometry was supported. A serial CPU (Central Processing Unit) code was first written in C++. The code was then transplanted to the GPU using the CUDA C 5.0 standards. The hardware involved a desktop PC with an Intel Xeon X5660 CPU and six NVIDIA Tesla M2090 GPUs. The code was tested for the case of a 20 MeV electron beam incident perpendicularly on a water-aluminum-water phantom. The depth and lateral dose profiles were found to agree with results obtained from well tested MC codes. Using six GPU cards, 6 × 10^6 electron histories were simulated within 2 seconds. In comparison, the same case running the EGSnrc and MCNPX codes required 1645 seconds and 9213 seconds, respectively. On-going work continues to test the code for different medical applications such as radiotherapy and brachytherapy. (authors)
The gputools package enables GPU computing in R.

Science.gov (United States)

Buckner, Joshua; Wilson, Justin; Seligman, Mark; Athey, Brian; Watson, Stanley; Meng, Fan

2010-01-01

By default, the R statistical environment does not make use of parallelism. Researchers may resort to expensive solutions such as cluster hardware for large analysis tasks. Graphics processing units (GPUs) provide an inexpensive and computationally powerful alternative. Using R and the CUDA toolkit from Nvidia, we have implemented several functions commonly used in microarray gene expression analysis for GPU-equipped computers. R users can take advantage of the better performance provided by an Nvidia GPU. The package is available from CRAN, the R project's repository of packages, at http://cran.r-project.org/web/packages/gputools . More information about our gputools R package is available at http://brainarray.mbni.med.umich.edu/brainarray/Rgpgpu
Semiempirical Quantum Chemical Calculations Accelerated on a Hybrid Multicore CPU-GPU Computing Platform.

Science.gov (United States)

Wu, Xin; Koslowski, Axel; Thiel, Walter

2012-07-10

In this work, we demonstrate that semiempirical quantum chemical calculations can be accelerated significantly by leveraging the graphics processing unit (GPU) as a coprocessor on a hybrid multicore CPU-GPU computing platform. Semiempirical calculations using the MNDO, AM1, PM3, OM1, OM2, and OM3 model Hamiltonians were systematically profiled for three types of test systems (fullerenes, water clusters, and solvated crambin) to identify the most time-consuming sections of the code. The corresponding routines were ported to the GPU and optimized employing both existing library functions and a GPU kernel that carries out a sequence of noniterative Jacobi transformations during pseudodiagonalization. The overall computation times for single-point energy calculations and geometry optimizations of large molecules were reduced by one order of magnitude for all methods, as compared to runs on a single CPU core.
Numerical Modeling of 3D Seismic Wave Propagation around Yogyakarta, the Southern Part of Central Java, Indonesia, Using Spectral-Element Method on MPI-GPU Cluster

Science.gov (United States)

Sudarmaji; Rudianto, Indra; Eka Nurcahya, Budi

2018-04-01

A strong tectonic earthquake with a magnitude of 5.9 on the Richter scale occurred in Yogyakarta and Central Java on May 26, 2006. The earthquake caused severe damage in Yogyakarta and the southern part of Central Java, Indonesia. Understanding the seismic response to the earthquake, in terms of ground shaking and the level of building damage, is important. We present numerical modeling of 3D seismic wave propagation around Yogyakarta and the southern part of Central Java using the spectral-element method on an MPI-GPU (Graphics Processing Unit) computer cluster to observe the seismic response due to the earthquake. The homogeneous 3D realistic model is generated with detailed surface topography. The influences of free surface topography and layer discontinuity of the 3D model on the seismic response are observed. The seismic wave field is discretized using the spectral-element method, which is solved on a mesh of hexahedral elements adapted to the free surface topography and the internal discontinuity of the model. To increase the data processing capabilities, the simulation is performed on a GPU cluster with an implementation of MPI (Message Passing Interface).
irGPU.proton.Net: Irregular strong charge interaction networks of protonatable groups in protein molecules--a GPU solver using the fast multipole method and statistical thermodynamics.

Science.gov (United States)

Kantardjiev, Alexander A

2015-04-05

A cluster of strongly interacting ionization groups in protein molecules with irregular ionization behavior is suggestive of a specific structure-function relationship. However, their computational treatment is unconventional (e.g., lack of convergence in the naive self-consistent iterative algorithm). A stringent treatment requires evaluation of Boltzmann averaged statistical mechanics sums and electrostatic energy estimation for each microstate. irGPU: Irregular strong interactions in proteins--a GPU solver is a novel solution to a versatile problem in protein biophysics--atypical protonation behavior of coupled groups. The computational severity of the problem is alleviated by parallelization (via GPU kernels), which is applied to the electrostatic interaction evaluation (including explicit electrostatics via the fast multipole method) as well as to the estimation of statistical mechanics sums (partition function). Special attention is given to the ease of use of the service and the encapsulation of theoretical details without sacrificing the rigor of the computational procedures. irGPU is not just a solution-in-principle but a promising practical application with the potential to entice the community into a deeper understanding of the principles governing biomolecule mechanisms. © 2015 Wiley Periodicals, Inc.
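The Boltzmann-averaged sums over microstates mentioned in this abstract can be illustrated with a brute-force enumeration. This sketch is not the irGPU method: the simple energy model (a per-site term plus pairwise interaction terms, in units of kT), the function name and the example values are all assumptions. It enumerates all 2^N protonation microstates, which is the exponential cost that motivates GPU parallelization.

```python
import itertools
import math


def mean_protonation(site_energies, pair_energies, kT=1.0):
    """Boltzmann-averaged occupancy of each protonatable site.

    site_energies[i]    : energy of protonating site i alone
    pair_energies[i][j] : extra energy when sites i and j are both protonated
    """
    n = len(site_energies)
    partition_function = 0.0
    weighted_occupancy = [0.0] * n
    for state in itertools.product((0, 1), repeat=n):
        energy = sum(e for e, s in zip(site_energies, state) if s)
        for i in range(n):
            for j in range(i + 1, n):
                if state[i] and state[j]:
                    energy += pair_energies[i][j]
        weight = math.exp(-energy / kT)
        partition_function += weight
        for i, occupied in enumerate(state):
            if occupied:
                weighted_occupancy[i] += weight
    return [w / partition_function for w in weighted_occupancy]


# Two hypothetical sites with no intrinsic preference.
independent = mean_protonation([0.0, 0.0], [[0.0, 0.0], [0.0, 0.0]])
# The same sites with a strong repulsion (10 kT) when both are protonated.
coupled = mean_protonation([0.0, 0.0], [[0.0, 10.0], [10.0, 0.0]])
print(independent, coupled)
```

With strong coupling the doubly protonated state is almost never populated, so each site's average occupancy drops from 1/2 to about 1/3: a simple instance of the irregular titration behavior of coupled groups.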
GPU computing and applications

CERN Document Server

See, Simon

2015-01-01

This book presents a collection of state of the art research on GPU Computing and Application. The major part of this book is selected from the work presented at the 2013 Symposium on GPU Computing and Applications held in Nanyang Technological University, Singapore (Oct 9, 2013). Three major domains of GPU application are covered in the book including (1) Engineering design and simulation; (2) Biomedical Sciences; and (3) Interactive & Digital Media. The book also addresses the fundamental issues in GPU computing with a focus on big data processing. Researchers and developers in GPU Computing and Applications will benefit from this book. Training professionals and educators can also benefit from this book to learn the possible application of GPU technology in various areas.
Parallel Computer System for 3D Visualization Stereo on GPU

Science.gov (United States)

Al-Oraiqat, Anas M.; Zori, Sergii A.

2018-03-01

This paper proposes the organization of a parallel computer system based on Graphics Processing Units (GPUs) for 3D stereo image synthesis. The development is based on the modified ray tracing method developed by the authors for fast search of the intersections of tracing rays with scene objects. The system allows a significant increase in productivity for 3D stereo synthesis of photorealistic quality. The generalized procedure of 3D stereo image synthesis on the Graphics Processing Unit/Graphics Processing Clusters (GPU/GPC) is proposed. The efficiency of the proposed solutions by GPU implementation is compared with single-threaded and multithreaded implementations on the CPU. The achieved average acceleration in multi-thread implementation on the test GPU and CPU is about 7.5 and 1.6 times, respectively. Studying the influence of choosing the size and configuration of the computational Compute Unified Device Architecture (CUDA) network on the computational speed shows the importance of their correct selection. The obtained experimental estimations can be significantly improved by new GPUs with a large number of processing cores and multiprocessors, as well as by an optimized configuration of the computing CUDA network.
Multi-Kepler GPU vs. multi-Intel MIC for spin systems simulations

Science.gov (United States)

Bernaschi, M.; Bisson, M.; Salvadore, F.

2014-10-01

We present and compare the performance of two many-core architectures, the Nvidia Kepler and the Intel MIC, both in a single system and in cluster configuration, for the simulation of spin systems. As a benchmark we consider the time required to update a single spin of the 3D Heisenberg spin glass model by using the over-relaxation algorithm. We also present data for a traditional high-end multi-core architecture: the Intel Sandy Bridge. The results show that although it is possible to use basically the same code on the two Intel architectures, the performance of an Intel MIC changes dramatically depending on (apparently) minor details. Another issue is that to obtain reasonable scalability with the Intel Phi coprocessor (Phi is the coprocessor that implements the MIC architecture) in a cluster configuration, it is necessary to use the so-called offload mode, which reduces the performance of the single system. As to the GPU, the Kepler architecture offers a clear advantage with respect to the previous Fermi architecture while maintaining exactly the same source code. Scalability of the multi-GPU implementation remains very good by using the CPU as a communication co-processor of the GPU. All source codes are provided for inspection and for double-checking the results.
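The single-spin update used as the benchmark above can be illustrated with a minimal sketch of the standard over-relaxation move for a Heisenberg spin: the spin is reflected about its local field, which preserves both its length and its energy. This is not the authors' code; the function names and the plain-tuple representation are assumptions made for illustration, and the local field (the coupling-weighted sum of neighbouring spins) is taken as given.

```python
def dot(a, b):
    """Dot product of two 3-component vectors."""
    return sum(x * y for x, y in zip(a, b))


def overrelax(spin, field):
    """Reflect `spin` about `field`: s' = 2 (s.h / h.h) h - s.

    `field` is the local field h acting on the spin (assumed to be
    precomputed from the neighbouring spins and couplings).
    """
    hh = dot(field, field)
    if hh == 0.0:
        return tuple(spin)  # no field: leave the spin unchanged
    factor = 2.0 * dot(spin, field) / hh
    return tuple(factor * h - s for s, h in zip(spin, field))


if __name__ == "__main__":
    s = (1.0, 0.0, 0.0)
    h = (1.0, 1.0, 0.0)
    s_new = overrelax(s, h)
    print(s_new)                     # reflected spin
    print(dot(s, h), dot(s_new, h))  # s.h is unchanged by the move
```

Because the move is deterministic and touches only one spin and its neighbours, spins that do not interact directly can be updated concurrently, which is what makes the update time per spin a natural benchmark on many-core hardware.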
Vulnerable GPU Memory Management: Towards Recovering Raw Data from GPU

Directory of Open Access Journals (Sweden)

Zhou Zhe

2017-04-01

According to previous reports, information could be leaked from GPU memory; however, the security implications of such a threat were mostly overlooked, because only limited information could be indirectly extracted through side-channel attacks. In this paper, we propose a novel algorithm for recovering raw data directly from the GPU memory residues of many popular applications such as Google Chrome and Adobe PDF reader. Our algorithm enables harvesting highly sensitive information including credit card numbers and email contents from GPU memory residues. Evaluation results also indicate that nearly all GPU-accelerated applications are vulnerable to such attacks, and adversaries can launch attacks without requiring any special privileges both on traditional multi-user operating systems and in emerging cloud computing scenarios.
Manual of Tesla Experiments; Handbuch Tesla Experimente

Energy Technology Data Exchange (ETDEWEB)

Wahl, Guenter

2009-07-01

The first part, "Making Lightning and Thunder", describes a number of Tesla generators that can generate, e.g., coloured light arcs, ball lightning and swords of lightning. The second part, "New Experiments with EMP, Tesla Waves and Microwaves", presents a solid state Tesla generator for generating electrodynamic vortices and proposes circuit alternatives to generate electromagnetic pulses (EMP). Further, mysterious Tesla wave, microwave and scalar wave generators are presented, as well as exotic Star Wars experiments like mass accelerators and plasma guns. The third section describes, among others, a tube-driven Tesla generator with 50 cm streamers. The reader will also find a catalogue of Information Unlimited, USA, a supplier of many of the kits, circuit diagrams and devices presented here. (orig.)
Big Data GPU-Driven Parallel Processing Spatial and Spatio-Temporal Clustering Algorithms

Science.gov (United States)

Konstantaras, Antonios; Skounakis, Emmanouil; Kilty, James-Alexander; Frantzeskakis, Theofanis; Maravelakis, Emmanuel

2016-04-01

Advances in graphics processing unit technology towards parallel architectures [1], comprising thousands of cores and many parallel threads, provide the hardware foundation for the rapid processing of various parallel applications in seismic big data analysis. Seismic data are normally stored as collections of vectors in massive matrices, growing rapidly in size as wider areas are covered, denser recording networks are being established and decades of data are being compiled together [2]. Yet, many processes in seismic data analysis are performed on each seismic event independently or as distinct tiles [3] of specific grouped seismic events within a much larger data set. Such processes, being independent of one another, can be performed in parallel, narrowing down processing times drastically [1,3]. This research work presents the development and implementation of three parallel processing algorithms using CUDA C [4] for the investigation of potentially distinct seismic regions [5,6] present in the vicinity of the southern Hellenic seismic arc. The algorithms, programmed and executed in parallel for comparison, are: fuzzy k-means clustering with expert knowledge [7] in assigning the overall number of clusters; density-based clustering [8]; and a self-developed spatio-temporal clustering algorithm encompassing expert [9] and empirical knowledge [10] for the specific area under investigation. Indexing terms: GPU parallel programming, CUDA C, heterogeneous processing, distinct seismic regions, parallel clustering algorithms, spatio-temporal clustering References [1] Kirk, D. and Hwu, W.: 'Programming massively parallel processors - A hands-on approach', 2nd Edition, Morgan Kaufmann Publisher, 2013 [2] Konstantaras, A., Valianatos, F., Varley, M.R. and Makris, J.P.: 'Soft-Computing Modelling of Seismicity in the Southern Hellenic Arc', Geoscience and Remote Sensing Letters, vol. 5 (3), pp. 323-327, 2008 [3] Papadakis, S. and
Implementation of metal-friendly EAM/FS-type semi-empirical potentials in HOOMD-blue: A GPU-accelerated molecular dynamics software

Science.gov (United States)

Yang, Lin; Zhang, Feng; Wang, Cai-Zhuang; Ho, Kai-Ming; Travesset, Alex

2018-04-01

We present an implementation of EAM and FS interatomic potentials, which are widely used in simulating metallic systems, in HOOMD-blue, a software package designed to perform classical molecular dynamics simulations using GPU acceleration. We first discuss the details of our implementation and then report extensive benchmark tests. We demonstrate that single-precision floating point operations efficiently implemented on GPUs can produce sufficient accuracy when compared against double-precision codes, as shown in test calculations of the glass-transition temperature of Cu64.5Zr35.5 and the pair correlation function g(r) of liquid Ni3Al. Our code scales well with the size of the simulated system on NVIDIA Tesla M40 and P100 GPUs. Compared with another popular software package, LAMMPS, running on 32 cores of AMD Opteron 6220 processors, the GPU/CPU performance ratio can reach as high as 4.6. The source code can be accessed through the HOOMD-blue web page for free by any interested user.

Cost-effective GPU-grid for genome-wide epistasis calculations.

Science.gov (United States)

Pütz, B; Kam-Thong, T; Karbalai, N; Altmann, A; Müller-Myhsok, B

2013-01-01

Until recently, genotype studies were limited to the investigation of single SNP effects due to the computational burden incurred when studying pairwise interactions of SNPs. However, some genetic effects as simple as coloring (in plants and animals) cannot be ascribed to a single locus but can only be understood when epistasis is taken into account [1]. It is expected that such effects are also found in complex diseases where many genes contribute to the clinical outcome of affected individuals. Only recently have such problems become feasible computationally. The inherently parallel structure of the problem makes it a perfect candidate for massive parallelization on either grid or cloud architectures. Since we are also dealing with confidential patient data, we were not able to consider a cloud-based solution but had to find a way to process the data in-house and aimed to build a local GPU-based grid structure. Sequential epistasis calculations were ported to GPU using CUDA at various levels. Parallelization on the CPU was compared to corresponding GPU counterparts with regard to performance and cost. A cost-effective solution was created by combining custom-built nodes equipped with relatively inexpensive consumer-level graphics cards with highly parallel GPUs in a local grid. The GPU method outperforms current cluster-based systems on a price/performance criterion, as a single GPU shows speed performance comparable to that of up to 200 CPU cores. The outlined approach will work for problems that easily lend themselves to massive parallelization. Code for various tasks has been made available and ongoing development of tools will further ease the transition from sequential to parallel algorithms.
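The computational burden of pairwise interactions mentioned above comes from the number of SNP pairs, n(n-1)/2, which grows quadratically with the number of SNPs n. The sketch below counts the pairs and shows one simple way to deal independent pairs out to workers; it is an illustration only, not the authors' code, and the SNP count of 500,000 and the function names are assumptions.

```python
from itertools import combinations


def pair_count(n_snps):
    """Number of unordered SNP pairs (i < j)."""
    return n_snps * (n_snps - 1) // 2


def chunk_pairs(n_snps, n_workers):
    """Deal all SNP pairs round-robin into `n_workers` lists.

    Each pair can be tested independently of every other pair, so the
    chunks can be processed in parallel with no communication.
    """
    chunks = [[] for _ in range(n_workers)]
    for k, pair in enumerate(combinations(range(n_snps), 2)):
        chunks[k % n_workers].append(pair)
    return chunks


if __name__ == "__main__":
    # Hypothetical genome-wide panel size, chosen only for illustration.
    print(pair_count(500_000))
    # Tiny example: 5 SNPs shared between 2 workers.
    for worker, pairs in enumerate(chunk_pairs(5, 2)):
        print(worker, pairs)
```

On a GPU the same independence is usually exploited by assigning one pair, or one tile of pairs, to each thread rather than by materialising lists of pairs.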
Collaborating CPU and GPU for large-scale high-order CFD simulations with complex grids on the TianHe-1A supercomputer

Energy Technology Data Exchange (ETDEWEB)

Xu, Chuanfu, E-mail: xuchuanfu@nudt.edu.cn [College of Computer Science, National University of Defense Technology, Changsha 410073 (China); Deng, Xiaogang; Zhang, Lilun [College of Computer Science, National University of Defense Technology, Changsha 410073 (China); Fang, Jianbin [Parallel and Distributed Systems Group, Delft University of Technology, Delft 2628CD (Netherlands); Wang, Guangxue; Jiang, Yi [State Key Laboratory of Aerodynamics, P.O. Box 211, Mianyang 621000 (China); Cao, Wei; Che, Yonggang; Wang, Yongxian; Wang, Zhenghua; Liu, Wei; Cheng, Xinghua [College of Computer Science, National University of Defense Technology, Changsha 410073 (China)

2014-12-01

Programming and optimizing complex, real-world CFD codes on current many-core accelerated HPC systems is very challenging, especially when CPUs and accelerators must collaborate to fully tap the potential of heterogeneous systems. In this paper, with a tri-level hybrid and heterogeneous programming model using MPI + OpenMP + CUDA, we port and optimize our high-order multi-block structured CFD software HOSTA on the GPU-accelerated TianHe-1A supercomputer. HOSTA adopts two self-developed high-order compact finite difference schemes, WCNS and HDCS, that can simulate flows with complex geometries. We present a dual-level parallelization scheme for efficient multi-block computation on GPUs and perform particular kernel optimizations for high-order CFD schemes. The GPU-only approach achieves a speedup of about 1.3 when comparing one Tesla M2050 GPU with two Xeon X5670 CPUs. To achieve a greater speedup, we make the CPU and GPU collaborate for HOSTA instead of using a naive GPU-only approach. We present a novel scheme to balance the loads between the store-poor GPU and the store-rich CPU. Taking CPU and GPU load balance into account, we improve the maximum simulation problem size per TianHe-1A node for HOSTA by 2.3×; meanwhile, the collaborative approach improves performance by around 45% compared to the GPU-only approach. Further, to scale HOSTA on TianHe-1A, we propose a gather/scatter optimization to minimize PCI-e data transfer times for ghost and singularity data of 3D grid blocks, and overlap the collaborative computation and communication as far as possible using some advanced CUDA and MPI features. Scalability tests show that HOSTA can achieve a parallel efficiency of above 60% on 1024 TianHe-1A nodes. With our method, we have successfully simulated an EET high-lift airfoil configuration containing 800M cells and China's large civil airplane configuration containing 150M cells. To the best of our knowledge, those are the largest-scale CPU–GPU collaborative simulations
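The idea of balancing load between CPU and GPU can be illustrated with an idealized static split, in which each device receives work in proportion to its throughput. This is a simplified sketch and not the balancing scheme used in HOSTA; the function names are hypothetical, and the relative rates (1.0 for the two CPUs, 1.3 for the GPU) are taken from the GPU-only speedup quoted in the abstract under the assumption that both devices keep those rates when working together.

```python
def split_work(total_cells, cpu_rate, gpu_rate):
    """Split cells so that CPU and GPU finish at about the same time.

    Returns (cpu_cells, gpu_cells), proportional to each device's rate.
    """
    gpu_cells = round(total_cells * gpu_rate / (cpu_rate + gpu_rate))
    return total_cells - gpu_cells, gpu_cells


def ideal_gain_over_gpu_only(cpu_rate, gpu_rate):
    """Upper bound on the speedup of CPU+GPU over GPU-only execution."""
    return (cpu_rate + gpu_rate) / gpu_rate


if __name__ == "__main__":
    cpu_rate, gpu_rate = 1.0, 1.3  # relative rates, assumed from the abstract
    print(split_work(1000, cpu_rate, gpu_rate))
    print(round(ideal_gain_over_gpu_only(cpu_rate, gpu_rate), 2))
```

Under these assumptions the ideal gain over GPU-only execution is about 1.77×; the roughly 45% improvement reported in the abstract is below that bound, as one would expect once data transfers and the CPU cores needed to drive the GPU are taken into account.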
Non-enhanced MR imaging of cerebral aneurysms: 7 Tesla versus 1.5 Tesla.

Science.gov (United States)

Wrede, Karsten H; Dammann, Philipp; Mönninghoff, Christoph; Johst, Sören; Maderwald, Stefan; Sandalcioglu, I Erol; Müller, Oliver; Özkan, Neriman; Ladd, Mark E; Forsting, Michael; Schlamann, Marc U; Sure, Ulrich; Umutlu, Lale

2014-01-01

To prospectively evaluate 7 Tesla time-of-flight (TOF) magnetic resonance angiography (MRA) in comparison to 1.5 Tesla TOF MRA and 7 Tesla non-contrast enhanced magnetization-prepared rapid acquisition gradient-echo (MPRAGE) for delineation of unruptured intracranial aneurysms (UIA). Sixteen neurosurgical patients (male n = 5, female n = 11) with single or multiple UIA were enrolled in this trial. All patients were accordingly examined at 7 Tesla and 1.5 Tesla MRI utilizing dedicated head coils. The following sequences were obtained: 7 Tesla TOF MRA, 1.5 Tesla TOF MRA and 7 Tesla non-contrast enhanced MPRAGE. Image analysis was performed by two radiologists with regard to delineation of aneurysm features (dome, neck, parent vessel), presence of artifacts, vessel-tissue-contrast and overall image quality. Interobserver accordance and intermethod comparisons were calculated by kappa coefficient and Lin's concordance correlation coefficient. A total of 20 intracranial aneurysms were detected in 16 patients, with two patients showing multiple aneurysms (n = 2, n = 4). Out of 20 intracranial aneurysms, 14 aneurysms were located in the anterior circulation and 6 aneurysms in the posterior circulation. 7 Tesla MPRAGE imaging was superior over 1.5 and 7 Tesla TOF MRA in the assessment of all considered aneurysm and image quality features (e.g. image quality: mean MPRAGE7T: 5.0; mean TOF7T: 4.3; mean TOF1.5T: 4.3). Ratings for 7 Tesla TOF MRA were equal or higher over 1.5 Tesla TOF MRA for all assessed features except for artifact delineation (mean TOF7T: 4.3; mean TOF1.5T 4.4). Interobserver accordance was good to excellent for most ratings. 7 Tesla MPRAGE imaging demonstrated its superiority in the detection and assessment of UIA as well as overall imaging features, offering excellent interobserver accordance and highest scores for all ratings. Hence, it may bear the potential to serve as a high-quality diagnostic tool for pretherapeutic assessment and
Graphics Processing Unit Enhanced Parallel Document Flocking Clustering

Energy Technology Data Exchange (ETDEWEB)

Cui, Xiaohui [ORNL; Potok, Thomas E [ORNL; ST Charles, Jesse Lee [ORNL

2010-01-01

Analyzing and clustering documents is a complex problem. One explored method of solving this problem borrows from nature, imitating the flocking behavior of birds. One limitation of this method of document clustering is its complexity, O(n^2). As the number of documents grows, it becomes increasingly difficult to generate results in a reasonable amount of time. In the last few years, the graphics processing unit (GPU) has received attention for its ability to solve highly-parallel and semi-parallel problems much faster than the traditional sequential processor. In this paper, we have conducted research to exploit this architecture and apply its strengths to the flocking based document clustering problem. Using the CUDA platform from NVIDIA, we developed a document flocking implementation to be run on the NVIDIA GEFORCE GPU. Performance gains ranged from thirty-six to nearly sixty times improvement of the GPU over the CPU implementation.
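The O(n^2) cost noted above arises because every document in the flock must be compared with every other one at each step. The sketch below shows that all-pairs structure for a simple neighbour count; it is an illustration only, not the authors' implementation, and the 2D positions, the radius parameter and the function name are assumptions.

```python
def count_neighbors(positions, radius):
    """For each point, count the other points within `radius`.

    The double loop performs n * (n - 1) distance checks, i.e. O(n^2).
    """
    n = len(positions)
    r2 = radius * radius
    counts = [0] * n
    for i in range(n):          # each i is independent of the others
        xi, yi = positions[i]
        for j in range(n):
            if i == j:
                continue
            dx = xi - positions[j][0]
            dy = yi - positions[j][1]
            if dx * dx + dy * dy <= r2:
                counts[i] += 1
    return counts


if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
    print(count_neighbors(pts, 1.5))
```

Since each iteration of the outer loop reads shared positions but writes only its own counter, it maps naturally onto one GPU thread per document, which is the kind of parallelism a CUDA implementation can exploit.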
A Kepler Workflow Tool for Reproducible AMBER GPU Molecular Dynamics.

Science.gov (United States)

Purawat, Shweta; Ieong, Pek U; Malmstrom, Robert D; Chan, Garrett J; Yeung, Alan K; Walker, Ross C; Altintas, Ilkay; Amaro, Rommie E

2017-06-20

With the drive toward high throughput molecular dynamics (MD) simulations involving ever-greater numbers of simulation replicates run for longer, biologically relevant timescales (microseconds), the need for improved computational methods that facilitate fully automated MD workflows gains more importance. Here we report the development of an automated workflow tool to perform AMBER GPU MD simulations. Our workflow tool capitalizes on the capabilities of the Kepler platform to deliver a flexible, intuitive, and user-friendly environment and the AMBER GPU code for a robust and high-performance simulation engine. Additionally, the workflow tool reduces user input time by automating repetitive processes and facilitates access to GPU clusters, whose high-performance processing power makes simulations of large numerical scale possible. The presented workflow tool facilitates the management and deployment of large sets of MD simulations on heterogeneous computing resources. The workflow tool also performs systematic analysis on the simulation outputs and enhances simulation reproducibility, execution scalability, and MD method development including benchmarking and validation. Copyright © 2017 Biophysical Society. Published by Elsevier Inc. All rights reserved.
GRAVIDY, a GPU modular, parallel direct-summation N-body integrator: dynamics with softening

Science.gov (United States)

Maureira-Fredes, Cristián; Amaro-Seoane, Pau

2018-01-01

A wide variety of outstanding problems in astrophysics involve the motion of a large number of particles under the force of gravity. These include the global evolution of globular clusters, tidal disruptions of stars by a massive black hole, the formation of protoplanets and sources of gravitational radiation. The direct-summation of N gravitational forces is a complex problem with no analytical solution and can only be tackled with approximations and numerical methods. To this end, the Hermite scheme is a widely used integration method. With different numerical techniques and special-purpose hardware, it can be used to speed up the calculations. But these methods tend to be computationally slow and cumbersome to work with. We present a new graphics processing unit (GPU) direct-summation N-body integrator written from scratch and based on this scheme, which includes relativistic corrections for sources of gravitational radiation. GRAVIDY has high modularity, allowing users to readily introduce new physics; it exploits available computational resources and will be maintained by regular updates. GRAVIDY can be used in parallel on multiple CPUs and GPUs, with a considerable speed-up benefit. The single-GPU version is between one and two orders of magnitude faster than the single-CPU version. A test run using four GPUs in parallel shows a speed-up factor of about 3 as compared to the single-GPU version. The conception and design of this first release is aimed at users with access to traditional parallel CPU clusters or computational nodes with one or a few GPU cards.
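The direct-summation step described above can be sketched as follows: the acceleration of each particle is the sum of the softened pairwise attractions from all others. This is a minimal illustration, not GRAVIDY's code; it assumes the common Plummer-type softening, in which the squared distance is replaced by r^2 + eps^2 (the abstract does not state the exact form used), and the function name and units with G = 1 are illustrative choices.

```python
def accelerations(masses, positions, softening, G=1.0):
    """Softened direct-summation gravitational accelerations.

    a_i = G * sum_{j != i} m_j (r_j - r_i) / (|r_j - r_i|^2 + eps^2)^(3/2)
    The double loop costs O(N^2) pair evaluations per call.
    """
    n = len(masses)
    eps2 = softening * softening
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = [positions[j][k] - positions[i][k] for k in range(3)]
            r2 = d[0] * d[0] + d[1] * d[1] + d[2] * d[2] + eps2
            inv_r3 = r2 ** -1.5
            for k in range(3):
                acc[i][k] += G * masses[j] * d[k] * inv_r3
    return acc


if __name__ == "__main__":
    m = [1.0, 1.0]
    x = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    print(accelerations(m, x, softening=0.0))
```

A Hermite integrator additionally needs the time derivative of the acceleration (the jerk), computed in the same pairwise loop. For reference, the reported speed-up of about 3 on four GPUs corresponds to a parallel efficiency of roughly 3/4 = 75%.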
SU-F-T-256: 4D IMRT Planning Using An Early Prototype GPU-Enabled Eclipse Workstation

Energy Technology Data Exchange (ETDEWEB)

Hagan, A; Modiri, A; Sawant, A [University of Maryland in Baltimore, Baltimore, MD (United States); Svatos, M [Varian Medical Systems, Palo Alto, CA (United States)

2016-06-15

Purpose: True 4D IMRT planning, based on simultaneous spatiotemporal optimization, has been shown to significantly improve plan quality in lung radiotherapy. However, the high computational complexity associated with such planning represents a significant barrier to widespread clinical deployment. We introduce an early prototype GPU-enabled Eclipse workstation for inverse planning. To our knowledge, this is the first GPU-integrated Eclipse system demonstrating the potential for clinical translation of GPU computing on a major commercially-available TPS. Methods: The prototype system comprised four NVIDIA Tesla K80 GPUs, with a maximum processing capability of 8.5 Tflops per K80 card. The system architecture consisted of three key modules: (i) a GPU-based inverse planning module using a highly-parallelizable, swarm intelligence-based global optimization algorithm, (ii) a GPU-based open-source b-spline deformable image registration module, Elastix, and (iii) a CUDA-based data management module. For evaluation, aperture fluence weights in an IMRT plan were optimized over 9 beams, 166 apertures and 10 respiratory phases (14940 variables) for a lung cancer case (GTV = 95 cc, right lower lobe, 15 mm cranio-caudal motion). Sensitivity of the planning time and memory expense to parameter variations was quantified. Results: GPU-based inverse planning was significantly accelerated compared to its CPU counterpart (36 vs 488 min, for 10 phases, 10 search agents and 10 iterations). The optimized IMRT plan significantly improved OAR sparing compared to the original internal target volume (ITV)-based clinical plan, while maintaining prescribed tumor coverage. The dose-sparing improvements were: Esophagus Dmax 50%, Heart Dmax 42% and Spinal cord Dmax 25%. Conclusion: Our early prototype system demonstrates that through massive parallelization, computationally intense tasks such as 4D treatment planning can be accomplished in clinically feasible timeframes. With further
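The problem size and speed-up quoted in the abstract can be checked with a few lines of arithmetic. The sketch below assumes that the variable count is the product of beams, apertures and respiratory phases, which reproduces the stated 14940; the function names are illustrative.

```python
def n_variables(beams, apertures, phases):
    """Number of optimization variables, assumed to be a plain product."""
    return beams * apertures * phases


def speedup(cpu_minutes, gpu_minutes):
    """Ratio of CPU planning time to GPU planning time."""
    return cpu_minutes / gpu_minutes


if __name__ == "__main__":
    print(n_variables(9, 166, 10))     # 14940, as stated in the abstract
    print(round(speedup(488, 36), 1))  # 13.6
```

The reported planning times therefore correspond to a speed-up of roughly 13.6× for the GPU-based planner over its CPU counterpart.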
GPU-based fast cone beam CT reconstruction from undersampled and noisy projection data via total variation

International Nuclear Information System (INIS)

Jia Xun; Lou Yifei; Li Ruijiang; Song, William Y.; Jiang, Steve B.

2010-01-01

Purpose: Cone-beam CT (CBCT) plays an important role in image guided radiation therapy (IGRT). However, the large radiation dose from serial CBCT scans in most IGRT procedures raises a clinical concern, especially for pediatric patients who are essentially excluded from receiving IGRT for this reason. The goal of this work is to develop a fast GPU-based algorithm to reconstruct CBCT from undersampled and noisy projection data so as to lower the imaging dose. Methods: The CBCT is reconstructed by minimizing an energy functional consisting of a data fidelity term and a total variation regularization term. The authors developed a GPU-friendly version of the forward-backward splitting algorithm to solve this model. A multigrid technique is also employed. Results: It is found that 20-40 x-ray projections are sufficient to reconstruct images with satisfactory quality for IGRT. The reconstruction time ranges from 77 to 130 s on an NVIDIA Tesla C1060 (NVIDIA, Santa Clara, CA) GPU card, depending on the number of projections used, which is estimated to be about 100 times faster than similar iterative reconstruction approaches. Moreover, phantom studies indicate that the algorithm enables the CBCT to be reconstructed under a scanning protocol with as low as 0.1 mAs/projection. Compared with the currently widely used full-fan head and neck scanning protocol of ∼360 projections with 0.4 mAs/projection, it is estimated that an overall 36-72 times dose reduction has been achieved with our fast CBCT reconstruction algorithm. Conclusions: This work indicates that the developed GPU-based CBCT reconstruction algorithm is capable of lowering imaging dose considerably. The high computation efficiency of this algorithm makes the iterative CBCT reconstruction approach applicable in real clinical environments.
GPU-based fast cone beam CT reconstruction from undersampled and noisy projection data via total variation.

Science.gov (United States)

Jia, Xun; Lou, Yifei; Li, Ruijiang; Song, William Y; Jiang, Steve B

2010-04-01

Cone-beam CT (CBCT) plays an important role in image guided radiation therapy (IGRT). However, the large radiation dose from serial CBCT scans in most IGRT procedures raises a clinical concern, especially for pediatric patients who are essentially excluded from receiving IGRT for this reason. The goal of this work is to develop a fast GPU-based algorithm to reconstruct CBCT from undersampled and noisy projection data so as to lower the imaging dose. The CBCT is reconstructed by minimizing an energy functional consisting of a data fidelity term and a total variation regularization term. The authors developed a GPU-friendly version of the forward-backward splitting algorithm to solve this model. A multigrid technique is also employed. It is found that 20-40 x-ray projections are sufficient to reconstruct images with satisfactory quality for IGRT. The reconstruction time ranges from 77 to 130 s on an NVIDIA Tesla C1060 (NVIDIA, Santa Clara, CA) GPU card, depending on the number of projections used, which is estimated to be about 100 times faster than similar iterative reconstruction approaches. Moreover, phantom studies indicate that the algorithm enables the CBCT to be reconstructed under a scanning protocol with as low as 0.1 mAs/projection. Compared with the currently widely used full-fan head and neck scanning protocol of approximately 360 projections with 0.4 mAs/projection, it is estimated that an overall 36-72 times dose reduction has been achieved with our fast CBCT reconstruction algorithm. This work indicates that the developed GPU-based CBCT reconstruction algorithm is capable of lowering imaging dose considerably. The high computation efficiency of this algorithm makes the iterative CBCT reconstruction approach applicable in real clinical environments.
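The 36-72 times dose reduction quoted above follows from comparing total exposure (number of projections times mAs per projection) between the conventional protocol and the undersampled one. The short sketch below reproduces that arithmetic, under the assumption that imaging dose scales with total mAs; the function names are illustrative.

```python
def total_exposure(projections, mas_per_projection):
    """Total tube current-time product (mAs) for a scan."""
    return projections * mas_per_projection


def dose_reduction(reference_mas, reduced_mas):
    """Factor by which total exposure is reduced."""
    return reference_mas / reduced_mas


if __name__ == "__main__":
    reference = total_exposure(360, 0.4)  # conventional protocol
    for n_proj in (40, 20):               # undersampled protocols
        reduced = total_exposure(n_proj, 0.1)
        print(n_proj, round(dose_reduction(reference, reduced)))
```

With 40 projections the reduction factor is 36, and with 20 projections it is 72, matching the range given in the abstract.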
Sailfish: A flexible multi-GPU implementation of the lattice Boltzmann method

Science.gov (United States)

Januszewski, M.; Kostur, M.

2014-09-01

We present Sailfish, an open source fluid simulation package implementing the lattice Boltzmann method (LBM) on modern Graphics Processing Units (GPUs) using CUDA/OpenCL. We take a novel approach to GPU code implementation and use run-time code generation techniques and a high level programming language (Python) to achieve state of the art performance, while allowing easy experimentation with different LBM models and tuning for various types of hardware. We discuss the general design principles of the code, scaling to multiple GPUs in a distributed environment, as well as the GPU implementation and optimization of many different LBM models, both single component (BGK, MRT, ELBM) and multicomponent (Shan-Chen, free energy). The paper also presents results of performance benchmarks spanning the last three NVIDIA GPU generations (Tesla, Fermi, Kepler), which we hope will be useful for researchers working with this type of hardware and similar codes. Catalogue identifier: AETA_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AETA_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: GNU Lesser General Public License, version 3 No. of lines in distributed program, including test data, etc.: 225864 No. of bytes in distributed program, including test data, etc.: 46861049 Distribution format: tar.gz Programming language: Python, CUDA C, OpenCL. Computer: Any with an OpenCL or CUDA-compliant GPU. Operating system: No limits (tested on Linux and Mac OS X). RAM: Hundreds of megabytes to tens of gigabytes for typical cases. Classification: 12, 6.5. External routines: PyCUDA/PyOpenCL, Numpy, Mako, ZeroMQ (for multi-GPU simulations), scipy, sympy Nature of problem: GPU-accelerated simulation of single- and multi-component fluid flows. Solution method: A wide range of relaxation models (LBGK, MRT, regularized LB, ELBM, Shan-Chen, free energy, free surface) and boundary conditions within the lattice
Locality-Aware CTA Clustering For Modern GPUs

Energy Technology Data Exchange (ETDEWEB)

Li, Ang; Song, Shuaiwen; Liu, Weifeng; Liu, Xu; Kumar, Akash; Corporaal, Henk

2017-04-08

In this paper, we proposed a novel clustering technique for tapping into the performance potential of a largely ignored type of locality: inter-CTA locality. We first demonstrated the capability of the existing GPU hardware to exploit such locality, both spatially and temporally, on L1 or L1/Tex unified cache. To verify the potential of this locality, we quantified its existence in a broad spectrum of applications and discussed its sources of origin. Based on these insights, we proposed the concept of CTA-Clustering and its associated software techniques. Finally, we evaluated these techniques on all modern generations of NVIDIA GPU architectures. The experimental results showed that our proposed clustering techniques could significantly improve on-chip cache performance.
High performance image acquisition and processing architecture for fast plant system controllers based on FPGA and GPU

International Nuclear Information System (INIS)

Nieto, J.; Sanz, D.; Guillén, P.; Esquembri, S.; Arcas, G. de; Ruiz, M.; Vega, J.; Castro, R.

2016-01-01

Highlights: • To test an image acquisition and processing system for Camera Link devices based on an FPGA, compliant with ITER fast controllers. • To move data acquired from the set NI1483-NIPXIe7966R directly to an NVIDIA GPU using NVIDIA GPUDirect RDMA technology. • To obtain a methodology to include GPU processing in ITER Fast Plant Controllers, using EPICS integration through Nominal Device Support (NDS). - Abstract: The two dominant technologies used in real-time image processing are the Field Programmable Gate Array (FPGA) and the Graphical Processor Unit (GPU), owing to their algorithm parallelization capabilities. However, little work has been done to standardize how these technologies can be integrated in data acquisition systems where control and supervisory requirements are in place, such as ITER (International Thermonuclear Experimental Reactor). This work proposes an architecture and a development methodology for image acquisition and processing systems based on FPGAs and GPUs that are compliant with ITER fast controller solutions. A use case based on a Camera Link device connected to an FPGA DAQ device (National Instruments FlexRIO technology) and an NVIDIA Tesla series GPU card has been developed and tested. The proposed architecture has been designed to optimize system performance by minimizing data transfer operations and CPU intervention through the use of NVIDIA GPUDirect RDMA and DMA technologies. This allows data to be moved directly between the different hardware elements (FPGA DAQ-GPU-CPU), avoiding CPU intervention and therefore the use of intermediate CPU memory buffers. Special effort has been made to provide a development methodology that, while maintaining the highest possible abstraction from low-level implementation details, yields solutions that conform to CODAC Core System standards by providing EPICS and Nominal Device Support.
High performance image acquisition and processing architecture for fast plant system controllers based on FPGA and GPU

Energy Technology Data Exchange (ETDEWEB)

Nieto, J., E-mail: jnieto@sec.upm.es [Grupo de Investigación en Instrumentación y Acústica Aplicada, Universidad Politécnica de Madrid, Crta. Valencia Km-7, Madrid 28031 (Spain)]; Sanz, D.; Guillén, P.; Esquembri, S.; Arcas, G. de; Ruiz, M. [Grupo de Investigación en Instrumentación y Acústica Aplicada, Universidad Politécnica de Madrid, Crta. Valencia Km-7, Madrid 28031 (Spain)]; Vega, J.; Castro, R. [Asociación EURATOM/CIEMAT para Fusión, Madrid (Spain)]

2016-11-15

Highlights: • To test an image acquisition and processing system for Camera Link devices based on an FPGA, compliant with ITER fast controllers. • To move data acquired from the set NI1483-NIPXIe7966R directly to an NVIDIA GPU using NVIDIA GPUDirect RDMA technology. • To obtain a methodology to include GPU processing in ITER Fast Plant Controllers, using EPICS integration through Nominal Device Support (NDS). - Abstract: The two dominant technologies used in real-time image processing are the Field Programmable Gate Array (FPGA) and the Graphical Processor Unit (GPU), owing to their algorithm parallelization capabilities. However, little work has been done to standardize how these technologies can be integrated in data acquisition systems where control and supervisory requirements are in place, such as ITER (International Thermonuclear Experimental Reactor). This work proposes an architecture and a development methodology for image acquisition and processing systems based on FPGAs and GPUs that are compliant with ITER fast controller solutions. A use case based on a Camera Link device connected to an FPGA DAQ device (National Instruments FlexRIO technology) and an NVIDIA Tesla series GPU card has been developed and tested. The proposed architecture has been designed to optimize system performance by minimizing data transfer operations and CPU intervention through the use of NVIDIA GPUDirect RDMA and DMA technologies. This allows data to be moved directly between the different hardware elements (FPGA DAQ-GPU-CPU), avoiding CPU intervention and therefore the use of intermediate CPU memory buffers. Special effort has been made to provide a development methodology that, while maintaining the highest possible abstraction from low-level implementation details, yields solutions that conform to CODAC Core System standards by providing EPICS and Nominal Device Support.
Nell’anno di Nikola Tesla

Directory of Open Access Journals (Sweden)

Persida Lazarević Di Giacomo

2006-12-01

The 150th Anniversary Celebration of Nikola Tesla. The paper deals with the 150th anniversary celebration of Nikola Tesla, the greatest ever Yugoslav scientist. Due to his origins, Tesla is claimed by both Serbia and Croatia, but he is also considered to be an American scientist since he registered most of his patents in the USA. Although there is some dissension between Serbia and Croatia regarding the historical facts about Tesla’s life, it can be asserted that Tesla appears to be an example of collaboration among the former Yugoslav countries. Tesla, however, also deserves to be remembered as the author of autobiographical prose (My Inventions; Some Personal Recollections; A Strange Experience by Nikola Tesla) and as the protagonist of many works published in Serbo-Croatian and in English, such as Miloš Crnjanski’s drama Tesla (1969) or The Hunger and Ecstasy of Vampires (1996) by Brian M. Stableford. This fact should not be neglected and actually should be researched more fully.
Integrating the Nqueens Algorithm into a Parameterized Benchmark Suite

Science.gov (United States)

2016-02-01

FOB is a 64-node heterogeneous cluster consisting of 16-IBM dx360M4 nodes, each with one NVIDIA Kepler K20M GPUs and 48-IBM dx360M4 nodes, and each...nodes have 256-GB of memory and an NVIDIA Tesla K40 GPU. More details on Excalibur can be found on the US Army DSRC website.19 Figures 3 and 4 show the
[70 years of Nikola Tesla studies].

Science.gov (United States)

Juznic, Stanislav

2013-01-01

Nikola Tesla's studies of chemistry are described, including his not very scholarly affair in Maribor. After almost a century and a half of hypotheses, an at least usable scenario of Tesla's life and "work" in Maribor is provided. The chemistry achievements of Tesla's most influential professors, Martin Sekulić and his Graz professors, are put into the limelight. Attention is drawn to the fact that in Graz Tesla studied at the technological chemistry faculty of the Polytechnic.
The TESLA RF System

International Nuclear Information System (INIS)

Choroba, S.

2003-01-01

The TESLA project proposed by the TESLA collaboration in 2001 is a 500 to 800 GeV e+/e- linear collider with an integrated free electron laser facility. The accelerator is based on superconducting cavity technology. Approximately 20000 superconducting cavities operated at 1.3 GHz with a gradient of 23.4 MV/m or 35 MV/m will be required to achieve the energy of 500 GeV or 800 GeV, respectively. For 500 GeV, about 600 RF stations, each generating 10 MW of RF power at 1.3 GHz at a pulse duration of 1.37 ms and a repetition rate of 5 or 10 Hz, are required. The original TESLA design was modified in 2002 and now includes a dedicated 20 GeV electron accelerator in a separate tunnel for free electron laser application. The TESLA XFEL will provide XFEL radiation of unprecedented peak brilliance and full transverse coherence in the wavelength range of 0.1 to 6.4 nm at a pulse duration of 100 fs. The technology of both accelerators, the TESLA linear collider and the XFEL, will be identical; however, the number of superconducting cavities and RF stations for the XFEL will be reduced to 936 and 26, respectively. This paper describes the layout of the entire RF system of the TESLA linear collider and the TESLA XFEL and gives an overview of its various subsystems and components.
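As a rough worked example of what the quoted RF figures imply, the average RF power per station follows from the peak power, pulse duration and repetition rate. This is only a sketch assuming idealised rectangular pulses at the full 10 MW; the resulting averages are derived here and are not stated in the abstract.

```latex
\bar{P} = P_{\text{peak}}\,\tau\,f_{\text{rep}}
\qquad\Rightarrow\qquad
\bar{P}_{5\,\text{Hz}} = 10\,\text{MW}\times 1.37\times 10^{-3}\,\text{s}\times 5\,\text{s}^{-1} = 68.5\,\text{kW},
\qquad
\bar{P}_{10\,\text{Hz}} = 10\,\text{MW}\times 1.37\times 10^{-3}\,\text{s}\times 10\,\text{s}^{-1} = 137\,\text{kW}
```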
GPU PRO 3 Advanced rendering techniques

CERN Document Server

Engel, Wolfgang

2012-01-01

GPU Pro3, the third volume in the GPU Pro book series, offers practical tips and techniques for creating real-time graphics that are useful to beginners and seasoned game and graphics programmers alike. Section editors Wolfgang Engel, Christopher Oat, Carsten Dachsbacher, Wessam Bahnassi, and Sebastien St-Laurent have once again brought together a high-quality collection of cutting-edge techniques for advanced GPU programming. With contributions by more than 50 experts, GPU Pro3: Advanced Rendering Techniques covers battle-tested tips and tricks for creating interesting geometry, realistic sha

Nell’anno di Nikola Tesla

OpenAIRE

Persida Lazarević Di Giacomo

2006-01-01

The 150th Anniversary Celebration of Nikola Tesla. The paper deals with the 150th anniversary celebration of Nikola Tesla, the greatest ever Yugoslav scientist. Due to his origins, Tesla is claimed by both Serbia and Croatia, but he is also considered to be an American scientist since he registered most of his patents in the USA. Although there is some dissension between Serbia and Croatia regarding the historical facts about Tesla’s life, it can be asserted that Tesla appears to be an...
Simultaneous Range-Velocity Processing and SNR Analysis of AFIT’s Random Noise Radar

Science.gov (United States)

2012-03-22

reducing the overall processing time. Two computers, equipped with NVIDIA® GPUs, were used to process the collected data. The specifications for each...gather the results back to the CPU. Another company, AccelerEyes®, has developed a product called Jacket® that claims to be better than the parallel...Number of Processing Cores: 4, 8; Processor Speed: 3.33 GHz, 3.07 GHz; Installed Memory: 48 GB, 48 GB; GPU Make: NVIDIA, NVIDIA; GPU Model: Tesla 1060, Tesla C2070; GPU
A GPU accelerated and error-controlled solver for the unbounded Poisson equation in three dimensions

Science.gov (United States)

Exl, Lukas

2017-12-01

An efficient solver for the three dimensional free-space Poisson equation is presented. The underlying numerical method is based on finite Fourier series approximation. While the error of all involved approximations can be fully controlled, the overall computation error is driven by the convergence of the finite Fourier series of the density. For smooth and fast-decaying densities the proposed method will be spectrally accurate. The method scales with O(N log N) operations, where N is the total number of discretization points in the Cartesian grid. The majority of the computational costs come from fast Fourier transforms (FFT), which makes it ideal for GPU computation. Several numerical computations on CPU and GPU validate the method and show efficiency and convergence behavior. Tests are performed using the Vienna Scientific Cluster 3 (VSC3). A free MATLAB implementation for CPU and GPU is provided to the interested community.
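To illustrate why FFTs dominate the cost of a Fourier-based Poisson solve, here is a minimal sketch in plain Python. It is not the author's method: it swaps in the simpler periodic, one-dimensional problem -u'' = rho (the abstract treats the unbounded three-dimensional case), and the function names `fft` and `solve_poisson_periodic_1d` are hypothetical. The solve is two transforms, each O(N log N), plus an O(N) division in Fourier space.

```python
import cmath
import math


def fft(values, inverse=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out  # unnormalised: the caller divides by n after an inverse transform


def solve_poisson_periodic_1d(rho, length):
    """Solve -u'' = rho on a periodic interval and return the zero-mean u.

    Assumption for this sketch: rho is sampled on a uniform grid of
    power-of-two size and has zero mean.
    """
    n = len(rho)
    rho_hat = fft([complex(r) for r in rho])
    u_hat = [0j] * n  # the m = 0 (mean) mode stays zero
    for m in range(1, n):
        # Map the FFT index to a signed wavenumber.
        mode = m if m <= n // 2 else m - n
        k = 2.0 * math.pi * mode / length
        u_hat[m] = rho_hat[m] / (k * k)  # -u'' = rho  <=>  k^2 u_hat = rho_hat
    u = fft(u_hat, inverse=True)
    return [value.real / n for value in u]
```

For a smooth periodic density such as rho = 9 sin(3x) on [0, 2π), the sketch recovers u = sin(3x) to rounding error, which mirrors the spectral accuracy the abstract describes for smooth, fast-decaying densities.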
Implementation of EAM and FS potentials in HOOMD-blue

Science.gov (United States)

Yang, Lin; Zhang, Feng; Travesset, Alex; Wang, Caizhuang; Ho, Kaiming

HOOMD-blue is a general-purpose software to perform classical molecular dynamics simulations entirely on GPUs. We provide full support for EAM and FS type potentials in HOOMD-blue, and report accuracy and efficiency benchmarks, including comparisons with the LAMMPS GPU package. Two problems were selected to test the accuracy: the determination of the glass transition temperature of Cu64.5Zr35.5 alloy using an FS potential and the calculation of pair distribution functions of Ni3Al using an EAM potential. In both cases, the results using HOOMD-blue are indistinguishable from those obtained by the GPU package in LAMMPS within statistical uncertainties. As tests for time efficiency, we benchmark time-steps per second using LAMMPS GPU and HOOMD-blue on one NVIDIA Tesla GPU. Compared to our typical LAMMPS simulations on one CPU cluster node which has 16 CPUs, LAMMPS GPU can be 3-3.5 times faster, and HOOMD-blue can be 4-5.5 times faster. We acknowledge the support from Laboratory Directed Research and Development (LDRD) of Ames Laboratory.
Reliability Lessons Learned From GPU Experience With The Titan Supercomputer at Oak Ridge Leadership Computing Facility

Energy Technology Data Exchange (ETDEWEB)

Gallarno, George [Christian Brothers University; Rogers, James H [ORNL; Maxwell, Don E [ORNL

2015-01-01

The high computational capability of graphics processing units (GPUs) is enabling and driving the scientific discovery process at large-scale. The world's second fastest supercomputer for open science, Titan, has more than 18,000 GPUs that computational scientists use to perform scientific simulations and data analysis. Understanding of GPU reliability characteristics, however, is still in its nascent stage since GPUs have only recently been deployed at large-scale. This paper presents a detailed study of GPU errors and their impact on system operations and applications, describing experiences with the 18,688 GPUs on the Titan supercomputer as well as lessons learned in the process of efficient operation of GPUs at scale. These experiences are helpful to HPC sites which already have large-scale GPU clusters or plan to deploy GPUs in the future.
Linear Colliders TESLA

International Nuclear Information System (INIS)

Anon.

1994-01-01

The aim of the TESLA (TeV Superconducting Linear Accelerator) collaboration (at present 19 institutions from seven countries) is to establish the technology for a high energy electron-positron linear collider using superconducting radiofrequency cavities to accelerate its beams. Another basic goal is to demonstrate that such a collider can meet its performance goals in a cost effective manner. For this the TESLA collaboration is preparing a 500 MeV superconducting linear test accelerator at the DESY Laboratory in Hamburg. This TTF (TESLA Test Facility) consists of four cryomodules, each approximately 12 m long and containing eight 9-cell solid niobium cavities operating at a frequency of 1.3 GHz
GPU Computing Gems Emerald Edition

CERN Document Server

Hwu, Wen-mei W

2011-01-01

".the perfect companion to Programming Massively Parallel Processors by Hwu & Kirk." -Nicolas Pinto, Research Scientist at Harvard & MIT, NVIDIA Fellow 2009-2010 Graphics processing units (GPUs) can do much more than render graphics. Scientists and researchers increasingly look to GPUs to improve the efficiency and performance of computationally-intensive experiments across a range of disciplines. GPU Computing Gems: Emerald Edition brings their techniques to you, showcasing GPU-based solutions including: Black hole simulations with CUDA GPU-accelerated computation and interactive display of
GPU-Accelerated Large-Scale Electronic Structure Theory on Titan with a First-Principles All-Electron Code

Science.gov (United States)

Huhn, William Paul; Lange, Björn; Yu, Victor; Blum, Volker; Lee, Seyong; Yoon, Mina

Density-functional theory has been well established as the dominant quantum-mechanical computational method in the materials community. Large accurate simulations become very challenging on small to mid-scale computers and require high-performance compute platforms to succeed. GPU acceleration is one promising approach. In this talk, we present a first implementation of all-electron density-functional theory in the FHI-aims code for massively parallel GPU-based platforms. Special attention is paid to the update of the density and to the integration of the Hamiltonian and overlap matrices, realized in a domain decomposition scheme on non-uniform grids. The initial implementation scales well across nodes on ORNL's Titan Cray XK7 supercomputer (8 to 64 nodes, 16 MPI ranks/node) and shows an overall speedup in runtime of 1.4x due to utilization of the K20X Tesla GPUs on each Titan node, with the charge density update showing a speedup of 2x. Further acceleration opportunities will be discussed. Work supported by the LDRD Program of ORNL managed by UT-Battelle, LLC, for the U.S. DOE and by the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.
Acceleration of PIC simulation with GPU

International Nuclear Information System (INIS)

Suzuki, Junya; Shimazu, Hironori; Fukazawa, Keiichiro; Den, Mitsue

2011-01-01

Particle-in-cell (PIC) is a simulation technique for plasma physics. The large number of particles in high-resolution plasma simulation increases the volume of computation required, making it vital to increase computation speed. In this study, we attempt to accelerate computation on graphics processing units (GPUs) using KEMPO, a PIC simulation code package. We perform two benchmark tests, with small and large grid sizes. In these tests, we run the KEMPO1 code using a CPU only, both a CPU and a GPU, and a GPU only. The results showed that performance using only a GPU was twice that of using a CPU alone. However, the execution time when using both a CPU and a GPU is comparable to that with a CPU alone, because of the significant bottleneck in communication between the CPU and GPU. (author)
Application of GPU to computational multiphase fluid dynamics

International Nuclear Information System (INIS)

Nagatake, T; Kunugi, T

2010-01-01

MARS (Multi-interfaces Advection and Reconstruction Solver) [1] is one of the surface volume tracking methods for multi-phase flows. Nowadays, the performance of the GPU (Graphics Processing Unit) is much higher than that of the CPU (Central Processing Unit). In this study, the GPU was applied to MARS in order to accelerate the computation of multi-phase flows (GPU-MARS), and the performance of GPU-MARS is discussed. For the interface tracking method applied to a one-directional advection problem, the computation on the GPU (a single GTX280) was around 4 times faster than on the CPU (Xeon 5040, parallelized over 4 threads). For the Poisson solver using the algorithm developed in this study, the GPU was around 30 times faster than the CPU. Finally, it is confirmed that the GPU greatly accelerates the fluid flow computation (GPU-MARS) compared to the CPU. However, it is also found that the double-precision computation of the GPU must perform with very high precision.
Accelerating solidification process simulation for large-sized system of liquid metal atoms using GPU with CUDA

Energy Technology Data Exchange (ETDEWEB)

Jie, Liang [School of Information Science and Engineering, Hunan University, Changsha, 410082 (China)]; Li, KenLi, E-mail: lkl@hnu.edu.cn [School of Information Science and Engineering, Hunan University, Changsha, 410082 (China); National Supercomputing Center in Changsha, 410082 (China)]; Shi, Lin [School of Information Science and Engineering, Hunan University, Changsha, 410082 (China)]; Liu, RangSu [School of Physics and Micro Electronic, Hunan University, Changsha, 410082 (China)]; Mei, Jing [School of Information Science and Engineering, Hunan University, Changsha, 410082 (China)]

2014-01-15

Molecular dynamics (MD) simulation is a powerful tool for simulating and analyzing complex physical processes and phenomena at the atomic scale and for predicting the natural time evolution of a system of atoms. Precise simulation of physical processes places strong requirements on both simulation size and computing timescale, so finding available computing resources is crucial to accelerating computation. Recently, graphics processors (GPGPU) have emerged as a tremendous computational resource for general-purpose computing owing to their high floating-point performance, wide memory bandwidth and enhanced programmability. Targeting the most time-consuming component of MD simulations of liquid metal solidification, this paper presents a fine-grained spatial decomposition method that accelerates the neighbor-list update and the interaction force calculation by taking advantage of modern graphics processing units (GPUs), enlarging the simulated system to 10 000 000 atoms. In addition, a number of evaluations and tests are discussed, ranging from executions with CUDA versions with different precision enabled, over various types of GPU (NVIDIA 480GTX, 580GTX and M2050), to CPU clusters with different numbers of CPU cores. The experimental results demonstrate that GPU-based calculations are typically 9∼11 times faster than the corresponding sequential execution and approximately 1.5∼2 times faster than implementations on 16-core CPU clusters. On the basis of the simulated results, the theoretical and experimental results are compared; good agreement between the two is observed, together with more complete and larger cluster structures in the actual macroscopic materials. Moreover, different nucleation and evolution mechanism of nano-clusters and nano-crystals formed in the processes of metal solidification is observed with large
k-t SENSE-accelerated Myocardial Perfusion MR Imaging at 3.0 Tesla - comparison with 1.5 Tesla

Science.gov (United States)

Plein, Sven; Schwitter, Juerg; Suerder, Daniel; Greenwood, John P.; Boesiger, Peter; Kozerke, Sebastian

2008-01-01

Purpose: To determine the feasibility and diagnostic accuracy of high spatial resolution myocardial perfusion MR at 3.0 Tesla using k-space and time domain undersampling with sensitivity encoding (k-t SENSE). Materials and Methods: The study was reviewed and approved by the local ethic review board. k-t SENSE perfusion MR was performed at 1.5 Tesla and 3.0 Tesla (saturation recovery gradient echo pulse sequence, repetition time/echo time 3.0 ms/1.0 ms, flip angle 15°, 5x k-t SENSE acceleration, spatial resolution 1.3×1.3×10 mm³). Fourteen volunteers were studied at rest and 37 patients during adenosine stress. In volunteers, comparison was also made with standard-resolution (2.5×2.5×10 mm³) 2x SENSE perfusion MR at 3.0 Tesla. Image quality, artifact scores, signal-to-noise ratios (SNR) and contrast-enhancement ratios (CER) were derived. In patients, diagnostic accuracy of visual analysis to detect >50% diameter stenosis on quantitative coronary angiography was determined by receiver-operator-characteristics (ROC). Results: In volunteers, image quality and artifact scores were similar for 3.0 Tesla and 1.5 Tesla, while SNR was higher (11.6 vs. 5.6) and CER lower (1.1 vs. 1.5, p=0.012) at 3.0 Tesla. Compared with standard-resolution perfusion MR, image quality was higher for k-t SENSE (3.6 vs. 3.1, p=0.04), endocardial dark rim artifacts were reduced (artifact thickness 1.6 mm vs. 2.4 mm, pTesla and 1.5 Tesla, respectively. Conclusions: k-t SENSE accelerated high-resolution perfusion MR at 3.0 Tesla is feasible with similar artifacts and diagnostic accuracy as at 1.5 Tesla. Compared with standard-resolution perfusion MR, image quality is improved and artifacts are reduced. PMID:18936311
TESLA project goes public

International Nuclear Information System (INIS)

Flegel', I.

2002-01-01

The TESLA project, which concerns the creation of a superconducting linear accelerator with colliding electron and positron beams at the DESY Laboratory (Hamburg), is presented. Scientists from 36 countries are contributing to the feasibility study for construction of the new accelerator. The new accelerator will open the way to the investigation of new elementary particles; TESLA is perfectly suited to the production of Higgs particles. Precise measurements at the facility will make it possible to study the properties of supersymmetric particles. The TESLA project includes the creation of an X-ray free electron laser [ru
SU-E-T-36: A GPU-Accelerated Monte-Carlo Dose Calculation Platform and Its Application Toward Validating a ViewRay Beam Model

Energy Technology Data Exchange (ETDEWEB)

Wang, Y; Mazur, T; Green, O; Hu, Y; Wooten, H; Yang, D; Zhao, T; Mutic, S; Li, H [Washington University School of Medicine, St. Louis, MO (United States)

2015-06-15

Purpose: To build a fast, accurate and easily-deployable research platform for Monte-Carlo dose calculations. We port the dose calculation engine PENELOPE to C++, and accelerate calculations using GPU acceleration. Simulations of a Co-60 beam model provided by ViewRay demonstrate the capabilities of the platform. Methods: We built software that incorporates a beam model interface, CT-phantom model, GPU-accelerated PENELOPE engine, and GUI front-end. We rewrote the PENELOPE kernel in C++ (from Fortran) and accelerated the code on a GPU. We seamlessly integrated a Co-60 beam model (obtained from ViewRay) into our platform. Simulations of various field sizes and SSDs using a homogeneous water phantom generated PDDs, dose profiles, and output factors that were compared to experiment data. Results: With GPU acceleration using a dated graphics card (Nvidia Tesla C2050), a highly accurate simulation – including 100*100*100 grid, 3×3×3 mm3 voxels, <1% uncertainty, and 4.2×4.2 cm2 field size – runs 24 times faster (20 minutes versus 8 hours) than when parallelizing on 8 threads across a new CPU (Intel i7-4770). Simulated PDDs, profiles and output ratios for the commercial system agree well with experiment data measured using radiographic film or ionization chamber. Based on our analysis, this beam model is precise enough for general applications. Conclusions: Using a beam model for a Co-60 system provided by ViewRay, we evaluate a dose calculation platform that we developed. Comparison to measurements demonstrates the promise of our software for use as a research platform for dose calculations, with applications including quality assurance and treatment plan verification.
SU-E-T-36: A GPU-Accelerated Monte-Carlo Dose Calculation Platform and Its Application Toward Validating a ViewRay Beam Model

International Nuclear Information System (INIS)

Wang, Y; Mazur, T; Green, O; Hu, Y; Wooten, H; Yang, D; Zhao, T; Mutic, S; Li, H

2015-01-01

Purpose: To build a fast, accurate and easily-deployable research platform for Monte-Carlo dose calculations. We port the dose calculation engine PENELOPE to C++, and accelerate calculations using GPU acceleration. Simulations of a Co-60 beam model provided by ViewRay demonstrate the capabilities of the platform. Methods: We built software that incorporates a beam model interface, CT-phantom model, GPU-accelerated PENELOPE engine, and GUI front-end. We rewrote the PENELOPE kernel in C++ (from Fortran) and accelerated the code on a GPU. We seamlessly integrated a Co-60 beam model (obtained from ViewRay) into our platform. Simulations of various field sizes and SSDs using a homogeneous water phantom generated PDDs, dose profiles, and output factors that were compared to experiment data. Results: With GPU acceleration using a dated graphics card (Nvidia Tesla C2050), a highly accurate simulation – including 100*100*100 grid, 3×3×3 mm3 voxels, <1% uncertainty, and 4.2×4.2 cm2 field size – runs 24 times faster (20 minutes versus 8 hours) than when parallelizing on 8 threads across a new CPU (Intel i7-4770). Simulated PDDs, profiles and output ratios for the commercial system agree well with experiment data measured using radiographic film or ionization chamber. Based on our analysis, this beam model is precise enough for general applications. Conclusions: Using a beam model for a Co-60 system provided by ViewRay, we evaluate a dose calculation platform that we developed. Comparison to measurements demonstrates the promise of our software for use as a research platform for dose calculations, with applications including quality assurance and treatment plan verification
Travel Software using GPU Hardware

CERN Document Server

Szalwinski, Chris M; Dimov, Veliko Atanasov; CERN. Geneva. ATS Department

2015-01-01

Travel is the main multi-particle tracking code used at CERN for beam dynamics calculations in hadron and ion linear accelerators. It uses two routines for the calculation of space charge forces, namely rings of charges and point-to-point. This report presents studies to improve the performance of Travel using GPU hardware. The studies showed that Travel simulations using the point-to-point space-charge routine can be sped up by a factor of at least 72 using current GPU hardware. Simple recompilation of the source code with an Intel compiler can improve performance by a factor of at least 4 without GPU support. The limited memory of the GPU is the bottleneck. Two algorithms were investigated to address this: repeated computation and tiling. The repeated computation algorithm is simpler and is the currently recommended solution. The tiling algorithm was more complicated and degraded performance. Both build and test instructions for the parallelized version of the software are inclu...
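The loop structure behind "tiling" a point-to-point sum can be sketched as follows. This is a plain-Python illustration under stated assumptions, not Travel's implementation: unit charges, all physical constants omitted, and the hypothetical names `direct_field` and `tiled_field`. The tiled variant visits the source particles one fixed-size block at a time, which on a GPU would correspond to staging a block in limited fast memory; it says nothing about performance, which the report found to degrade with tiling.

```python
def direct_field(positions):
    """Reference O(n^2) sum: field at each particle from all other particles."""
    field = []
    for i, (xi, yi, zi) in enumerate(positions):
        fx = fy = fz = 0.0
        for j, (xj, yj, zj) in enumerate(positions):
            if i == j:
                continue  # a particle exerts no force on itself
            dx, dy, dz = xi - xj, yi - yj, zi - zj
            inv_r3 = (dx * dx + dy * dy + dz * dz) ** -1.5
            fx += dx * inv_r3
            fy += dy * inv_r3
            fz += dz * inv_r3
        field.append((fx, fy, fz))
    return field


def tiled_field(positions, tile_size):
    """Same sum, with source particles processed one tile at a time."""
    n = len(positions)
    field = [[0.0, 0.0, 0.0] for _ in range(n)]
    for start in range(0, n, tile_size):
        # Only this slice of sources needs to be resident at once.
        tile = positions[start:start + tile_size]
        for i, (xi, yi, zi) in enumerate(positions):
            for offset, (xj, yj, zj) in enumerate(tile):
                if start + offset == i:
                    continue  # skip the self-interaction
                dx, dy, dz = xi - xj, yi - yj, zi - zj
                inv_r3 = (dx * dx + dy * dy + dz * dz) ** -1.5
                field[i][0] += dx * inv_r3
                field[i][1] += dy * inv_r3
                field[i][2] += dz * inv_r3
    return [tuple(components) for components in field]
```

Both functions accumulate the same pairwise terms, so their results agree to rounding error for any tile size of at least 1.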
Nikola Tesla: een biografie

NARCIS (Netherlands)

ir.ing. Ruud Thelosen

2015-01-01

The technical inventor Tesla had 700 patents to his name when he died. His inventions completely changed the world in the 20th century. His alternating-current generators supplied the whole of the USA with electricity. Radio communication and remote control became possible thanks to Tesla. According to
GENIE: a software package for gene-gene interaction analysis in genetic association studies using multiple GPU or CPU cores

Directory of Open Access Journals (Sweden)

Wang Kai

2011-05-01

Abstract. Background: Gene-gene interaction analysis in genetic association studies is computationally intensive when a large number of SNPs are involved. Most of the latest Central Processing Units (CPUs) have multiple cores, whereas Graphics Processing Units (GPUs) have hundreds of cores and have recently been used to implement faster scientific software. However, currently there are no genetic analysis software packages that allow users to fully utilize the computing power of these multi-core devices for genetic interaction analysis for binary traits. Findings: Here we present a novel software package, GENIE, which utilizes the power of multiple GPU or CPU processor cores to parallelize the interaction analysis. GENIE reads an entire genetic association study dataset into memory and partitions the dataset into fragments with non-overlapping sets of SNPs. For each fragment, GENIE analyzes (1) the interaction of SNPs within it in parallel, and (2) the interaction between the SNPs of the current fragment and other fragments in parallel. We tested GENIE on a large-scale candidate gene study on high-density lipoprotein cholesterol. Using an NVIDIA Tesla C1060 graphics card, the GPU mode of GENIE achieves a speedup of 27 times over its single-core CPU mode run. Conclusions: GENIE is open-source, economical, user-friendly, and scalable. Since the computing power and memory capacity of graphics cards are increasing rapidly while their cost is going down, we anticipate that GENIE will achieve greater speedups with faster GPU cards. Documentation, source code, and precompiled binaries can be downloaded from http://www.cceb.upenn.edu/~mli/software/GENIE/.
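The fragment scheme described above can be sketched in a few lines of Python. This is an illustration only, not GENIE's code: the function names `partition_snps` and `fragment_tasks` are hypothetical, fragments are assumed to be consecutive slices of the SNP list, and cross-fragment pairs are assumed to be generated only against later fragments so that each SNP pair is analysed exactly once. Each yielded list is an independent work unit that could be handed to a separate GPU block or CPU core.

```python
from itertools import combinations


def partition_snps(snp_ids, fragment_size):
    """Split the SNP list into consecutive, non-overlapping fragments."""
    return [snp_ids[i:i + fragment_size]
            for i in range(0, len(snp_ids), fragment_size)]


def fragment_tasks(fragments):
    """Yield one independent list of SNP pairs per fragment or fragment pair."""
    for index, current in enumerate(fragments):
        # (1) interactions between SNPs inside the current fragment
        yield list(combinations(current, 2))
        # (2) interactions between the current fragment and each later fragment
        for other in fragments[index + 1:]:
            yield [(a, b) for a in current for b in other]
```

With 10 SNPs and a fragment size of 4, this produces three fragments and six work units that together cover all 45 SNP pairs without repetition.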
Analysis and Implementation of Particle-to-Particle (P2P) Graphics Processor Unit (GPU) Kernel for Black-Box Adaptive Fast Multipole Method

Science.gov (United States)

2015-06-01

implementation of the direct interaction called particle-to-particle kernel for a shared-memory single GPU device using the Compute Unified Device Architecture ...GPU-defined P2P kernel we developed using the Compute Unified Device Architecture (CUDA).9 A brief outline of the rest of this work follows. The...Employed The computing environment used for this work is a 64-node heterogeneous cluster consisting of 48 IBM dx360M4 nodes, each with one Intel Phi
Graphics Processing Units (GPU) and the Goddard Earth Observing System atmospheric model (GEOS-5): Implementation and Potential Applications

Science.gov (United States)

Putnam, William M.

2011-01-01

Earth system models like the Goddard Earth Observing System model (GEOS-5) have been pushing the limits of large clusters of multi-core microprocessors, producing breath-taking fidelity in resolving cloud systems at a global scale. GPU computing presents an opportunity for improving the efficiency of these leading edge models. A GPU implementation of GEOS-5 will facilitate the use of cloud-system resolving resolutions in data assimilation and weather prediction, at resolutions near 3.5 km, improving our ability to extract detailed information from high-resolution satellite observations and ultimately produce better weather and climate predictions

Tesla - A Flash of a Genius

Science.gov (United States)

Teodorani, M.

2005-10-01

This book, which is entirely dedicated to the inventions of scientist Nikola Tesla, is divided into three parts: a) all the most important innovative technological creations, from alternating current to the death ray, Tesla's research in fundamental physics with particular attention to the concept of "ether", and ball lightning physics; b) the life and the bright mind of Nikola Tesla and the reasons why some of his most recent findings were not accepted by the establishment; c) a critical discussion of the most important work by Tesla followers.
A GPU-accelerated semi-implicit fractional step method for numerical solutions of incompressible Navier-Stokes equations

Science.gov (United States)

Ha, Sanghyun; Park, Junshin; You, Donghyun

2017-11-01

The utility of the computational power of modern Graphics Processing Units (GPUs) is examined for solutions of incompressible Navier-Stokes equations which are integrated using a semi-implicit fractional-step method. Due to its serial and bandwidth-bound nature, the present choice of numerical methods is considered to be a good candidate for evaluating the potential of GPUs for solving Navier-Stokes equations using non-explicit time integration. An efficient algorithm is presented for GPU acceleration of the Alternating Direction Implicit (ADI) and the Fourier-transform-based direct solution method used in the semi-implicit fractional-step method. OpenMP is employed for concurrent collection of turbulence statistics on a CPU while Navier-Stokes equations are computed on a GPU. Extension to multiple NVIDIA GPUs is implemented using NVLink supported by the Pascal architecture. Performance of the present method is evaluated on multiple Tesla P100 GPUs and compared with a single-core Xeon E5-2650 v4 CPU in simulations of boundary-layer flow over a flat plate. Supported by the National Research Foundation of Korea (NRF) Grant funded by the Korea government (Ministry of Science, ICT and Future Planning NRF-2016R1E1A2A01939553, NRF-2014R1A2A1A11049599, and Ministry of Trade, Industry and Energy 201611101000230).
TESLA superconducting RF cavity development

International Nuclear Information System (INIS)

Koepke, K.

1995-01-01

The TESLA collaboration has made steady progress since its first official meeting at Cornell in 1990. The infrastructure necessary to assemble and test superconducting rf cavities has been installed at the TESLA Test Facility (TTF) at DESY. 5-cell, 1.3 GHz cavities have been fabricated and have reached accelerating fields of 25 MV/m. Full sized 9-cell copper cavities of TESLA geometry have been measured to verify the higher order modes present and to evaluate HOM coupling designs. The design of the TESLA 9-cell cavity has been finalized and industry has started delivery. Two prototype 9-cell niobium cavities in their first tests have reached accelerating fields of 10 MV/m and 15 MV/m in a vertical dewar after high peak power (HPP) conditioning. The first 12 m TESLA cryomodule that will house 8 9-cell cavities is scheduled to be delivered in Spring 1995. A design report for the TTF is in progress. The TTF test linac is scheduled to be commissioned in 1996/1997. (orig.)
Flocking-based Document Clustering on the Graphics Processing Unit

Energy Technology Data Exchange (ETDEWEB)

Cui, Xiaohui [ORNL; Potok, Thomas E [ORNL; Patton, Robert M [ORNL; ST Charles, Jesse Lee [ORNL

2008-01-01

Abstract: Analyzing and grouping documents by content is a complex problem. One explored method of solving this problem borrows from nature, imitating the flocking behavior of birds. Each bird represents a single document and flies toward other documents that are similar to it. One limitation of this method of document clustering is its complexity, O(n^2). As the number of documents grows, it becomes increasingly difficult to receive results in a reasonable amount of time. However, flocking behavior, along with most naturally inspired algorithms such as ant colony optimization and particle swarm optimization, is highly parallel and has found increased performance on expensive cluster computers. In the last few years, the graphics processing unit (GPU) has received attention for its ability to solve highly-parallel and semi-parallel problems much faster than the traditional sequential processor. Some applications see a huge increase in performance on this new platform. The cost of these high-performance devices is also marginal when compared with the price of cluster machines. In this paper, we have conducted research to exploit this architecture and apply its strengths to the document flocking problem. Our results highlight the potential benefit the GPU brings to all naturally inspired algorithms. Using the CUDA platform from NVIDIA, we developed a document flocking implementation to be run on the NVIDIA GeForce 8800. Additionally, we developed a similar but sequential implementation of the same algorithm to be run on a desktop CPU. We tested the performance of each on groups of news articles ranging in size from 200 to 3000 documents. The results of these tests were very significant. Performance gains ranged from three to nearly five times improvement of the GPU over the CPU implementation. This dramatic improvement in runtime makes the GPU a potentially revolutionary platform for document clustering algorithms.
High Performance Computation of a Jet in Crossflow by Lattice Boltzmann Based Parallel Direct Numerical Simulation

Directory of Open Access Journals (Sweden)

Jiang Lei

2015-01-01

Direct numerical simulation (DNS) of a round jet in crossflow based on the lattice Boltzmann method (LBM) is carried out on a multi-GPU cluster. The data-parallel SIMT (single instruction multiple thread) characteristic of the GPU matches the parallelism of LBM well, which leads to the high efficiency of the GPU on the LBM solver. With the present GPU settings (6 Nvidia Tesla K20M), the present DNS simulation can be completed in several hours. A grid system of 1.5 × 10^8 is adopted and the largest jet Reynolds number reaches 3000. The jet-to-free-stream velocity ratio is set as 3.3. The jet is orthogonal to the mainstream flow direction. The validated code shows good agreement with experiments. Vortical structures of CRVP, shear-layer vortices and horseshoe vortices are presented and analyzed based on velocity fields and vorticity distributions. Turbulent statistical quantities of Reynolds stress are also displayed. Coherent structures are revealed in a very fine resolution based on the second invariant of the velocity gradients.
MILC Code Performance on High End CPU and GPU Supercomputer Clusters

Science.gov (United States)

DeTar, Carleton; Gottlieb, Steven; Li, Ruizi; Toussaint, Doug

2018-03-01

With recent developments in parallel supercomputing architecture, many core, multi-core, and GPU processors are now commonplace, resulting in more levels of parallelism, memory hierarchy, and programming complexity. It has been necessary to adapt the MILC code to these new processors starting with NVIDIA GPUs, and more recently, the Intel Xeon Phi processors. We report on our efforts to port and optimize our code for the Intel Knights Landing architecture. We consider performance of the MILC code with MPI and OpenMP, and optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on the staggered conjugate gradient and gauge force. We also consider performance on recent NVIDIA GPUs using the QUDA library.
MILC Code Performance on High End CPU and GPU Supercomputer Clusters

Directory of Open Access Journals (Sweden)

DeTar Carleton

2018-01-01

With recent developments in parallel supercomputing architecture, many core, multi-core, and GPU processors are now commonplace, resulting in more levels of parallelism, memory hierarchy, and programming complexity. It has been necessary to adapt the MILC code to these new processors starting with NVIDIA GPUs, and more recently, the Intel Xeon Phi processors. We report on our efforts to port and optimize our code for the Intel Knights Landing architecture. We consider performance of the MILC code with MPI and OpenMP, and optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on the staggered conjugate gradient and gauge force. We also consider performance on recent NVIDIA GPUs using the QUDA library.
Parallelization and checkpointing of GPU applications through program transformation

Energy Technology Data Exchange (ETDEWEB)

Solano-Quinde, Lizandro Damian [Iowa State Univ., Ames, IA (United States)

2012-01-01

GPUs have emerged as a powerful tool for accelerating general-purpose applications. The availability of programming languages that make writing general-purpose applications for GPUs tractable has consolidated GPUs as an alternative for accelerating general-purpose applications. Among the areas that have benefited from GPU acceleration are: signal and image processing, computational fluid dynamics, quantum chemistry, and, in general, the High Performance Computing (HPC) industry. In order to continue to exploit higher levels of parallelism with GPUs, multi-GPU systems are gaining popularity. In this context, single-GPU applications are parallelized for running on multi-GPU systems. Furthermore, multi-GPU systems help to solve the GPU memory limitation for applications with a large memory footprint. Parallelizing single-GPU applications has been approached by libraries that distribute the workload at runtime; however, they impose execution overhead and are not portable. On the other hand, on traditional CPU systems, parallelization has been approached through application transformation at pre-compile time, which enhances the application to distribute the workload at application level and does not have the issues of library-based approaches. Hence, a parallelization scheme for GPU systems based on application transformation is needed. As with any computing engine of today, reliability is also a concern in GPUs. GPUs are vulnerable to transient and permanent failures. Current checkpoint/restart techniques are not suitable for systems with GPUs. Checkpointing for GPU systems presents new and interesting challenges, primarily due to the natural differences imposed by the hardware design, the memory subsystem architecture, the massive number of threads, and the limited amount of synchronization among threads. Therefore, a checkpoint/restart technique suitable for GPU systems is needed. The goal of this work is to exploit higher levels of parallelism and
SU-E-T-500: Initial Implementation of GPU-Based Particle Swarm Optimization for 4D IMRT Planning in Lung SBRT

International Nuclear Information System (INIS)

Modiri, A; Hagan, A; Gu, X; Sawant, A

2015-01-01

Purpose: 4D-IMRT planning, combined with dynamic MLC tracking delivery, utilizes the temporal dimension as an additional degree of freedom to achieve improved OAR-sparing. The computational complexity for such optimization increases exponentially with increase in dimensionality. In order to accomplish this task in a clinically-feasible time frame, we present an initial implementation of GPU-based 4D-IMRT planning based on particle swarm optimization (PSO). Methods: The target and normal structures were manually contoured on ten phases of a 4DCT scan of a NSCLC patient with a 54 cm^3 right-lower-lobe tumor (1.5 cm motion). Corresponding ten 3D-IMRT plans were created in the Eclipse treatment planning system (Ver-13.6). A vendor-provided scripting interface was used to export 3D-dose matrices corresponding to each control point (10 phases × 9 beams × 166 control points = 14,940), which served as input to PSO. The optimization task was to iteratively adjust the weights of each control point and scale the corresponding dose matrices. In order to handle the large amount of data in GPU memory, dose matrices were sparsified and placed in contiguous memory blocks with the 14,940 weight-variables. PSO was implemented on CPU (dual-Xeon, 3.1 GHz) and GPU (dual-K20 Tesla, 2496 cores, 3.52 Tflops, each) platforms. NiftyReg, an open-source deformable image registration package, was used to calculate the summed dose. Results: The 4D-PSO plan yielded PTV coverage comparable to the clinical ITV-based plan and significantly higher OAR-sparing, as follows: lung Dmean=33%; lung V20=27%; spinal cord Dmax=26%; esophagus Dmax=42%; heart Dmax=0%; heart Dmean=47%. The GPU-PSO processing time for 14,940 variables and 7 PSO-particles was 41% that of CPU-PSO (199 vs. 488 minutes). Conclusion: Truly 4D-IMRT planning can yield significant OAR dose-sparing while preserving PTV coverage. The corresponding optimization problem is large-scale, non-convex and computationally rigorous. Our initial results
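The problem size and timing figures quoted in this abstract can be checked with a few lines of Python. This is only a worked-arithmetic sketch using the numbers stated above; the variable names are illustrative and nothing here reproduces the authors' PSO implementation.

```python
# Numbers taken from the abstract; variable names are illustrative only.
phases, beams, control_points_per_beam = 10, 9, 166
n_weights = phases * beams * control_points_per_beam  # one weight per exported dose matrix

gpu_minutes, cpu_minutes = 199, 488
time_fraction = gpu_minutes / cpu_minutes  # GPU time as a fraction of CPU time
speedup = cpu_minutes / gpu_minutes        # the same result expressed as a speedup

print(n_weights)                 # 14940
print(f"{time_fraction:.0%}")    # 41%
print(f"{speedup:.2f}x")         # 2.45x
```

Read this way, the reported "41% of the CPU time" corresponds to a speedup of roughly 2.5 times for the dual-GPU run.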
SU-E-T-500: Initial Implementation of GPU-Based Particle Swarm Optimization for 4D IMRT Planning in Lung SBRT

Energy Technology Data Exchange (ETDEWEB)

Modiri, A; Hagan, A; Gu, X; Sawant, A [UT Southwestern Medical Center, Dallas, TX (United States)

2015-06-15

Purpose: 4D-IMRT planning, combined with dynamic MLC tracking delivery, utilizes the temporal dimension as an additional degree of freedom to achieve improved OAR-sparing. The computational complexity for such optimization increases exponentially with increase in dimensionality. In order to accomplish this task in a clinically-feasible time frame, we present an initial implementation of GPU-based 4D-IMRT planning based on particle swarm optimization (PSO). Methods: The target and normal structures were manually contoured on ten phases of a 4DCT scan of a NSCLC patient with a 54 cm^3 right-lower-lobe tumor (1.5 cm motion). Corresponding ten 3D-IMRT plans were created in the Eclipse treatment planning system (Ver-13.6). A vendor-provided scripting interface was used to export 3D-dose matrices corresponding to each control point (10 phases × 9 beams × 166 control points = 14,940), which served as input to PSO. The optimization task was to iteratively adjust the weights of each control point and scale the corresponding dose matrices. In order to handle the large amount of data in GPU memory, dose matrices were sparsified and placed in contiguous memory blocks with the 14,940 weight-variables. PSO was implemented on CPU (dual-Xeon, 3.1 GHz) and GPU (dual-K20 Tesla, 2496 cores, 3.52 Tflops, each) platforms. NiftyReg, an open-source deformable image registration package, was used to calculate the summed dose. Results: The 4D-PSO plan yielded PTV coverage comparable to the clinical ITV-based plan and significantly higher OAR-sparing, as follows: lung Dmean=33%; lung V20=27%; spinal cord Dmax=26%; esophagus Dmax=42%; heart Dmax=0%; heart Dmean=47%. The GPU-PSO processing time for 14,940 variables and 7 PSO-particles was 41% that of CPU-PSO (199 vs. 488 minutes). Conclusion: Truly 4D-IMRT planning can yield significant OAR dose-sparing while preserving PTV coverage. The corresponding optimization problem is large-scale, non-convex and computationally rigorous. Our initial results
A Performance/Cost Evaluation for a GPU-Based Drug Discovery Application on Volunteer Computing

Science.gov (United States)

Guerrero, Ginés D.; Imbernón, Baldomero; García, José M.

2014-01-01

Bioinformatics is an interdisciplinary research field that develops tools for the analysis of large biological databases, and, thus, the use of high performance computing (HPC) platforms is mandatory for the generation of useful biological knowledge. The latest generation of graphics processing units (GPUs) has democratized the use of HPC as they push desktop computers to cluster-level performance. Many applications within this field have been developed to leverage these powerful and low-cost architectures. However, these applications still need to scale to larger GPU-based systems to enable remarkable advances in the fields of healthcare, drug discovery, genome research, etc. The inclusion of GPUs in HPC systems exacerbates power and temperature issues, increasing the total cost of ownership (TCO). This paper explores the benefits of volunteer computing to scale bioinformatics applications as an alternative to owning large GPU-based local infrastructures. We use as a benchmark a GPU-based drug discovery application called BINDSURF, whose computational requirements go beyond a single desktop machine. Volunteer computing is presented as a cheap and valid HPC system for those bioinformatics applications that need to process huge amounts of data and where the response time is not a critical factor. PMID:25025055
GPU-accelerated micromagnetic simulations using cloud computing

Energy Technology Data Exchange (ETDEWEB)

Jermain, C.L., E-mail: clj72@cornell.edu [Cornell University, Ithaca, NY 14853 (United States); Rowlands, G.E.; Buhrman, R.A. [Cornell University, Ithaca, NY 14853 (United States); Ralph, D.C. [Cornell University, Ithaca, NY 14853 (United States); Kavli Institute at Cornell, Ithaca, NY 14853 (United States)

2016-03-01

Highly parallel graphics processing units (GPUs) can improve the speed of micromagnetic simulations significantly as compared to conventional computing using central processing units (CPUs). We present a strategy for performing GPU-accelerated micromagnetic simulations by utilizing cost-effective GPU access offered by cloud computing services with an open-source Python-based program for running the MuMax3 micromagnetics code remotely. We analyze the scaling and cost benefits of using cloud computing for micromagnetics. - Highlights: • The benefits of cloud computing for GPU-accelerated micromagnetics are examined. • We present the MuCloud software for running simulations on cloud computing. • Simulation run times are measured to benchmark cloud computing performance. • Comparison benchmarks are analyzed between CPU and GPU based solvers.
GPU-accelerated micromagnetic simulations using cloud computing

International Nuclear Information System (INIS)

Jermain, C.L.; Rowlands, G.E.; Buhrman, R.A.; Ralph, D.C.

2016-01-01

Highly parallel graphics processing units (GPUs) can improve the speed of micromagnetic simulations significantly as compared to conventional computing using central processing units (CPUs). We present a strategy for performing GPU-accelerated micromagnetic simulations by utilizing cost-effective GPU access offered by cloud computing services with an open-source Python-based program for running the MuMax3 micromagnetics code remotely. We analyze the scaling and cost benefits of using cloud computing for micromagnetics. - Highlights: • The benefits of cloud computing for GPU-accelerated micromagnetics are examined. • We present the MuCloud software for running simulations on cloud computing. • Simulation run times are measured to benchmark cloud computing performance. • Comparison benchmarks are analyzed between CPU and GPU based solvers.
GPU Accelerated Chemical Similarity Calculation for Compound Library Comparison

Science.gov (United States)

Ma, Chao; Wang, Lirong; Xie, Xiang-Qun

2012-01-01

Chemical similarity calculation plays an important role in compound library design, virtual screening, and “lead” optimization. In this manuscript, we present a novel GPU-accelerated algorithm for all-vs-all Tanimoto matrix calculation and nearest neighbor search. By taking advantage of multi-core GPU architecture and CUDA parallel programming technology, the algorithm is up to 39 times superior to the existing commercial software that runs on CPUs. Because of the utilization of intrinsic GPU instructions, this approach is nearly 10 times faster than existing GPU-accelerated sparse vector algorithm, when Unity fingerprints are used for Tanimoto calculation. The GPU program that implements this new method takes about 20 minutes to complete the calculation of Tanimoto coefficients between 32M PubChem compounds and 10K Active Probes compounds, i.e., 324G Tanimoto coefficients, on a 128-CUDA-core GPU. PMID:21692447
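The core operation behind an all-vs-all Tanimoto calculation on binary fingerprints can be sketched in plain Python. This is a CPU-only illustration, not the authors' CUDA code: fingerprints are assumed to be stored as Python integers used as bit vectors, the helper names are hypothetical, and the 8-bit toy fingerprints are made up. The bit-counting step is where a GPU implementation would typically use a hardware population-count instruction.

```python
def popcount(x: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")

def tanimoto(a: int, b: int) -> float:
    """Tanimoto coefficient of two bit-vector fingerprints: |a AND b| / |a OR b|."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0

def nearest_neighbors(queries, library):
    """For each query, return (index of the most similar library entry, its score)."""
    result = []
    for q in queries:
        scores = [tanimoto(q, fp) for fp in library]
        best = max(range(len(library)), key=scores.__getitem__)
        result.append((best, scores[best]))
    return result

# Toy 8-bit fingerprints (invented for illustration).
library = [0b11110000, 0b00111100, 0b00001111]
queries = [0b11100000, 0b00001110]
nn = nearest_neighbors(queries, library)
print(nn)  # [(0, 0.75), (2, 0.75)]
```

Every (query, library) pair is independent of every other, which is what makes the all-vs-all matrix a natural fit for one-thread-per-pair GPU parallelism.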
APEnet+: a 3D Torus network optimized for GPU-based HPC Systems

International Nuclear Information System (INIS)

Ammendola, R; Biagioni, A; Frezza, O; Lo Cicero, F; Lonardo, A; Paolucci, P S; Rossetti, D; Simula, F; Tosoratto, L; Vicini, P

2012-01-01

In the supercomputing arena, the strong rise of GPU-accelerated clusters is a matter of fact. Within INFN, we proposed an initiative, the QUonG project, whose aim is to deploy a high performance computing system dedicated to scientific computations leveraging commodity multi-core processors coupled with latest generation GPUs. The inter-node interconnection system is based on a point-to-point, high performance, low latency 3D torus network which is built in the framework of the APEnet+ project. It takes the form of an FPGA-based PCIe network card exposing six full bidirectional links running at 34 Gbps each that implements the RDMA protocol. In order to enable significant access latency reduction for inter-node data transfer, a direct network-to-GPU interface was built. The specialized hardware blocks, integrated in the APEnet+ board, provide support for GPU-initiated communications using the so-called PCIe peer-to-peer (P2P) transactions. This development is made in close collaboration with the GPU vendor NVIDIA. The final shape of a complete QUonG deployment is an assembly of standard 42U racks, each one capable of 80 TFLOPS/rack of peak performance, at a cost of 5 k€/TFLOPS and for an estimated power consumption of 25 kW/rack. In this paper we report on the status of final rack deployment and on the R&D activities for 2012 that will focus on performance enhancement of the APEnet+ hardware through the adoption of new generation 28 nm FPGAs allowing the implementation of PCIe Gen3 host interface and the addition of new fault tolerance-oriented capabilities.
APEnet+: a 3D Torus network optimized for GPU-based HPC Systems

Energy Technology Data Exchange (ETDEWEB)

Ammendola, R [INFN Tor Vergata (Italy); Biagioni, A; Frezza, O; Lo Cicero, F; Lonardo, A; Paolucci, P S; Rossetti, D; Simula, F; Tosoratto, L; Vicini, P [INFN Roma (Italy)

2012-12-13

In the supercomputing arena, the strong rise of GPU-accelerated clusters is a matter of fact. Within INFN, we proposed an initiative, the QUonG project, whose aim is to deploy a high performance computing system dedicated to scientific computations leveraging commodity multi-core processors coupled with latest generation GPUs. The inter-node interconnection system is based on a point-to-point, high performance, low latency 3D torus network which is built in the framework of the APEnet+ project. It takes the form of an FPGA-based PCIe network card exposing six full bidirectional links running at 34 Gbps each that implements the RDMA protocol. In order to enable significant access latency reduction for inter-node data transfer, a direct network-to-GPU interface was built. The specialized hardware blocks, integrated in the APEnet+ board, provide support for GPU-initiated communications using the so-called PCIe peer-to-peer (P2P) transactions. This development is made in close collaboration with the GPU vendor NVIDIA. The final shape of a complete QUonG deployment is an assembly of standard 42U racks, each one capable of 80 TFLOPS/rack of peak performance, at a cost of 5 k€/TFLOPS and for an estimated power consumption of 25 kW/rack. In this paper we report on the status of final rack deployment and on the R&D activities for 2012 that will focus on performance enhancement of the APEnet+ hardware through the adoption of new generation 28 nm FPGAs allowing the implementation of PCIe Gen3 host interface and the addition of new fault tolerance-oriented capabilities.
Analysis of performance improvements for host and GPU interface of the APENet+ 3D Torus network

International Nuclear Information System (INIS)

Ammendola A, R; Biagioni, A; Frezza, O; Lo Cicero, F; Lonardo, A; Paolucci, P S; Rossetti, D; Simula, F; Tosoratto, L; Vicini, P

2014-01-01

APEnet+ is an INFN (Italian Institute for Nuclear Physics) project aiming to develop a custom 3-dimensional torus interconnect network optimized for hybrid CPU-GPU clusters dedicated to high performance scientific computing. The APEnet+ interconnect fabric is built on an FPGA-based PCI-express board with 6 bi-directional off-board links showing 34 Gbps of raw bandwidth per direction, and leverages the peer-to-peer capabilities of Fermi and Kepler-class NVIDIA GPUs to obtain real zero-copy, GPU-to-GPU low latency transfers. The minimization of APEnet+ transfer latency is achieved through the adoption of an RDMA protocol implemented in FPGA with specialized hardware blocks tightly coupled with an embedded microprocessor. This architecture provides a high performance low latency offload engine for both the transmit and receive sides of data transactions: preliminary results are encouraging, showing a 50% bandwidth increase for large packet size transfers. In this paper we describe the APEnet+ architecture, detailing the hardware implementation, and discuss the impact of such RDMA specialized hardware on host interface latency and bandwidth.
Analysis of performance improvements for host and GPU interface of the APENet+ 3D Torus network

Science.gov (United States)

Ammendola A, R.; Biagioni, A.; Frezza, O.; Lo Cicero, F.; Lonardo, A.; Paolucci, P. S.; Rossetti, D.; Simula, F.; Tosoratto, L.; Vicini, P.

2014-06-01

APEnet+ is an INFN (Italian Institute for Nuclear Physics) project aiming to develop a custom 3-dimensional torus interconnect network optimized for hybrid CPU-GPU clusters dedicated to high performance scientific computing. The APEnet+ interconnect fabric is built on an FPGA-based PCI-express board with 6 bi-directional off-board links showing 34 Gbps of raw bandwidth per direction, and leverages the peer-to-peer capabilities of Fermi and Kepler-class NVIDIA GPUs to obtain real zero-copy, GPU-to-GPU low latency transfers. The minimization of APEnet+ transfer latency is achieved through the adoption of an RDMA protocol implemented in FPGA with specialized hardware blocks tightly coupled with an embedded microprocessor. This architecture provides a high performance low latency offload engine for both the transmit and receive sides of data transactions: preliminary results are encouraging, showing a 50% bandwidth increase for large packet size transfers. In this paper we describe the APEnet+ architecture, detailing the hardware implementation, and discuss the impact of such RDMA specialized hardware on host interface latency and bandwidth.
Analysis of performance improvements for host and GPU interface of the APENet+ 3D Torus network

Energy Technology Data Exchange (ETDEWEB)

Ammendola A, R [INFN Roma II, Via della Ricerca Scientifica 1 – 00133 Roma (Italy); Biagioni, A; Frezza, O; Lo Cicero, F; Lonardo, A; Paolucci, P S; Rossetti, D; Simula, F; Tosoratto, L; Vicini, P [INFN Roma I, P.le Aldo Moro 2 – 00185 Roma (Italy)

2014-06-06

APEnet+ is an INFN (Italian Institute for Nuclear Physics) project aiming to develop a custom 3-dimensional torus interconnect network optimized for hybrid CPU-GPU clusters dedicated to high performance scientific computing. The APEnet+ interconnect fabric is built on an FPGA-based PCI-express board with 6 bi-directional off-board links showing 34 Gbps of raw bandwidth per direction, and leverages the peer-to-peer capabilities of Fermi and Kepler-class NVIDIA GPUs to obtain real zero-copy, GPU-to-GPU low latency transfers. The minimization of APEnet+ transfer latency is achieved through the adoption of an RDMA protocol implemented in FPGA with specialized hardware blocks tightly coupled with an embedded microprocessor. This architecture provides a high performance low latency offload engine for both the transmit and receive sides of data transactions: preliminary results are encouraging, showing a 50% bandwidth increase for large packet size transfers. In this paper we describe the APEnet+ architecture, detailing the hardware implementation, and discuss the impact of such RDMA specialized hardware on host interface latency and bandwidth.
GPU: the biggest key processor for AI and parallel processing

Science.gov (United States)

Baji, Toru

2017-07-01

Two types of processors exist in the market: the conventional CPU and the Graphics Processing Unit (GPU). A typical CPU is composed of 1 to 8 cores, while a GPU has thousands of cores. The CPU is good for sequential processing, while the GPU is good at accelerating software with heavily parallel execution. The GPU was initially dedicated to 3D graphics. However, from 2006, when GPUs started to adopt general-purpose cores, it was noticed that this architecture can be used as a general-purpose massively parallel processor. NVIDIA developed a software framework, Compute Unified Device Architecture (CUDA), that makes it possible to easily program the GPU for these applications. With CUDA, GPUs started to be used widely in workstations and supercomputers. Recently, two key technologies have been highlighted in the industry: Artificial Intelligence (AI) and autonomous driving cars. AI requires massively parallel operation to train many-layered neural networks. With the CPU alone, it was impossible to finish the training in a practical time. The latest multi-GPU system with P100 makes it possible to finish the training in a few hours. For autonomous driving cars, TOPS-class performance is required to implement perception, localization, and path-planning processing, and again an SoC with an integrated GPU will play a key role there. In this paper, the evolution of the GPU, which is one of the biggest commercial devices requiring state-of-the-art fabrication technology, will be introduced. An overview of key GPU-demanding applications like the ones described above will also be introduced.

GPU Implementation of High Rayleigh Number Three-Dimensional Mantle Convection

Science.gov (United States)

Sanchez, D. A.; Yuen, D. A.; Wright, G. B.; Barnett, G. A.

2010-12-01

Although we have entered the age of petascale computing, many factors are still prohibiting high-performance computing (HPC) from infiltrating all suitable scientific disciplines. For this reason and others, application of GPU to HPC is gaining traction in the scientific world. With its low price point, high performance potential, and competitive scalability, GPU has been an option well worth considering for the last few years. Moreover, with the advent of NVIDIA's Fermi architecture, which brings ECC memory, better double-precision performance, and more RAM to GPU, there is a strong message of corporate support for GPU in HPC. However, many doubts linger concerning the practicality of using GPU for scientific computing. In particular, GPU has a reputation for being difficult to program and suitable for only a small subset of problems. Although inroads have been made in addressing these concerns, for many scientists GPU still has hurdles to clear before becoming an acceptable choice. We explore the applicability of GPU to geophysics by implementing a three-dimensional, second-order finite-difference model of Rayleigh-Bénard thermal convection on an NVIDIA GPU using C for CUDA. Our code reaches sufficient resolution, on the order of 500x500x250 evenly-spaced finite-difference gridpoints, on a single GPU. We make extensive use of highly optimized CUBLAS routines, allowing us to achieve performance on the order of O(0.1) µs per timestep*gridpoint at this resolution. This performance has allowed us to study high Rayleigh number simulations, on the order of 2x10^7, on a single GPU.
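To put the quoted per-gridpoint cost in perspective, the following back-of-the-envelope Python sketch converts it to wall-clock time per timestep. It assumes the cost is exactly 0.1 µs per timestep*gridpoint, whereas the abstract states only an order of magnitude, so the result is a rough estimate rather than a reported measurement.

```python
# Grid resolution quoted in the abstract.
nx, ny, nz = 500, 500, 250
gridpoints = nx * ny * nz  # 62,500,000 gridpoints

# Assumption: take the order-of-magnitude figure as exactly 0.1 microseconds.
cost_per_point_s = 0.1e-6

seconds_per_step = gridpoints * cost_per_point_s
steps_per_hour = 3600 / seconds_per_step

print(gridpoints)                  # 62500000
print(f"{seconds_per_step:.2f}")   # 6.25
print(f"{steps_per_hour:.0f}")     # 576
```

Under that assumption a single timestep at full resolution takes a few seconds, that is, several hundred timesteps per hour on one GPU.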
A Novel CPU/GPU Simulation Environment for Large-Scale Biologically-Realistic Neural Modeling

Directory of Open Access Journals (Sweden)

Roger V Hoang

2013-10-01

Computational Neuroscience is an emerging field that provides unique opportunities to study complex brain structures through realistic neural simulations. However, as biological details are added to models, the execution time for the simulation becomes longer. Graphics Processing Units (GPUs) are now being utilized to accelerate simulations due to their ability to perform computations in parallel. As such, they have shown significant improvement in execution time compared to Central Processing Units (CPUs). Most neural simulators utilize either multiple CPUs or a single GPU for better performance, but still show limitations in execution time when biological details are not sacrificed. Therefore, we present a novel CPU/GPU simulation environment for large-scale biological networks, the NeoCortical Simulator version 6 (NCS6). NCS6 is a free, open-source, parallelizable, and scalable simulator, designed to run on clusters of multiple machines, potentially with high performance computing devices in each of them. It has built-in leaky-integrate-and-fire (LIF) and Izhikevich (IZH) neuron models, but users also have the capability to design their own plug-in interface for different neuron types as desired. NCS6 is currently able to simulate one million cells and 100 million synapses in quasi real time by distributing data across these heterogeneous clusters of CPUs and GPUs.
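For readers unfamiliar with the leaky-integrate-and-fire model named in this abstract, a minimal sketch of the textbook form may help. This is not NCS6 code: the function name, the forward-Euler integration, and all parameter values (membrane time constant, thresholds, resistance) are illustrative assumptions.

```python
def simulate_lif(current_na, steps=1000, dt_ms=0.1, tau_ms=10.0,
                 v_rest_mv=-65.0, v_thresh_mv=-50.0, v_reset_mv=-65.0,
                 resistance_mohm=10.0):
    """Forward-Euler integration of a textbook LIF neuron; returns spike times in ms."""
    v = v_rest_mv
    spike_times = []
    for step in range(steps):
        # Membrane equation: dv/dt = (-(v - v_rest) + R * I) / tau
        dv = (-(v - v_rest_mv) + resistance_mohm * current_na) / tau_ms
        v += dt_ms * dv
        if v >= v_thresh_mv:
            # Threshold crossed: record a spike and reset the membrane potential.
            spike_times.append((step + 1) * dt_ms)
            v = v_reset_mv
    return spike_times


quiet = simulate_lif(current_na=0.0)
driven = simulate_lif(current_na=2.0)
print(len(quiet), "spikes without input;", len(driven), "spikes with 2 nA input")
```

Each neuron's update depends only on its own state and input current, which is why large populations of such neurons map naturally onto parallel hardware.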
Tesla coil theoretical model and experimental verification

OpenAIRE

Voitkans, Janis; Voitkans, Arnis

2014-01-01

Abstract – In this paper a theoretical model of a Tesla coil operation is proposed. Tesla coil is described as a long line with distributed parameters in a single-wired format, where the line voltage is measured against electrically neutral space. It is shown that equivalent two-wired scheme can be found for a single-wired scheme and already known long line theory can be applied to a Tesla coil. Formulas for calculation of voltage in a Tesla coil by coordinate and calculation of resonance fre...
cudaMap: a GPU accelerated program for gene expression connectivity mapping.

Science.gov (United States)

McArt, Darragh G; Bankhead, Peter; Dunne, Philip D; Salto-Tellez, Manuel; Hamilton, Peter; Zhang, Shu-Dong

2013-10-11

Modern cancer research often involves large datasets and the use of sophisticated statistical techniques. Together these add a heavy computational load to the analysis, which is often coupled with issues surrounding data accessibility. Connectivity mapping is an advanced bioinformatic and computational technique dedicated to therapeutics discovery and drug re-purposing around differential gene expression analysis. On a normal desktop PC, it is common for the connectivity mapping task with a single gene signature to take > 2h to complete using sscMap, a popular Java application that runs on standard CPUs (Central Processing Units). Here, we describe new software, cudaMap, which has been implemented using CUDA C/C++ to harness the computational power of NVIDIA GPUs (Graphics Processing Units) to greatly reduce processing times for connectivity mapping. cudaMap can identify candidate therapeutics from the same signature in just over thirty seconds when using an NVIDIA Tesla C2050 GPU. Results from the analysis of multiple gene signatures, which would previously have taken several days, can now be obtained in as little as 10 minutes, greatly facilitating candidate therapeutics discovery with high throughput. We are able to demonstrate dramatic speed differentials between GPU assisted performance and CPU executions as the computational load increases for high accuracy evaluation of statistical significance. Emerging 'omics' technologies are constantly increasing the volume of data and information to be processed in all areas of biomedical research. Embracing the multicore functionality of GPUs represents a major avenue of local accelerated computing. cudaMap will make a strong contribution in the discovery of candidate therapeutics by enabling speedy execution of heavy duty connectivity mapping tasks, which are increasingly required in modern cancer research. cudaMap is open source and can be freely downloaded from http://purl.oclc.org/NET/cudaMap.
Real-Time Incompressible Fluid Simulation on the GPU

Directory of Open Access Journals (Sweden)

Xiao Nie

2015-01-01

We present a parallel framework for simulating incompressible fluids with predictive-corrective incompressible smoothed particle hydrodynamics (PCISPH) on the GPU in real time. To this end, we propose an efficient GPU streaming pipeline to map the entire computational task onto the GPU, fully exploiting the massive computational power of state-of-the-art GPUs. In PCISPH-based simulations, neighbor search is the major performance obstacle because this process is performed several times at each time step. To eliminate this bottleneck, an efficient parallel sorting method for this time-consuming step is introduced. Moreover, we discuss several optimization techniques, including using fast on-chip shared memory to avoid global memory bandwidth limitations and thus further improve performance on modern GPU hardware. With our framework, the realism of real-time fluid simulation is significantly improved since our method enforces the incompressibility constraint, which is typically ignored for efficiency reasons in previous GPU-based SPH methods. The performance results illustrate that our approach can efficiently simulate realistic incompressible fluid in real time and results in a speed-up factor of up to 23 on a high-end NVIDIA GPU in comparison to a single-threaded CPU-based implementation.
GPU applications for data processing

Energy Technology Data Exchange (ETDEWEB)

Vladymyrov, Mykhailo, E-mail: mykhailo.vladymyrov@cern.ch [LPI - Lebedev Physical Institute of the Russian Academy of Sciences, RUS-119991 Moscow (Russian Federation); Aleksandrov, Andrey [LPI - Lebedev Physical Institute of the Russian Academy of Sciences, RUS-119991 Moscow (Russian Federation); INFN sezione di Napoli, I-80125 Napoli (Italy); Tioukov, Valeri [INFN sezione di Napoli, I-80125 Napoli (Italy)

2015-12-31

Modern experiments that use nuclear photoemulsion rely on fast and efficient data acquisition from the emulsion. New approaches to developing scanning systems require real-time processing of large amounts of data. Methods that use Graphical Processing Unit (GPU) computing power for emulsion data processing are presented here. It is shown how GPU-accelerated emulsion processing helped us to raise the scanning speed by a factor of nine.
GPU-based high-performance computing for radiation therapy

International Nuclear Information System (INIS)

Jia, Xun; Jiang, Steve B; Ziegenhein, Peter

2014-01-01

Recent developments in radiotherapy demand high computational power to solve challenging problems in a timely fashion in a clinical environment. The graphics processing unit (GPU), as an emerging high-performance computing platform, has been introduced to radiotherapy. It is particularly attractive due to its high computational power, small size, and low cost for facility deployment and maintenance. Over the past few years, GPU-based high-performance computing in radiotherapy has experienced rapid developments. A large number of studies have been conducted, in which large acceleration factors compared with the conventional CPU platform have been observed. In this paper, we will first give a brief introduction to the GPU hardware structure and programming model. We will then review the current applications of GPU in major imaging-related and therapy-related problems encountered in radiotherapy. A comparison of GPU with other platforms will also be presented. (topical review)
Haptic Feedback for the GPU-based Surgical Simulator

DEFF Research Database (Denmark)

Sørensen, Thomas Sangild; Mosegaard, Jesper

2006-01-01

The GPU has proven to be a powerful processor to compute spring-mass based surgical simulations. It has not previously been shown however, how to effectively implement haptic interaction with a simulation running entirely on the GPU. This paper describes a method to calculate haptic feedback...... with limited performance cost. It allows easy balancing of the GPU workload between calculations of simulation, visualisation, and the haptic feedback....
Artefacts induced by coiled intracranial aneurysms on 3.0-Tesla versus 1.5-Tesla MR angiography—An in vivo and in vitro study

Energy Technology Data Exchange (ETDEWEB)

Schaafsma, Joanna D., E-mail: j.d.schaafsma@umcutrecht.nl [Department of Neurology, Rudolf Magnus Institute of Neuroscience, University Medical Centre, PO Box 85500, 3508 GA Utrecht (Netherlands); Velthuis, Birgitta K., E-mail: b.k.velthuis@umcutrecht.nl [Imaging Division, University Medical Centre, PO Box 85500, 3508 GA Utrecht (Netherlands); Vincken, Koen L., E-mail: koen@isi.uu.nl [Image Sciences Institute, University Medical Centre, PO Box 85500, 3508 GA Utrecht (Netherlands); Kort, Gerard A.P. de, E-mail: g.a.p.dekort@umcutrecht.nl [Imaging Division, University Medical Centre, PO Box 85500, 3508 GA Utrecht (Netherlands); Rinkel, Gabriel J.E., E-mail: g.j.e.rinkel@umcutrecht.nl [Department of Neurology, Rudolf Magnus Institute of Neuroscience, University Medical Centre, PO Box 85500, 3508 GA Utrecht (Netherlands); Bartels, Lambertus W., E-mail: w.bartels@umcutrecht.nl [Image Sciences Institute, University Medical Centre, PO Box 85500, 3508 GA Utrecht (Netherlands)

2014-05-15

Objective: To compare metal-induced artefacts from coiled intracranial aneurysms on 3.0-Tesla and 1.5-Tesla magnetic resonance angiography (MRA), since concerns persist on artefact enlargement at 3.0 Tesla. Materials and methods: We scanned 19 patients (mean age 53; 16 women) with 20 saccular aneurysms treated with coils only, at 1.5 and 3.0 Tesla according to standard clinical 3D TOF-MRA protocols containing a shorter echo-time but weaker read-out gradient at 3.0 Tesla in addition to intra-arterial digital subtraction angiography (IA-DSA). Per modality two neuro-radiologists assessed the occlusion status, measured residual flow, and indicated whether coil artefacts disturbed this assessment on MRA. We assessed relative risks for disturbance by coil artefacts, weighted kappa's for agreement on occlusion levels, and we compared remnant sizes. For artefact measurements, a coil model was created and scanned with the same protocols followed by 2D MR scans with variation of echo-time and read-out gradient strength. Results: Coil artefacts disturbed assessments less frequently at 3.0 Tesla than at 1.5 Tesla (RR: 0.3; 95%CI: 0.1–0.8). On 3.0-Tesla MRA, remnants were larger than on 1.5-Tesla MRA (difference: 0.7 mm; 95%CI: 0.3–1.1) and larger than on IA-DSA (difference: 1.0 mm; 95%CI: 0.6–1.5) with similar agreement on occlusion levels with IA-DSA for both field strengths (κ 0.53; 95%CI: 0.23–0.84 for 1.5-Tesla MRA and IA-DSA; κ 0.47; 95%CI: 0.19–0.76 for 3.0-Tesla MRA and IA-DSA). Coil model artefacts were smaller at 3.0 Tesla than at 1.5 Tesla. The echo-time influenced artefact size more than the read-out gradient. Conclusions: Artefacts were not larger, but smaller at 3.0 Tesla because a shorter echo-time at 3.0 Tesla negated artefact enlargement. Despite smaller artefacts and larger remnants at 3.0 Tesla, occlusion levels were similar for both field strengths.
Artefacts induced by coiled intracranial aneurysms on 3.0-Tesla versus 1.5-Tesla MR angiography--An in vivo and in vitro study.

Science.gov (United States)

Schaafsma, Joanna D; Velthuis, Birgitta K; Vincken, Koen L; de Kort, Gerard A P; Rinkel, Gabriel J E; Bartels, Lambertus W

2014-05-01

To compare metal-induced artefacts from coiled intracranial aneurysms on 3.0-Tesla and 1.5-Tesla magnetic resonance angiography (MRA), since concerns persist on artefact enlargement at 3.0 Tesla. We scanned 19 patients (mean age 53; 16 women) with 20 saccular aneurysms treated with coils only, at 1.5 and 3.0 Tesla according to standard clinical 3D TOF-MRA protocols containing a shorter echo-time but weaker read-out gradient at 3.0 Tesla in addition to intra-arterial digital subtraction angiography (IA-DSA). Per modality two neuro-radiologists assessed the occlusion status, measured residual flow, and indicated whether coil artefacts disturbed this assessment on MRA. We assessed relative risks for disturbance by coil artefacts, weighted kappa's for agreement on occlusion levels, and we compared remnant sizes. For artefact measurements, a coil model was created and scanned with the same protocols followed by 2D MR scans with variation of echo-time and read-out gradient strength. Coil artefacts disturbed assessments less frequently at 3.0 Tesla than at 1.5 Tesla (RR: 0.3; 95%CI: 0.1-0.8). On 3.0-Tesla MRA, remnants were larger than on 1.5-Tesla MRA (difference: 0.7 mm; 95%CI: 0.3-1.1) and larger than on IA-DSA (difference: 1.0 mm; 95%CI: 0.6-1.5) with similar agreement on occlusion levels with IA-DSA for both field strengths (κ 0.53; 95%CI: 0.23-0.84 for 1.5-Tesla MRA and IA-DSA; κ 0.47; 95%CI: 0.19-0.76 for 3.0-Tesla MRA and IA-DSA). Coil model artefacts were smaller at 3.0 Tesla than at 1.5 Tesla. The echo-time influenced artefact size more than the read-out gradient. Artefacts were not larger, but smaller at 3.0 Tesla because a shorter echo-time at 3.0 Tesla negated artefact enlargement. Despite smaller artefacts and larger remnants at 3.0 Tesla, occlusion levels were similar for both field strengths. Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.
A survey and measurement study of GPU DVFS on energy conservation

Directory of Open Access Journals (Sweden)

Xinxin Mei

2017-05-01

Energy efficiency has become one of the top design criteria for current computing systems. Dynamic voltage and frequency scaling (DVFS) has been widely adopted by laptop computers, servers, and mobile devices to conserve energy, while GPU DVFS is still at an early stage. This paper aims at exploring the impact of GPU DVFS on application performance and power consumption, and furthermore, on energy conservation. We survey the state-of-the-art GPU DVFS characterizations, and then summarize recent research works on GPU power and performance models. We also conduct real GPU DVFS experiments on NVIDIA Fermi and Maxwell GPUs. According to our experimental results, GPU DVFS has significant potential for energy saving. The effect of scaling core voltage/frequency and memory voltage/frequency depends not only on the GPU architecture, but also on the characteristics of the GPU applications.
FLOCKING-BASED DOCUMENT CLUSTERING ON THE GRAPHICS PROCESSING UNIT [Book Chapter]

Energy Technology Data Exchange (ETDEWEB)

Charles, J S; Patton, R M; Potok, T E; Cui, X

2008-01-01

Analyzing and grouping documents by content is a complex problem. One explored method of solving this problem borrows from nature, imitating the flocking behavior of birds. Each bird represents a single document and flies toward other documents that are similar to it. One limitation of this method of document clustering is its complexity, O(n^2). As the number of documents grows, it becomes increasingly difficult to receive results in a reasonable amount of time. However, flocking behavior, along with most naturally inspired algorithms such as ant colony optimization and particle swarm optimization, is highly parallel and has experienced improved performance on expensive cluster computers. In the last few years, the graphics processing unit (GPU) has received attention for its ability to solve highly-parallel and semi-parallel problems much faster than the traditional sequential processor. Some applications see a huge increase in performance on this new platform. The cost of these high-performance devices is also marginal when compared with the price of cluster machines. In this paper, we have conducted research to exploit this architecture and apply its strengths to the document flocking problem. Our results highlight the potential benefit the GPU brings to all naturally inspired algorithms. Using the CUDA platform from NVIDIA®, we developed a document flocking implementation to be run on the NVIDIA® GEFORCE 8800. Additionally, we developed a similar but sequential implementation of the same algorithm to be run on a desktop CPU. We tested the performance of each on groups of news articles ranging in size from 200 to 3,000 documents. The results of these tests were very significant. Performance gains ranged from three to nearly five times improvement of the GPU over the CPU implementation. This dramatic improvement in runtime makes the GPU a potentially revolutionary platform for document clustering algorithms.
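To make the quadratic growth mentioned in this abstract concrete, the sketch below counts the document pairs that would need a similarity comparison in one iteration. It assumes every unordered pair is compared exactly once per iteration, which is one simple reading of the stated complexity and not necessarily how the authors' implementation works; the function name is illustrative.

```python
def pairwise_comparisons(n_documents):
    """Number of unordered document pairs, n * (n - 1) / 2."""
    return n_documents * (n_documents - 1) // 2


# Smallest and largest collection sizes quoted in the abstract.
for n in (200, 3000):
    print(f"{n} documents -> {pairwise_comparisons(n):,} pairs per iteration")
```

Going from 200 to 3,000 documents multiplies the collection size by 15 but the pair count by roughly 226, which is why the workload quickly outgrows a sequential processor.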
GPU-BSM: a GPU-based tool to map bisulfite-treated reads.

Directory of Open Access Journals (Sweden)

Andrea Manconi

Cytosine DNA methylation is an epigenetic mark implicated in several biological processes. Bisulfite treatment of DNA is acknowledged as the gold standard technique to study methylation. This technique introduces changes in the genomic DNA by converting cytosines to uracils while 5-methylcytosines remain nonreactive. During PCR amplification 5-methylcytosines are amplified as cytosine, whereas uracils and thymines are amplified as thymine. To detect the methylation levels, bisulfite-treated reads must be aligned against a reference genome. Mapping these reads to a reference genome represents a significant computational challenge, mainly due to the increased search space and the loss of information introduced by the treatment. To deal with this computational challenge we devised GPU-BSM, a tool based on modern Graphics Processing Units. Graphics Processing Units are hardware accelerators that are increasingly being used successfully to accelerate general-purpose scientific applications. GPU-BSM is a tool able to map bisulfite-treated reads from whole genome bisulfite sequencing and reduced representation bisulfite sequencing, and to estimate methylation levels, with the goal of detecting methylation. Due to the massive parallelization obtained by exploiting graphics cards, GPU-BSM aligns bisulfite-treated reads faster than other cutting-edge solutions, while outperforming most of them in terms of unique mapped reads.
GPU in Physics Computation: Case Geant4 Navigation

CERN Document Server

Seiskari, Otto; Niemi, Tapio

2012-01-01

General purpose computing on graphics processing units (GPUs) is a potential method of speeding up scientific computation with low cost and high energy efficiency. We experimented with the particle physics simulation toolkit Geant4 used at CERN to benchmark its geometry navigation functionality on a GPU. The goal was to find out whether Geant4 physics simulations could benefit from GPU acceleration and how difficult it is to modify Geant4 code to run on a GPU. We ported selected parts of Geant4 code to C99 & CUDA and implemented a simple gamma physics simulation utilizing this code to measure efficiency. The performance of the program was tested by running it on two different platforms: NVIDIA GeForce 470 GTX GPU and a 12-core AMD CPU system. Our conclusion was that GPUs can be a competitive alternative to multi-core computers but porting existing software in an efficient way is challenging.
Accelerating large-scale phase-field simulations with GPU

Directory of Open Access Journals (Sweden)

Xiaoming Shi

2017-10-01

A new GPU-based package for accelerating large-scale phase-field simulations was developed using the semi-implicit Fourier method. The package can solve a variety of equilibrium equations with different inhomogeneity, including long-range elastic, magnetostatic, and electrostatic interactions. Using a specific algorithm in Compute Unified Device Architecture (CUDA), the Fourier spectral iterative perturbation method was integrated into the GPU package. The Allen-Cahn equation, the Cahn-Hilliard equation, and a phase-field model with long-range interaction were each solved with the algorithm running on the GPU to test the performance of the package. Comparing the results of the solver executed on a single CPU with those of the solver on the GPU, it was found that the GPU version is about 50 times faster. The present study therefore contributes to the acceleration of large-scale phase-field simulations and provides guidance for experiments to design large-scale functional devices.
GPU Linear algebra extensions for GNU/Octave

International Nuclear Information System (INIS)

Bosi, L B; Mariotti, M; Santocchia, A

2012-01-01

Octave is one of the most widely used open source tools for numerical analysis and linear algebra. Our project aims to improve Octave by introducing support for GPU computing in order to speed up some linear algebra operations. The core of our work is a C library that executes some BLAS operations concerning vector-vector, vector-matrix and matrix-matrix functions on the GPU. OpenCL functions are used to program GPU kernels, which are bound within the GNU/Octave framework. We report the project implementation design and some preliminary results about performance.
Fully 3D GPU PET reconstruction

Energy Technology Data Exchange (ETDEWEB)

Herraiz, J.L., E-mail: joaquin@nuclear.fis.ucm.es [Grupo de Fisica Nuclear, Departmento Fisica Atomica, Molecular y Nuclear, Universidad Complutense de Madrid (Spain); Espana, S. [Department of Radiation Oncology, Massachusetts General Hospital and Harvard Medical School, Boston, MA (United States); Cal-Gonzalez, J. [Grupo de Fisica Nuclear, Departmento Fisica Atomica, Molecular y Nuclear, Universidad Complutense de Madrid (Spain); Vaquero, J.J. [Departmento de Bioingenieria e Ingenieria Espacial, Universidad Carlos III, Madrid (Spain); Desco, M. [Departmento de Bioingenieria e Ingenieria Espacial, Universidad Carlos III, Madrid (Spain); Unidad de Medicina y Cirugia Experimental, Hospital General Universitario Gregorio Maranon, Madrid (Spain); Udias, J.M. [Grupo de Fisica Nuclear, Departmento Fisica Atomica, Molecular y Nuclear, Universidad Complutense de Madrid (Spain)

2011-08-21

Fully 3D iterative tomographic image reconstruction is computationally very demanding. Graphics Processing Units (GPUs) have been proposed for many years as potential accelerators in complex scientific problems, but it is only with the recent advances in the programmability of GPUs that the best available reconstruction codes have started to be implemented to run on GPUs. This work presents a GPU-based fully 3D PET iterative reconstruction software. This new code may reconstruct sinogram data from several commercially available PET scanners. The most important and time-consuming parts of the code, the forward and backward projection operations, are based on an accurate model of the scanner obtained with the Monte Carlo code PeneloPET and they have been massively parallelized on the GPU. For the PET scanners considered, the GPU-based code is more than 70 times faster than a similar code running on a single core of a fast CPU, obtaining in both cases the same images. The code has been designed to be easily adapted to reconstruct sinograms from any other PET scanner, including scanner prototypes.
Quantum Chemical Calculations Using Accelerators: Migrating Matrix Operations to the NVIDIA Kepler GPU and the Intel Xeon Phi.

Science.gov (United States)

Leang, Sarom S; Rendell, Alistair P; Gordon, Mark S

2014-03-11

Increasingly, modern computer systems comprise a multicore general-purpose processor augmented with a number of special purpose devices or accelerators connected via an external interface such as a PCI bus. The NVIDIA Kepler Graphical Processing Unit (GPU) and the Intel Phi are two examples of such accelerators. Accelerators offer peak performances that can be well above those of the host processor. How to exploit this heterogeneous environment for legacy application codes is not, however, straightforward. This paper considers how matrix operations in typical quantum chemical calculations can be migrated to the GPU and Phi systems. Double precision general matrix multiply operations are endemic in electronic structure calculations, especially methods that include electron correlation, such as density functional theory, second order perturbation theory, and coupled cluster theory. The use of approaches that automatically determine whether to use the host or an accelerator, based on problem size, is explored, with computations that are occurring on the accelerator and/or the host. For data-transfers over PCI-e, the GPU provides the best overall performance for data sizes up to 4096 MB with consistent upload and download rates between 5-5.6 GB/s and 5.4-6.3 GB/s, respectively. The GPU outperforms the Phi for both square and nonsquare matrix multiplications.
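The transfer rates quoted in this abstract translate directly into a time cost that any host-or-accelerator decision has to amortize. The sketch below estimates that cost for the largest data size mentioned; it assumes 1 GB = 1024 MB and a constant sustained rate, and the function name is illustrative rather than taken from the paper.

```python
def transfer_seconds(size_mb, rate_gb_per_s, mb_per_gb=1024):
    """Time to move size_mb megabytes at a sustained rate given in GB/s."""
    return size_mb / mb_per_gb / rate_gb_per_s


# Rates quoted in the abstract: upload 5-5.6 GB/s, download 5.4-6.3 GB/s.
upload = (transfer_seconds(4096, 5.6), transfer_seconds(4096, 5.0))
download = (transfer_seconds(4096, 6.3), transfer_seconds(4096, 5.4))
print(f"upload of 4096 MB:   {upload[0]:.2f} to {upload[1]:.2f} s")
print(f"download of 4096 MB: {download[0]:.2f} to {download[1]:.2f} s")
```

A matrix operation is worth offloading only if the time saved on the accelerator exceeds transfer costs of this size, which is the trade-off behind choosing the device based on problem size.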

Superconducting TESLA cavities

Directory of Open Access Journals (Sweden)

B. Aune

2000-09-01

The conceptual design of the proposed linear electron-positron collider TESLA is based on 9-cell 1.3 GHz superconducting niobium cavities with an accelerating gradient of E_acc ≥ 25 MV/m at a quality factor Q_0 ≥ 5×10^9. The design goal for the cavities of the TESLA Test Facility (TTF) linac was set to the more moderate value of E_acc ≥ 15 MV/m. In a first series of 27 industrially produced TTF cavities the average gradient at Q_0 = 5×10^9 was measured to be 20.1±6.2 MV/m, excluding a few cavities suffering from serious fabrication or material defects. In the second production of 24 TTF cavities, additional quality control measures were introduced, in particular, an eddy-current scan to eliminate niobium sheets with foreign material inclusions and stringent prescriptions for carrying out the electron-beam welds. The average gradient of these cavities at Q_0 = 5×10^9 amounts to 25.0±3.2 MV/m with the exception of one cavity suffering from a weld defect. Hence only a moderate improvement in production and preparation techniques will be needed to meet the ambitious TESLA goal with an adequate safety margin. In this paper we present a detailed description of the design, fabrication, and preparation of the TESLA Test Facility cavities and their associated components and report on cavity performance in test cryostats and with electron beam in the TTF linac. The ongoing research and development towards higher gradients is briefly addressed.
A Performance/Cost Evaluation for a GPU-Based Drug Discovery Application on Volunteer Computing

Directory of Open Access Journals (Sweden)

Ginés D. Guerrero

2014-01-01

Bioinformatics is an interdisciplinary research field that develops tools for the analysis of large biological databases, and, thus, the use of high performance computing (HPC) platforms is mandatory for the generation of useful biological knowledge. The latest generation of graphics processing units (GPUs) has democratized the use of HPC as they push desktop computers to cluster-level performance. Many applications within this field have been developed to leverage these powerful and low-cost architectures. However, these applications still need to scale to larger GPU-based systems to enable remarkable advances in the fields of healthcare, drug discovery, genome research, etc. The inclusion of GPUs in HPC systems exacerbates power and temperature issues, increasing the total cost of ownership (TCO). This paper explores the benefits of volunteer computing to scale bioinformatics applications as an alternative to owning large GPU-based local infrastructures. We use as a benchmark a GPU-based drug discovery application called BINDSURF, whose computational requirements go beyond a single desktop machine. Volunteer computing is presented as a cheap and valid HPC system for those bioinformatics applications that need to process huge amounts of data and where the response time is not a critical factor.
TESLA: Large Signal Simulation Code for Klystrons

International Nuclear Information System (INIS)

Vlasov, Alexander N.; Cooke, Simon J.; Chernin, David P.; Antonsen, Thomas M. Jr.; Nguyen, Khanh T.; Levush, Baruch

2003-01-01

TESLA (Telegraphist's Equations Solution for Linear Beam Amplifiers) is a new code designed to simulate linear beam vacuum electronic devices with cavities, such as klystrons, extended interaction klystrons, twistrons, and coupled cavity amplifiers. The model includes a self-consistent, nonlinear solution of the three-dimensional electron equations of motion and the solution of time-dependent field equations. The model differs from the conventional Particle in Cell approach in that the field spectrum is assumed to consist of a carrier frequency and its harmonics with slowly varying envelopes. Also, fields in the external cavities are modeled with circuit-like equations and couple to fields in the beam region through boundary conditions on the beam tunnel wall. The model in TESLA is an extension of the model used in the gyrotron code MAGY. The TESLA formulation has been extended to treat the multiple beam case, in which each beam is transported inside its own tunnel. The beams interact with each other as they pass through the gaps in their common cavities. The interaction is treated by modification of the boundary conditions on the wall of each tunnel to include the effect of adjacent beams as well as the fields excited in each cavity. The extended version of TESLA for the multiple beam case, TESLA-MB, has been developed for single processor machines, and can run on UNIX machines and on PCs with a large memory (above 2GB). The TESLA-MB algorithm is currently being modified to simulate multiple beam klystrons on multiprocessor machines using the MPI (Message Passing Interface) environment. The code TESLA has been verified by comparison with MAGIC for single and multiple beam cases. The TESLA code and the MAGIC code predict the same power within 1% for a simple two cavity klystron design while the computational time for TESLA is orders of magnitude less than for MAGIC 2D. In addition, recently TESLA was used to model the L-6048 klystron, code
GPU acceleration of Eulerian-Lagrangian particle-laden turbulent flow simulations

Science.gov (United States)

Richter, David; Sweet, James; Thain, Douglas

2017-11-01

The Lagrangian point-particle approximation is a popular numerical technique for representing dispersed phases whose properties can substantially deviate from the local fluid. In many cases, particularly in the limit of one-way coupled systems, large numbers of particles are desired; this may be either because many physical particles are present (e.g. LES of an entire cloud), or because the use of many particles increases statistical convergence (e.g. high-order statistics). Solving the trajectories of very large numbers of particles can be problematic in traditional MPI implementations, however, and this study reports the benefits of using graphical processing units (GPUs) to integrate the particle equations of motion while preserving the original MPI version of the Eulerian flow solver. It is found that GPU acceleration becomes cost effective around one million particles, and performance enhancements of up to 15x can be achieved when O(10⁸) particles are computed on the GPU rather than the CPU cluster. Optimizations and limitations will be discussed, as will prospects for expanding to two- and four-way coupled systems. ONR Grant No. N00014-16-1-2472.
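To illustrate why particle integration offloads well to a GPU, the following is a minimal sketch of one time step for one-way coupled point particles. It is not the authors' code: the Stokes-type drag law, the explicit Euler update, the one-dimensional setting, and the names `advance_particles`, `fluid_velocity` and `tau_p` are all assumptions made for illustration. The point is that each particle is updated independently of every other, so the loop body could in principle be mapped to one GPU thread per particle.

```python
def advance_particles(x, v, fluid_velocity, tau_p, dt):
    """Advance point particles by one explicit Euler step (1D for brevity).

    Assumed model (not from the abstract): linear drag toward the local
    fluid velocity with response time tau_p, i.e. dv/dt = (u(x) - v) / tau_p.
    """
    new_x, new_v = [], []
    for xi, vi in zip(x, v):
        u = fluid_velocity(xi)           # fluid velocity sampled at the particle
        accel = (u - vi) / tau_p         # drag acceleration (one-way coupling)
        new_x.append(xi + dt * vi)       # position uses the old velocity
        new_v.append(vi + dt * accel)    # velocity relaxes toward the fluid
    return new_x, new_v


# Usage: a particle at rest in a uniform flow of unit velocity.
x, v = advance_particles([0.0], [0.0], lambda pos: 1.0, tau_p=1.0, dt=0.1)
```

Because the fluid is not modified by the particles in the one-way coupled limit, no synchronization between particles is needed within a step.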
GPU accelerated population annealing algorithm

Science.gov (United States)

Barash, Lev Yu.; Weigel, Martin; Borovský, Michal; Janke, Wolfhard; Shchur, Lev N.

2017-11-01

steps and multi-histogram reweighting. Additional comments: Code repository at https://github.com/LevBarash/PAising. The system size and size of the population of replicas are limited depending on the memory of the GPU device used. For the default parameter values used in the sample programs, L = 64, θ = 100, β0 = 0, βf = 1, Δβ = 0.005, R = 20 000, a typical run time on an NVIDIA Tesla K80 GPU is 151 seconds for the single spin coded (SSC) and 17 seconds for the multi-spin coded (MSC) program (see Section 2 for a description of these parameters).
Small horizontal emittance in the TESLA damping ring

International Nuclear Information System (INIS)

Decking, W.

2001-01-01

The present TESLA damping ring is designed for a normalized horizontal emittance of 8×10⁻⁶ m. γ-γ collisions at the TESLA linear collider will benefit from a further decrease of the horizontal emittance. This paper reviews the processes which limit the horizontal emittance in the damping ring. Preliminary estimates on the smallest horizontal emittance for the present TESLA damping ring design as well as an ultimate limit of the emittance reachable with the TESLA damping ring concept will be given.
NIKOLA TESLA AND THE X-RAY

OpenAIRE

Rade R. Babic

2005-01-01

After Professor Wilhelm Konrad Röntgen published his study on the discovery of x-rays (Academy Bulletin, Berlin, 08. 11. 1895), Nikola Tesla published his first study of x-rays on 11 March 1896 (X-ray, Electrical Review). By 11 August 1897 he had published ten studies on this subject. All of Tesla's x-ray studies were experimental, which is characteristic of his work. Studying the nature of x-rays, he established a new medical branch, radiology. He wrote: "There's no doubt...
GPU-accelerated computation of electron transfer.

Science.gov (United States)

Höfinger, Siegfried; Acocella, Angela; Pop, Sergiu C; Narumi, Tetsu; Yasuoka, Kenji; Beu, Titus; Zerbetto, Francesco

2012-11-05

Electron transfer is a fundamental process that can be studied with the help of computer simulation. The underlying quantum mechanical description renders the problem a computationally intensive application. In this study, we probe the graphics processing unit (GPU) for suitability to this type of problem. Time-critical components are identified via profiling of an existing implementation and several different variants are tested involving the GPU at increasing levels of abstraction. A publicly available library supporting basic linear algebra operations on the GPU turns out to accelerate the computation approximately 50-fold with minor dependence on actual problem size. The performance gain does not compromise numerical accuracy and is of significant value for practical purposes. Copyright © 2012 Wiley Periodicals, Inc.
Blaze-DEMGPU: Modular high performance DEM framework for the GPU architecture

Directory of Open Access Journals (Sweden)

Nicolin Govender

2016-01-01

Blaze-DEMGPU is a modular GPU-based discrete element method (DEM) framework that supports polyhedral shaped particles. The high level of performance is attributed to the light-weight, Single Instruction Multiple Data (SIMD) execution that the GPU architecture offers. Blaze-DEMGPU offers suitable algorithms to conduct DEM simulations on the GPU, and these algorithms can be extended and modified. Since a large number of scientific simulations are particle based, many of the algorithms and strategies for GPU implementation present in Blaze-DEMGPU can be applied to other fields. Blaze-DEMGPU will make it easier for new researchers to use high performance GPU computing as well as stimulate wider GPU research efforts by the DEM community.
Photon collider at TESLA

International Nuclear Information System (INIS)

Telnov, Valery

2001-01-01

High energy photon colliders (γγ, γe) based on backward Compton scattering of laser light are a very natural addition to e⁺e⁻ linear colliders. In this report, we consider this option for the TESLA project. A recent study has shown that the horizontal emittance in the TESLA damping ring can be further decreased by a factor of four. In this case, the γγ luminosity in the high energy part of the spectrum can reach about (1/3)L_{e⁺e⁻}. Typical cross-sections of interesting processes in γγ collisions are higher than those in e⁺e⁻ collisions by about one order of magnitude, so the number of events in γγ collisions will be more than that in e⁺e⁻ collisions. Photon colliders can certainly give additional information, and they are the best for the study of many phenomena. The main question is now the technical feasibility. The key new element in photon colliders is a very powerful laser system. An external optical cavity is a promising approach for the TESLA project. A free electron laser is another option. However, a more straightforward solution is "an optical storage ring (optical trap)" with a diode pumped solid state laser injector, which is today technically feasible. This paper briefly reviews the status of a photon collider based on the linear collider TESLA, its possible parameters and existing problems.
GPU-based high performance Monte Carlo simulation in neutron transport

Energy Technology Data Exchange (ETDEWEB)

Heimlich, Adino; Mol, Antonio C.A.; Pereira, Claudio M.N.A. [Instituto de Engenharia Nuclear (IEN/CNEN-RJ), Rio de Janeiro, RJ (Brazil). Lab. de Inteligencia Artificial Aplicada], e-mail: cmnap@ien.gov.br

2009-07-01

Graphics Processing Units (GPU) are high performance co-processors intended, originally, to improve the use and quality of computer graphics applications. Since researchers and practitioners realized the potential of using GPU for general purpose, their application has been extended to other fields out of computer graphics scope. The main objective of this work is to evaluate the impact of using GPU in neutron transport simulation by Monte Carlo method. To accomplish that, GPU- and CPU-based (single and multicore) approaches were developed and applied to a simple, but time-consuming problem. Comparisons demonstrated that the GPU-based approach is about 15 times faster than a parallel 8-core CPU-based approach also developed in this work. (author)
GPU-based high performance Monte Carlo simulation in neutron transport

International Nuclear Information System (INIS)

Heimlich, Adino; Mol, Antonio C.A.; Pereira, Claudio M.N.A.

2009-01-01

Graphics Processing Units (GPU) are high performance co-processors intended, originally, to improve the use and quality of computer graphics applications. Since researchers and practitioners realized the potential of using GPU for general purpose, their application has been extended to other fields out of computer graphics scope. The main objective of this work is to evaluate the impact of using GPU in neutron transport simulation by Monte Carlo method. To accomplish that, GPU- and CPU-based (single and multicore) approaches were developed and applied to a simple, but time-consuming problem. Comparisons demonstrated that the GPU-based approach is about 15 times faster than a parallel 8-core CPU-based approach also developed in this work. (author)
Collision repair of a Tesla Roadster; Tesla Roadsterin vauriokorjaus

OpenAIRE

Hiltunen, Santeri

2016-01-01

This engineering thesis examined the structure and repair of the Tesla electric car, electrical work safety, and electrical engineering. The aim of the work is to determine what the law requires for repairing an electric car, and how to obtain spare parts and instructions for a vehicle that has no importer in Finland. A further aim was to determine what kind of car is involved and what materials were used in it. In the work, a Tesla Roadster electric car with mechanical damage to its rear was repaired...
About the origin of matter - the TESLA project

International Nuclear Information System (INIS)

Heuer, R.D.

2004-01-01

An introduction to the TESLA project is given. After a general introduction to the standard model of elementary particles together with some possible extension the scientific potential of TESLA is described with special regards to the production of Higgs bosons and supersymmetric particles. Finally the technology of TESLA is considered. (HSI)
Work-Efficient Parallel Skyline Computation for the GPU

DEFF Research Database (Denmark)

Bøgh, Kenneth Sejdenfaden; Chester, Sean; Assent, Ira

2015-01-01

offers the potential for parallelizing skyline computation across thousands of cores. However, attempts to port skyline algorithms to the GPU have prioritized throughput and failed to outperform sequential algorithms. In this paper, we introduce a new skyline algorithm, designed for the GPU, that uses...... a global, static partitioning scheme. With the partitioning, we can permit controlled branching to exploit transitive relationships and avoid most point-to-point comparisons. The result is a non-traditional GPU algorithm, SkyAlign, that prioritizes work-efficiency and respectable throughput, rather than...
Tesla Coil Theoretical Model and its Experimental Verification

Directory of Open Access Journals (Sweden)

Voitkans Janis

2014-12-01

In this paper a theoretical model of Tesla coil operation is proposed. The Tesla coil is described as a long line with distributed parameters in a single-wire form, where the line voltage is measured across electrically neutral space. By applying the principle of equivalence of single-wire and two-wire schemes, an equivalent two-wire scheme can be found for a single-wire scheme, and the already known long line theory can be applied to the Tesla coil. A new method of multiple reflections is developed to characterize a signal in a long line. Formulas for calculating the voltage in the Tesla coil as a function of coordinate and for calculating the resonance frequencies are proposed. The theoretical calculations are verified experimentally. Resonance frequencies of the Tesla coil are measured and voltage standing wave characteristics are obtained for different output capacities in the single-wire mode. The wave resistance and phase coefficient of the Tesla coil are obtained. Experimental measurements show good agreement with the proposed theory. The formulas obtained in this paper are also usable for a regular two-wire long line with distributed parameters.
NLSEmagic: Nonlinear Schrödinger equation multi-dimensional Matlab-based GPU-accelerated integrators using compact high-order schemes

Science.gov (United States)

Caplan, R. M.

2013-04-01

We present a simple-to-use, yet powerful code package called NLSEmagic to numerically integrate the nonlinear Schrödinger equation in one, two, and three dimensions. NLSEmagic is a high-order finite-difference code package which utilizes graphic processing unit (GPU) parallel architectures. The codes running on the GPU are many times faster than their serial counterparts, and are much cheaper to run than on standard parallel clusters. The codes are developed with usability and portability in mind, and therefore are written to interface with MATLAB utilizing custom GPU-enabled C codes with the MEX-compiler interface. The packages are freely distributed, including user manuals and set-up files. Catalogue identifier: AEOJ_v1_0 Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEOJ_v1_0.html Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 124453 No. of bytes in distributed program, including test data, etc.: 4728604 Distribution format: tar.gz Programming language: C, CUDA, MATLAB. Computer: PC, MAC. Operating system: Windows, MacOS, Linux. Has the code been vectorized or parallelized?: Yes. Number of processors used: Single CPU, number of GPU processors dependent on chosen GPU card (max is currently 3072 cores on GeForce GTX 690). Supplementary material: Setup guide, Installation guide. RAM: Highly dependent on dimensionality and grid size. For typical medium-large problem size in three dimensions, 4GB is sufficient. Keywords: Nonlinear Schrödinger Equation, GPU, high-order finite difference, Bose-Einstein condensates. Classification: 4.3, 7.7. Nature of problem: Integrate solutions of the time-dependent one-, two-, and three-dimensional cubic nonlinear Schrödinger equation. Solution method: The integrators utilize a fully-explicit fourth-order Runge-Kutta scheme in time
GPU accelerated manifold correction method for spinning compact binaries

Science.gov (United States)

Ran, Chong-xi; Liu, Song; Zhong, Shuang-ying

2018-04-01

The graphics processing unit (GPU) acceleration of the manifold correction algorithm, based on the compute unified device architecture (CUDA) technology, is designed to simulate the dynamic evolution of the post-Newtonian (PN) Hamiltonian formulation of spinning compact binaries. The feasibility and efficiency of parallel computation on the GPU have been confirmed by various numerical experiments. The numerical comparisons show that the accuracy of the manifold correction method executed on the GPU agrees well with that of the code executed on the central processing unit (CPU) alone. The acceleration achieved on the GPU increases enormously through the use of shared memory and register optimization techniques, without additional hardware costs: the speedup is nearly 13 times compared with the code executed on the CPU for a phase space scan (including 314 × 314 orbits). In addition, the GPU-accelerated manifold correction method is used to study numerically how the dynamics are affected by the spin-induced quadrupole-monopole interaction for a black hole binary system.
GPU-accelerated Gibbs ensemble Monte Carlo simulations of Lennard-Jonesium

Science.gov (United States)

Mick, Jason; Hailat, Eyad; Russo, Vincent; Rushaidat, Kamel; Schwiebert, Loren; Potoff, Jeffrey

2013-12-01

This work describes an implementation of canonical and Gibbs ensemble Monte Carlo simulations on graphics processing units (GPUs). The pair-wise energy calculations, which consume the majority of the computational effort, are parallelized using the energetic decomposition algorithm. While energetic decomposition is relatively inefficient for traditional CPU-bound codes, the algorithm is ideally suited to the architecture of the GPU. The performance of the CPU and GPU codes are assessed for a variety of CPU and GPU combinations for systems containing between 512 and 131,072 particles. For a system of 131,072 particles, the GPU-enabled canonical and Gibbs ensemble codes were 10.3 and 29.1 times faster (GTX 480 GPU vs. i5-2500K CPU), respectively, than an optimized serial CPU-bound code. Due to overhead from memory transfers from system RAM to the GPU, the CPU code was slightly faster than the GPU code for simulations containing less than 600 particles. The critical temperature Tc∗=1.312(2) and density ρc∗=0.316(3) were determined for the tail corrected Lennard-Jones potential from simulations of 10,000 particle systems, and found to be in exact agreement with prior mixed field finite-size scaling calculations [J.J. Potoff, A.Z. Panagiotopoulos, J. Chem. Phys. 109 (1998) 10914].
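The pair-wise energy calculation that dominates the cost can be sketched as follows. This is a plain serial illustration in reduced units, not the authors' GPU code: the function name `lj_total_energy`, the cubic periodic box with the minimum image convention, and the plain truncation at `r_cut` (with no tail correction) are assumptions. Each pair contributes an independent term to the sum, which is the property that makes a decomposition over pairs suitable for many parallel threads followed by a reduction.

```python
def lj_total_energy(positions, box, r_cut):
    """Total Lennard-Jones energy in reduced units (epsilon = sigma = 1).

    Assumptions for this sketch: cubic periodic box of side `box`,
    minimum image convention, truncation at r_cut, no tail correction.
    """
    n = len(positions)
    energy = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            r2 = 0.0
            for a in range(3):
                d = positions[i][a] - positions[j][a]
                d -= box * round(d / box)      # nearest periodic image
                r2 += d * d
            if r2 < r_cut * r_cut:
                inv6 = 1.0 / r2 ** 3           # (1/r)^6
                energy += 4.0 * (inv6 * inv6 - inv6)
    return energy


# Usage: two particles at the potential minimum r = 2^(1/6) give energy -1.
e_min = lj_total_energy([(0.0, 0.0, 0.0), (2 ** (1 / 6), 0.0, 0.0)], 10.0, 3.0)
```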
GPU-Vote: A Framework for Accelerating Voting Algorithms on GPU.

NARCIS (Netherlands)

Braak, van den G.J.W.; Nugteren, C.; Mesman, B.; Corporaal, H.; Kaklamanis, C.; Papatheodorou, T.; Spirakis, P.G.

2012-01-01

Voting algorithms, such as histogram and Hough transforms, are frequently used algorithms in various domains, such as statistics and image processing. Algorithms in these domains may be accelerated using GPUs. Implementing voting algorithms efficiently on a GPU however is far from trivial due to

Development of parallel GPU based algorithms for problems in nuclear area; Desenvolvimento de algoritmos paralelos baseados em GPU para solucao de problemas na area nuclear

Energy Technology Data Exchange (ETDEWEB)

Almeida, Adino Americo Heimlich

2009-07-01

Graphics Processing Units (GPU) are high performance co-processors intended, originally, to improve the use and quality of computer graphics applications. Since researchers and practitioners realized the potential of using GPU for general purpose, their application has been extended to other fields out of computer graphics scope. The main objective of this work is to evaluate the impact of using GPU in two typical problems of the nuclear area: neutron transport simulation using the Monte Carlo method, and the solution of the heat equation in a two-dimensional domain by the finite difference method. To achieve this, we developed parallel algorithms for GPU and CPU for the two problems described above. The comparison showed that the GPU-based approach is faster than the CPU in a computer with two quad-core processors, without loss of precision. (author)
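As an illustration of the second problem, the sketch below performs one finite-difference update of the two-dimensional heat equation. The abstract does not state which scheme was used, so the explicit forward-time, centred-space stencil, the fixed boundary values, the uniform grid spacing, and the name `heat_step` are assumptions. Every interior point is updated from the previous time level only, which is why this kind of stencil parallelizes naturally.

```python
def heat_step(u, alpha, dx, dt):
    """One explicit (FTCS) step of u_t = alpha * (u_xx + u_yy) on a uniform grid.

    Boundary values are held fixed. The scheme is stable only if
    alpha * dt / dx**2 <= 0.25; this is the caller's responsibility.
    """
    ny, nx = len(u), len(u[0])
    r = alpha * dt / dx ** 2
    new = [row[:] for row in u]                  # copy keeps the boundary rows/columns
    for j in range(1, ny - 1):
        for i in range(1, nx - 1):
            laplacian = (u[j][i + 1] + u[j][i - 1]
                         + u[j + 1][i] + u[j - 1][i] - 4.0 * u[j][i])
            new[j][i] = u[j][i] + r * laplacian  # reads only the old time level
    return new


# Usage: a single hot cell surrounded by a cold boundary.
grid = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
grid = heat_step(grid, alpha=1.0, dx=1.0, dt=0.25)
```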
cellGPU: Massively parallel simulations of dynamic vertex models

Science.gov (United States)

Sussman, Daniel M.

2017-10-01

Vertex models represent confluent tissue by polygonal or polyhedral tilings of space, with the individual cells interacting via force laws that depend on both the geometry of the cells and the topology of the tessellation. This dependence on the connectivity of the cellular network introduces several complications to performing molecular-dynamics-like simulations of vertex models, and in particular makes parallelizing the simulations difficult. cellGPU addresses this difficulty and lays the foundation for massively parallelized, GPU-based simulations of these models. This article discusses its implementation for a pair of two-dimensional models, and compares the typical performance that can be expected between running cellGPU entirely on the CPU versus its performance when running on a range of commercial and server-grade graphics cards. By implementing the calculation of topological changes and forces on cells in a highly parallelizable fashion, cellGPU enables researchers to simulate time- and length-scales previously inaccessible via existing single-threaded CPU implementations. Program Files doi:http://dx.doi.org/10.17632/6j2cj29t3r.1 Licensing provisions: MIT Programming language: CUDA/C++ Nature of problem: Simulations of off-lattice "vertex models" of cells, in which the interaction forces depend on both the geometry and the topology of the cellular aggregate. Solution method: Highly parallelized GPU-accelerated dynamical simulations in which the force calculations and the topological features can be handled on either the CPU or GPU. Additional comments: The code is hosted at https://gitlab.com/dmsussman/cellGPU, with documentation additionally maintained at http://dmsussman.gitlab.io/cellGPUdocumentation
TeSLA e-assessment workshop pilot 2

OpenAIRE

Janssen, José

2017-01-01

Presentation for the e-assessment workshop for teachers of the Open Universiteit Nederland involved in the second TeSLA pilot. Topics: exam fraud, assessment design, technology for authentication and verification of authorship, the TeSLA instrument.
T2 relaxation time in patellar cartilage - global and regional reproducibility at 1.5 Tesla and 3 Tesla

International Nuclear Information System (INIS)

Glaser, C.; Horng, A.; Mendlik, T.; Weckbach, S.; Hoffmann, R.T.; Wagner, S.; Raya, J.G.; Reiser, M.; Horger, W.

2007-01-01

Purpose: Evaluation of the global and regional reproducibility of T2 relaxation time in patellar cartilage at 1.5 T and 3 T. Materials and Methods: 6 left patellae of 6 healthy volunteers (aged 25-30, 3 female, 3 male) were examined using a fat-saturated multiecho sequence and a T1-w 3D-FLASH sequence with water excitation at 1.5 Tesla and 3 Tesla. Three consecutive data sets were acquired within one MRI session with the examined knee being repositioned in the coil and scanner between each data set. The segmented cartilage (FLASH sequence) was overlaid on the multiecho data and T2 values were calculated for the total cartilage, 3 horizontal layers consisting of a superficial, intermedial and deep layer, 3 facets consisting of a medial, median (ridge) and lateral facet (global T2 values) and 27 ROIs/MRI slices (regional T2 value). The reproducibility (precision error) was calculated as the root mean square average of the individual standard deviations [ms] and coefficients of variation (COV) [%]. Results: The mean global reproducibility error for T2 was 3.53% (±0.38%) at 1.5 Tesla and 3.25% (±0.61%) at 3 Tesla. The mean regional reproducibility error for T2 was 8.62% (±2.61%) at 1.5 Tesla and 9.66% (±3.37%) at 3 Tesla. There was no significant difference with respect to absolute reproducibility errors between 1.5 Tesla and 3 Tesla at a constant spatial resolution. However, different reproducibility errors were found between the cartilage layers. One third of the data variability could be attributed to the influence of the different cartilage layers, and another 10% to the influence of the separate MRI slices. Conclusion: Our data provides an estimation of the global and regional reproducibility errors of T2 in healthy cartilage. In the analysis of small subregions, an increase in the regional reproducibility error must be accepted. The data may serve as a basis for sample size calculations of study populations and may contribute to the decision regarding the
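The precision error described in the Methods can be sketched as below: one standard deviation and one coefficient of variation per volunteer, combined as a root mean square average. The function name `rms_precision_error` and the use of the sample standard deviation (n - 1 in the denominator) are assumptions; the abstract does not say which estimator was used, and the numbers in the usage line are made up for illustration, not study data.

```python
import math
import statistics


def rms_precision_error(repeats):
    """Root mean square average of per-subject SD and COV.

    `repeats` holds one list of repeated T2 measurements [ms] per subject.
    Returns (rms_sd in ms, rms_cov in percent).
    """
    sds = [statistics.stdev(r) for r in repeats]          # sample SD per subject
    covs = [100.0 * sd / statistics.mean(r) for sd, r in zip(sds, repeats)]
    rms_sd = math.sqrt(sum(s * s for s in sds) / len(sds))
    rms_cov = math.sqrt(sum(c * c for c in covs) / len(covs))
    return rms_sd, rms_cov


# Usage with illustrative values: two subjects, three repeated scans each.
rms_sd, rms_cov = rms_precision_error([[10.0, 10.0, 10.0], [9.0, 10.0, 11.0]])
```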
TeSLA workshop on trustworthy remote assessment; TeSLA workshop betrouwbaar toetsen op afstand

NARCIS (Netherlands)

Brouns, Francis; Janssen, José

2017-01-01

Presentation for the workshop on trustworthy remote assessment for teachers of the Open Universiteit Nederland involved in the third TeSLA pilot. Topics: exam fraud, assessment design, technology for authentication and verification of authorship, the TeSLA instruments.
CPU and GPU (CUDA) Template Matching Comparison

Directory of Open Access Journals (Sweden)

Evaldas Borcovas

2014-05-01

Image processing, computer vision and other complex optical information processing algorithms require large resources, and it is often desirable to execute them in real time. Such requirements are hard to fulfil with a single CPU. NVidia's CUDA technology enables the programmer to use the GPU resources of the computer. The research was carried out with an Intel Pentium Dual-Core T4500 2.3 GHz processor with 4 GB DDR3 RAM (CPU I), an NVidia GeForce GT320M CUDA-compatible graphics card (GPU I), an Intel Core i5-2500K 3.3 GHz processor with 4 GB DDR3 RAM (CPU II), and an NVidia GeForce GTX 560 CUDA-compatible graphics card (GPU II). The OpenCV 2.1 and CUDA-compatible OpenCV 2.4.0 libraries were used for the testing. The main tests were made with the standard function MatchTemplate from the OpenCV libraries. The algorithm uses a main image and a template, and the influence of both was tested: the main image and the template were resized, and the algorithm's computing time and performance in Gtpix/s were measured. According to the results, GPU computing on the hardware mentioned above is up to 24 times faster when processing a large amount of information. When the images are small, the performance of the CPU and GPU does not differ significantly. The choice of the template size influences the computing time on the CPU. The difference in computing time between the GPUs can be explained by the number of cores they have.
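The dependence of run time on image and template size follows from how template matching works, which the sketch below spells out. It is a plain sliding-window search using the sum of squared differences, written for illustration; it is not the OpenCV implementation, and the name `match_template_ssd` and the choice of this particular score are assumptions. The work is proportional to the number of window positions times the template area, and every window position can be scored independently.

```python
def match_template_ssd(image, template):
    """Return (x, y) of the top-left corner where `template` best fits `image`.

    Both inputs are 2D lists of grey values. The score is the sum of squared
    differences (lower is better); ties keep the first position found.
    """
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best_score, best_xy = None, None
    for y in range(ih - th + 1):                 # every vertical offset
        for x in range(iw - tw + 1):             # every horizontal offset
            score = sum((image[y + j][x + i] - template[j][i]) ** 2
                        for j in range(th) for i in range(tw))
            if best_score is None or score < best_score:
                best_score, best_xy = score, (x, y)
    return best_xy


# Usage: locate a 2x2 pattern inside a 4x4 image.
image = [[0, 0, 0, 0],
         [0, 0, 1, 2],
         [0, 0, 3, 4],
         [0, 0, 0, 0]]
position = match_template_ssd(image, [[1, 2], [3, 4]])
```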
Ultrafast convolution/superposition using tabulated and exponential kernels on GPU

Energy Technology Data Exchange (ETDEWEB)

Chen Quan; Chen Mingli; Lu Weiguo [TomoTherapy Inc., 1240 Deming Way, Madison, Wisconsin 53717 (United States)

2011-03-15

Purpose: Collapsed-cone convolution/superposition (CCCS) dose calculation is the workhorse for IMRT dose calculation. The authors present a novel algorithm for computing CCCS dose on the modern graphic processing unit (GPU). Methods: The GPU algorithm includes a novel TERMA calculation that has no write-conflicts and has linear computation complexity. The CCCS algorithm uses either tabulated or exponential cumulative-cumulative kernels (CCKs) as reported in literature. The authors have demonstrated that the use of exponential kernels can reduce the computation complexity by order of a dimension and achieve excellent accuracy. Special attentions are paid to the unique architecture of GPU, especially the memory accessing pattern, which increases performance by more than tenfold. Results: As a result, the tabulated kernel implementation in GPU is two to three times faster than other GPU implementations reported in literature. The implementation of CCCS showed significant speedup on GPU over single core CPU. On tabulated CCK, speedups as high as 70 are observed; on exponential CCK, speedups as high as 90 are observed. Conclusions: Overall, the GPU algorithm using exponential CCK is 1000-3000 times faster over a highly optimized single-threaded CPU implementation using tabulated CCK, while the dose differences are within 0.5% and 0.5 mm. This ultrafast CCCS algorithm will allow many time-sensitive applications to use accurate dose calculation.
Report on the TESLA Engineering Study/Review

Energy Technology Data Exchange (ETDEWEB)

Cornuelle, John C.

2002-08-30

In March, 2001, the TESLA Collaboration published its Technical Design Report (TDR, see references and links in Appendix), the first sentence of which stated "...TESLA (TeV-Energy Superconducting Linear Collider) (will be) a superconducting electron-positron collider of initially 500 GeV total energy, extendable to 800 GeV, and an integrated X-ray laser laboratory." The TDR included cost and manpower estimates for a 500 GeV e⁺e⁻ collider (250 on 250 GeV) based on superconducting RF cavity technology. This was submitted as a proposal to the German government. The government asked the German Science Council to evaluate this proposal. The recommendation from this body is anticipated to be available by November 2002. The government has indicated that it will react on this recommendation by mid-2003. In June 2001, Steve Holmes, Fermilab's Associate Director for Accelerators, commissioned Helen Edwards and Peter Garbincius to organize a study of the TESLA Technical Design Report and the associated cost and manpower estimates. Since the elements and methodology used in producing the TESLA cost estimate were somewhat different from those used in preparing similar estimates for projects within the U.S., it is important to understand the similarities, differences, and equivalences between the TESLA estimate and U.S. cost estimates. In particular, the project cost estimate includes only purchased equipment, materials, and services, but not manpower from DESY or other TESLA collaborating institutions, which is listed separately. It does not include the R&D on the TESLA Test Facility (TTF) nor the costs of preparing the TDR nor the costs of performing the conceptual studies so far. The manpower for the pre-operations commissioning program (up to beam) is included in the estimate, but not the electrical power or liquid nitrogen (for initial cooldown of the cryogenics plant). There is no inclusion of any contingency or management reserve. If
Report on the TESLA Engineering Study/Review

International Nuclear Information System (INIS)

Cornuelle, John C.

2002-01-01

In March, 2001, the TESLA Collaboration published its Technical Design Report (TDR, see references and links in Appendix), the first sentence of which stated "...TESLA (TeV-Energy Superconducting Linear Collider) (will be) a superconducting electron-positron collider of initially 500 GeV total energy, extendable to 800 GeV, and an integrated X-ray laser laboratory." The TDR included cost and manpower estimates for a 500 GeV e⁺e⁻ collider (250 on 250 GeV) based on superconducting RF cavity technology. This was submitted as a proposal to the German government. The government asked the German Science Council to evaluate this proposal. The recommendation from this body is anticipated to be available by November 2002. The government has indicated that it will react on this recommendation by mid-2003. In June 2001, Steve Holmes, Fermilab's Associate Director for Accelerators, commissioned Helen Edwards and Peter Garbincius to organize a study of the TESLA Technical Design Report and the associated cost and manpower estimates. Since the elements and methodology used in producing the TESLA cost estimate were somewhat different from those used in preparing similar estimates for projects within the U.S., it is important to understand the similarities, differences, and equivalences between the TESLA estimate and U.S. cost estimates. In particular, the project cost estimate includes only purchased equipment, materials, and services, but not manpower from DESY or other TESLA collaborating institutions, which is listed separately. It does not include the R&D on the TESLA Test Facility (TTF) nor the costs of preparing the TDR nor the costs of performing the conceptual studies so far. The manpower for the pre-operations commissioning program (up to beam) is included in the estimate, but not the electrical power or liquid nitrogen (for initial cooldown of the cryogenics plant). There is no inclusion of any contingency or management reserve. If the U.S. were to become
[Nikola Tesla in medicine, too].

Science.gov (United States)

Hanzek, Branko; Jakobović, Zvonimir

2007-12-01

Using primary and secondary sources, we have shown in this paper the influence of Nikola Tesla's work on the field of medicine. The description of his experiments conducted within secondary-school education programs aimed to present the popularization of his work in Croatia. Although Tesla was dedicated primarily to physics and was not directly involved in biomedical research, his work significantly contributed to paving the way for medical physics, particularly radiology and high-frequency electrotherapy.
Semi-automatic tool to ease the creation and optimization of GPU programs

DEFF Research Database (Denmark)

Jepsen, Jacob

2014-01-01

We present a tool that reduces the development time of GPU-executable code. We implement a catalogue of common optimizations specific to the GPU architecture. Through the tool, the programmer can semi-automatically transform a computationally-intensive code section into GPU-executable form...... of the transformations can be performed automatically, which makes the tool usable for both novices and experts in GPU programming....
Multi-GPU implementation of a VMAT treatment plan optimization algorithm

International Nuclear Information System (INIS)

Tian, Zhen; Folkerts, Michael; Tan, Jun; Jia, Xun; Jiang, Steve B.; Peng, Fei

2015-01-01

Purpose: Volumetric modulated arc therapy (VMAT) optimization is a computationally challenging problem due to its large data size, high degrees of freedom, and many hardware constraints. High-performance graphics processing units (GPUs) have been used to speed up the computations. However, a GPU’s relatively small memory cannot handle cases with a large dose-deposition coefficient (DDC) matrix, e.g., those with a large target size, multiple targets, multiple arcs, and/or small beamlet size. The main purpose of this paper is to report an implementation of a column-generation-based VMAT algorithm, previously developed in the authors’ group, on a multi-GPU platform to solve the memory limitation problem. While the column-generation-based VMAT algorithm has been previously developed, the GPU implementation details have not been reported. Hence, another purpose is to present detailed techniques employed for GPU implementation. The authors also would like to utilize this particular problem as an example problem to study the feasibility of using a multi-GPU platform to solve large-scale problems in medical physics. Methods: The column-generation approach generates VMAT apertures sequentially by solving a pricing problem (PP) and a master problem (MP) iteratively. In the authors’ method, the sparse DDC matrix is first stored on a CPU in coordinate list format (COO). On the GPU side, this matrix is split into four submatrices according to beam angles, which are stored on four GPUs in compressed sparse row format. Computation of beamlet price, the first step in PP, is accomplished using multiple GPUs. A fast inter-GPU data transfer scheme is accomplished using peer-to-peer access. The remaining steps of the PP and MP problems are implemented on a CPU or a single GPU due to their modest problem scale and computational loads. The Barzilai and Borwein algorithm with a subspace step scheme is adopted here to solve the MP problem. A head and neck (H&N) cancer case is
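As a rough illustration of the storage layout described in the abstract above, the following Python sketch splits a sparse matrix held in coordinate (COO) form into groups and converts each group to compressed sparse row (CSR) form. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the idea that each column (beamlet) carries a group index standing in for its beam angle, and the pure-Python data structures are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical COO representation: a list of (row, col, value) triplets.
Coo = List[Tuple[int, int, float]]
Csr = Tuple[List[int], List[int], List[float]]  # (row_ptr, col_idx, values)


def coo_to_csr(entries: Coo, n_rows: int) -> Csr:
    """Convert COO triplets to CSR arrays using a counting sort on rows."""
    row_ptr = [0] * (n_rows + 1)
    for r, _, _ in entries:
        row_ptr[r + 1] += 1          # count non-zeros per row
    for i in range(n_rows):
        row_ptr[i + 1] += row_ptr[i]  # prefix sum gives row offsets
    col_idx = [0] * len(entries)
    values = [0.0] * len(entries)
    next_slot = row_ptr[:-1].copy()   # next free position in each row
    for r, c, v in entries:
        k = next_slot[r]
        col_idx[k] = c
        values[k] = v
        next_slot[r] += 1
    return row_ptr, col_idx, values


def split_by_group(entries: Coo, col_group: List[int], n_groups: int) -> List[Coo]:
    """Assign each entry to the group of its column (assumed: one group per device)."""
    parts: List[Coo] = [[] for _ in range(n_groups)]
    for r, c, v in entries:
        parts[col_group[c]].append((r, c, v))
    return parts


def csr_matvec(csr: Csr, x: List[float]) -> List[float]:
    """Multiply a CSR matrix by a dense vector."""
    row_ptr, col_idx, values = csr
    return [
        sum(values[k] * x[col_idx[k]] for k in range(row_ptr[i], row_ptr[i + 1]))
        for i in range(len(row_ptr) - 1)
    ]
```

Because the groups partition the columns, summing the per-group products reproduces the product with the full matrix, which is what would allow each submatrix to live on a separate device in a layout of this kind.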
Multi-GPU implementation of a VMAT treatment plan optimization algorithm

Energy Technology Data Exchange (ETDEWEB)

Tian, Zhen, E-mail: Zhen.Tian@UTSouthwestern.edu; Folkerts, Michael; Tan, Jun; Jia, Xun, E-mail: Xun.Jia@UTSouthwestern.edu; Jiang, Steve B., E-mail: Steve.Jiang@UTSouthwestern.edu [Department of Radiation Oncology, University of Texas Southwestern Medical Center, Dallas, Texas 75390 (United States)]; Peng, Fei [Computer Science Department, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213 (United States)]

2015-06-15

Purpose: Volumetric modulated arc therapy (VMAT) optimization is a computationally challenging problem due to its large data size, high degrees of freedom, and many hardware constraints. High-performance graphics processing units (GPUs) have been used to speed up the computations. However, a GPU’s relatively small memory cannot handle cases with a large dose-deposition coefficient (DDC) matrix, e.g., those with a large target size, multiple targets, multiple arcs, and/or small beamlet size. The main purpose of this paper is to report an implementation of a column-generation-based VMAT algorithm, previously developed in the authors’ group, on a multi-GPU platform to solve the memory limitation problem. While the column-generation-based VMAT algorithm has been previously developed, the GPU implementation details have not been reported. Hence, another purpose is to present detailed techniques employed for GPU implementation. The authors also would like to utilize this particular problem as an example problem to study the feasibility of using a multi-GPU platform to solve large-scale problems in medical physics. Methods: The column-generation approach generates VMAT apertures sequentially by solving a pricing problem (PP) and a master problem (MP) iteratively. In the authors’ method, the sparse DDC matrix is first stored on a CPU in coordinate list format (COO). On the GPU side, this matrix is split into four submatrices according to beam angles, which are stored on four GPUs in compressed sparse row format. Computation of beamlet price, the first step in PP, is accomplished using multiple GPUs. A fast inter-GPU data transfer scheme is accomplished using peer-to-peer access. The remaining steps of the PP and MP problems are implemented on a CPU or a single GPU due to their modest problem scale and computational loads. The Barzilai and Borwein algorithm with a subspace step scheme is adopted here to solve the MP problem. A head and neck (H&N) cancer case is
SU-D-BRD-03: A Gateway for GPU Computing in Cancer Radiotherapy Research

Energy Technology Data Exchange (ETDEWEB)

Jia, X; Folkerts, M [The University of Texas Southwestern Medical Ctr, Dallas, TX (United States); Shi, F; Yan, H; Yan, Y; Jiang, S [UT Southwestern Medical Center, Dallas, TX (United States); Sivagnanam, S; Majumdar, A [University of California San Diego, La Jolla, CA (United States)

2014-06-01

Purpose: The graphics processing unit (GPU) has become increasingly important in radiotherapy. However, it is still difficult for general clinical researchers to access GPU codes developed by other researchers, and for developers to benchmark their codes objectively. Moreover, effort is often duplicated in developing low-quality GPU codes. The goal of this project is to establish an infrastructure for testing GPU codes, cross-comparing them, and facilitating code distribution in the radiotherapy community. Methods: We developed a system called Gateway for GPU Computing in Cancer Radiotherapy Research (GCR2). A number of GPU codes developed by our group and other developers can be accessed via a web interface. To use the services, researchers first upload their test data or use the standard data provided by our system. Then they can select the GPU device on which the code will be executed. Our system offers all mainstream GPU hardware for code benchmarking purposes. After the code has finished running, the system automatically summarizes and displays the computing results. We also released an SDK to allow developers to build their own algorithm implementations and submit their binary codes to the system. The submitted code is then systematically benchmarked using a variety of GPU hardware and representative data provided by our system. The developers can also compare their codes with others and generate benchmarking reports. Results: The developed system is fully functional. Through a user-friendly web interface, researchers are able to test various GPU codes. Developers also benefit from this platform by comprehensively benchmarking their codes on various GPU platforms and representative clinical data sets. Conclusion: We have developed an open platform allowing clinical researchers and developers to access GPUs and GPU codes. This development will facilitate the utilization of GPUs in the radiation therapy field.
Quantitative techniques for musculoskeletal MRI at 7 Tesla.

Science.gov (United States)

Bangerter, Neal K; Taylor, Meredith D; Tarbox, Grayson J; Palmer, Antony J; Park, Daniel J

2016-12-01

Whole-body 7 Tesla MRI scanners have been approved solely for research since they appeared on the market over 10 years ago, but may soon be approved for selected clinical neurological and musculoskeletal applications in both the EU and the United States. There has been considerable research work on musculoskeletal applications at 7 Tesla over the past decade, including techniques for ultra-high resolution morphological imaging, 3D T2 and T2* mapping, ultra-short TE applications, diffusion tensor imaging of cartilage, and several techniques for assessing proteoglycan content in cartilage. Most of this work has been done in the knee or other extremities, due to technical difficulties associated with scanning areas such as the hip and torso at 7 Tesla. In this manuscript, we first provide some technical context for 7 Tesla imaging, including challenges and potential advantages. We then review the major quantitative MRI techniques being applied to musculoskeletal applications on 7 Tesla whole-body systems.
Survey of using GPU CUDA programming model in medical image analysis

Directory of Open Access Journals (Sweden)

T. Kalaiselvi

2017-01-01

With technological development in the medical industry, the volume of data to be processed is expanding rapidly, and computation time is also increasing due to factors such as 3D and 4D treatment planning, the increasing sophistication of MRI pulse sequences, and the growing complexity of algorithms. The graphics processing unit (GPU) addresses these problems through features such as high computation throughput, high memory bandwidth, support for floating-point arithmetic, and low cost. Compute unified device architecture (CUDA) is a popular GPU programming model introduced by NVIDIA for parallel computing. This review paper briefly discusses the need for GPU CUDA computing in medical image analysis. The GPU performance of existing algorithms is analyzed and the computational gain is discussed. A few open issues, hardware configurations and optimization principles of existing methods are discussed. The survey concludes with a few optimization techniques for medical imaging algorithms on the GPU. Finally, the limitations and future scope of GPU programming are discussed.
STEM image simulation with hybrid CPU/GPU programming

International Nuclear Information System (INIS)

Yao, Y.; Ge, B.H.; Shen, X.; Wang, Y.G.; Yu, R.C.

2016-01-01

STEM image simulation is achieved via hybrid CPU/GPU programming under parallel algorithm architecture to speed up calculation on a personal computer (PC). To utilize the calculation power of a PC fully, the simulation is performed using the GPU core and multi-CPU cores at the same time to significantly improve efficiency. GaSb and an artificial GaSb/InAs interface with atom diffusion have been used to verify the computation. - Highlights: • STEM image simulation is achieved by hybrid CPU/GPU programming under parallel algorithm architecture to speed up the calculation in the personal computer (PC). • In order to fully utilize the calculation power of the PC, the simulation is performed by GPU core and multi-CPU cores at the same time so efficiency is improved significantly. • GaSb and artificial GaSb/InAs interface with atom diffusion have been used to verify the computation. The results reveal some unintuitive phenomena about the contrast variation with the atom numbers.
STEM image simulation with hybrid CPU/GPU programming

Energy Technology Data Exchange (ETDEWEB)

Yao, Y., E-mail: yaoyuan@iphy.ac.cn; Ge, B.H.; Shen, X.; Wang, Y.G.; Yu, R.C.

2016-07-15

STEM image simulation is achieved via hybrid CPU/GPU programming under parallel algorithm architecture to speed up calculation on a personal computer (PC). To utilize the calculation power of a PC fully, the simulation is performed using the GPU core and multi-CPU cores at the same time to significantly improve efficiency. GaSb and an artificial GaSb/InAs interface with atom diffusion have been used to verify the computation. - Highlights: • STEM image simulation is achieved by hybrid CPU/GPU programming under parallel algorithm architecture to speed up the calculation in the personal computer (PC). • In order to fully utilize the calculation power of the PC, the simulation is performed by GPU core and multi-CPU cores at the same time so efficiency is improved significantly. • GaSb and artificial GaSb/InAs interface with atom diffusion have been used to verify the computation. The results reveal some unintuitive phenomena about the contrast variation with the atom numbers.
An efficient spectral crystal plasticity solver for GPU architectures

Science.gov (United States)

Malahe, Michael

2018-03-01

We present a spectral crystal plasticity (CP) solver for graphics processing unit (GPU) architectures that achieves a tenfold increase in efficiency over prior GPU solvers. The approach makes use of a database containing a spectral decomposition of CP simulations performed using a conventional iterative solver over a parameter space of crystal orientations and applied velocity gradients. The key improvements in efficiency come from reducing global memory transactions, exposing more instruction-level parallelism, reducing integer instructions and performing fast range reductions on trigonometric arguments. The scheme also makes more efficient use of memory than prior work, allowing for larger problems to be solved on a single GPU. We illustrate these improvements with a simulation of 390 million crystal grains on a consumer-grade GPU, which executes at a rate of 2.72 s per strain step.
Parallel generation of architecture on the GPU

KAUST Repository

Steinberger, Markus

2014-05-01

In this paper, we present a novel approach for the parallel evaluation of procedural shape grammars on the graphics processing unit (GPU). Unlike previous approaches that are either limited in the kind of shapes they allow, the amount of parallelism they can take advantage of, or both, our method supports state of the art procedural modeling including stochasticity and context-sensitivity. To increase parallelism, we explicitly express independence in the grammar, reduce inter-rule dependencies required for context-sensitive evaluation, and introduce intra-rule parallelism. Our rule scheduling scheme avoids unnecessary back and forth between CPU and GPU and reduces round trips to slow global memory by dynamically grouping rules in on-chip shared memory. Our GPU shape grammar implementation is multiple orders of magnitude faster than the standard in CPU-based rule evaluation, while offering equal expressive power. In comparison to the state of the art in GPU shape grammar derivation, our approach is nearly 50 times faster, while adding support for geometric context-sensitivity. © 2014 The Author(s) Computer Graphics Forum © 2014 The Eurographics Association and John Wiley & Sons Ltd. Published by John Wiley & Sons Ltd.

New Tesla-Experiment; Neue Tesla-Experimente

Energy Technology Data Exchange (ETDEWEB)

Wahl, Guenter; Harthun, Norbert

2010-07-01

Mysterious Teslar wave, microwave and scalar wave generators are presented, as well as exotic Star Wars experiments like mass accelerators and plasma guns. The third section describes, among others, a tube-driven Tesla generator with 50 cm streamers. The reader will also find a catalogue of Messrs. Information Unlimited, USA, who are providers of many of the kits, circuit diagrams and apparatuses presented here. The main topic of this issue is wireless energy transfer and telecommunication engineering. (orig./GL)
Efficient implementation of the many-body Reactive Bond Order (REBO) potential on GPU

Energy Technology Data Exchange (ETDEWEB)

Trędak, Przemysław, E-mail: przemyslaw.tredak@fuw.edu.pl [Faculty of Physics, University of Warsaw, ul. Pasteura 5, 02-093 Warsaw (Poland); Rudnicki, Witold R. [Institute of Informatics, University of Białystok, ul. Konstantego Ciołkowskiego 1M, 15-245 Białystok (Poland); Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, ul. Pawińskiego 5a, 02-106 Warsaw (Poland); Majewski, Jacek A. [Faculty of Physics, University of Warsaw, ul. Pasteura 5, 02-093 Warsaw (Poland)

2016-09-15

The second generation Reactive Bond Order (REBO) empirical potential is commonly used to accurately model a wide range of hydrocarbon materials. It is also extensible to other atom types and interactions. The REBO potential assumes a complex multi-body interaction model that is difficult to represent efficiently in the SIMD or SIMT programming model. Hence, despite its importance, no efficient GPGPU implementation has been developed for this potential. Here we present a detailed description of a highly efficient GPGPU implementation of a molecular dynamics algorithm using the REBO potential. The presented algorithm takes advantage of rarely used properties of the SIMT architecture of a modern GPU to solve difficult synchronization issues that arise in computations of multi-body potentials. Techniques developed for this problem may also be used to achieve efficient solutions of different problems. The performance of the proposed algorithm is assessed using a range of model systems. It is compared to the highly optimized CPU implementation (both single core and OpenMP) available in the LAMMPS package. These experiments show up to a 6x improvement in force computation time using a single processor of the NVIDIA Tesla K80 compared to a high-end 16-core Intel Xeon processor.
Tesla the life and times of an electric messiah

CERN Document Server

Cawthorne, Nigel

2014-01-01

Despite being incredibly popular during his time, Nikola Tesla today remains largely overlooked among lists of the greatest inventors and scientists of the modern era. Thomas Edison gets all the glory for discovering the light bulb, but it was his one-time assistant and lifelong arch-nemesis, Tesla, who made the breakthrough in alternating current technology. Edison and Tesla carried on a bitter feud for years, but it was Tesla's AC generators that illuminated the 1893 World's Fair in Chicago, the first time that an event of such magnitude had ever taken place under artificial light. Today, all homes and electrical appliances run on Tesla's AC current. Born in Croatia in 1856, Tesla spoke eight languages and almost single-handedly developed household electricity. During his life, he patented more than 700 inventions. He invented electrical generators, FM radio, remote control robots, spark plugs and fluorescent lights. He had a photographic memory and did advanced calculus and physic...
TeSLA pilot 2 pedagogical & quality aspects

OpenAIRE

Janssen, José

2018-01-01

Presentation given at the TeSLA project meeting at the Open University of the Netherlands, addressing pedagogical aspects of pilot 2 and clarification of the scope and limitations of the TeSLA instruments with respect to pedagogy, assessment activity and type of academic dishonesty.
High performance cellular level agent-based simulation with FLAME for the GPU.

Science.gov (United States)

Richmond, Paul; Walker, Dawn; Coakley, Simon; Romano, Daniela

2010-05-01

Driven by the availability of experimental data and ability to simulate a biological scale which is of immediate interest, the cellular scale is fast emerging as an ideal candidate for middle-out modelling. As with 'bottom-up' simulation approaches, cellular level simulations demand a high degree of computational power, which in large-scale simulations can only be achieved through parallel computing. The flexible large-scale agent modelling environment (FLAME) is a template driven framework for agent-based modelling (ABM) on parallel architectures ideally suited to the simulation of cellular systems. It is available for both high performance computing clusters (www.flame.ac.uk) and GPU hardware (www.flamegpu.com) and uses a formal specification technique that acts as a universal modelling format. This not only creates an abstraction from the underlying hardware architectures, but avoids the steep learning curve associated with programming them. In benchmarking tests and simulations of advanced cellular systems, FLAME GPU has reported massive improvement in performance over more traditional ABM frameworks. This allows the time spent in the development and testing stages of modelling to be drastically reduced and creates the possibility of real-time visualisation for simple visual face-validation.
Parallel GPU implementation of PWR reactor burnup

International Nuclear Information System (INIS)

Heimlich, A.; Silva, F.C.; Martinez, A.S.

2016-01-01

Highlights: • Three GPU algorithms are used to evaluate the burn-up in a PWR reactor. • They exhibit speed improvement exceeding 200 times over the sequential solver. • The C++ container is expandable to accept new nuclide chains. - Abstract: This paper surveys three methods, implemented for multi-core CPU and graphics processing unit (GPU), to evaluate the fuel burn-up in a pressurized light water nuclear reactor (PWR) using the solutions of a large system of coupled ordinary differential equations. The reactor physics simulation of a PWR spends a long execution time on burnup calculations, so performance improvement using a GPU can lead to better core design and thus an extended fuel life cycle. The results of this study exhibit speed improvement exceeding 200 times over the sequential solver, within 1% accuracy.
GPU Accelerated Vector Median Filter

Science.gov (United States)

Aras, Rifat; Shen, Yuzhong

2011-01-01

Noise reduction is an important step for most image processing tasks. For three-channel color images, a widely used technique is the vector median filter, in which color values of pixels are treated as 3-component vectors. Vector median filters are computationally expensive; for a window size of n x n, each of the n^2 vectors has to be compared with the other n^2 - 1 vectors in distance. General purpose computation on graphics processing units (GPUs) is the paradigm of utilizing high-performance many-core GPU architectures for computation tasks that are normally handled by CPUs. In this work, NVIDIA's Compute Unified Device Architecture (CUDA) paradigm is used to accelerate vector median filtering, which has, to the best of our knowledge, never been done before. The performance of the GPU-accelerated vector median filter is compared to that of the CPU and MPI-based versions for different image and window sizes. Initial findings of the study showed a 100x improvement in the performance of the vector median filter implementation on GPUs over CPU implementations, and further speed-up is expected after more extensive optimizations of the GPU algorithm.
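The filter described in the abstract above can be sketched on the CPU as follows. This is a minimal reference sketch, not the authors' CUDA code: the function names, the use of Euclidean distance, and the clamping of window coordinates at the image border are assumptions made for illustration. For each window, the output is the input pixel whose summed distance to all other pixels in the window is smallest.

```python
import math
from typing import List, Sequence, Tuple

Pixel = Tuple[float, float, float]  # one 3-component color vector


def vector_median(window: Sequence[Pixel]) -> Pixel:
    """Return the pixel minimizing the sum of distances to all pixels in the window."""
    best, best_cost = window[0], math.inf
    for p in window:
        cost = sum(math.dist(p, q) for q in window)  # compare p against every vector
        if cost < best_cost:
            best, best_cost = p, cost
    return best


def vector_median_filter(image: List[List[Pixel]], n: int = 3) -> List[List[Pixel]]:
    """Apply an n x n vector median filter; border coordinates are clamped (assumption)."""
    h, w = len(image), len(image[0])
    r = n // 2
    out = [row[:] for row in image]
    for y in range(h):
        for x in range(w):
            window = [
                image[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in range(-r, r + 1)
                for dx in range(-r, r + 1)
            ]
            out[y][x] = vector_median(window)
    return out
```

Each output pixel depends only on its own window of the input image, so the two outer loops are independent iterations, which is the property a one-thread-per-pixel GPU mapping would rely on.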
Development of parallel GPU based algorithms for problems in nuclear area

International Nuclear Information System (INIS)

Almeida, Adino Americo Heimlich

2009-01-01

Graphics Processing Units (GPUs) are high performance co-processors originally intended to improve the use and quality of computer graphics applications. Since researchers and practitioners realized the potential of using GPUs for general purpose computing, their application has been extended to fields beyond the scope of computer graphics. The main objective of this work is to evaluate the impact of using GPUs in two typical problems of the nuclear area: neutron transport simulation using the Monte Carlo method, and the solution of the heat equation in a two-dimensional domain by the finite difference method. To achieve this, we develop parallel algorithms for GPU and CPU for the two problems described above. The comparison showed that the GPU-based approach is faster than the CPU in a computer with two quad core processors, without precision loss. (author)
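For the second problem mentioned in the abstract above, a finite difference solver for the two-dimensional heat equation can be sketched as below. The abstract does not say which scheme the author used; this sketch assumes an explicit forward-time, centered-space update on a uniform square grid with fixed boundary values, and the function name and parameters are hypothetical.

```python
from typing import List

Grid = List[List[float]]


def heat_step(u: Grid, alpha: float, dx: float, dt: float) -> Grid:
    """One explicit step of u_t = alpha * (u_xx + u_yy) with fixed boundary values."""
    lam = alpha * dt / (dx * dx)
    if lam > 0.25:
        # Stability limit of the explicit scheme on a uniform 2D grid.
        raise ValueError("unstable step: need alpha*dt/dx^2 <= 0.25")
    ny, nx = len(u), len(u[0])
    new = [row[:] for row in u]  # boundary rows and columns are copied unchanged
    for j in range(1, ny - 1):
        for i in range(1, nx - 1):
            # Five-point discrete Laplacian (scaled by dx^2).
            lap = u[j][i - 1] + u[j][i + 1] + u[j - 1][i] + u[j + 1][i] - 4.0 * u[j][i]
            new[j][i] = u[j][i] + lam * lap
    return new
```

Every interior point is updated from the previous time level only, so all points in one step can be computed independently, which is why this kind of stencil is a common candidate for GPU parallelization.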
Non-Enhanced MR Imaging of Cerebral Arteriovenous Malformations at 7 Tesla.

Science.gov (United States)

Wrede, Karsten H; Dammann, Philipp; Johst, Sören; Mönninghoff, Christoph; Schlamann, Marc; Maderwald, Stefan; Sandalcioglu, I Erol; Ladd, Mark E; Forsting, Michael; Sure, Ulrich; Umutlu, Lale

2016-03-01

To evaluate prospectively 7 Tesla time-of-flight (TOF) magnetic resonance angiography (MRA) and 7 Tesla non-contrast-enhanced magnetization-prepared rapid acquisition gradient-echo (MPRAGE) for delineation of intracerebral arteriovenous malformations (AVMs) in comparison to 1.5 Tesla TOF MRA and digital subtraction angiography (DSA). Twenty patients with single or multifocal AVMs were enrolled in this trial. The study protocol comprised 1.5 and 7 Tesla TOF MRA and 7 Tesla non-contrast-enhanced MPRAGE sequences. All patients underwent an additional four-vessel 3D DSA. Image analysis of the following five AVM features was performed individually by two radiologists on a five-point scale: nidus, feeder(s), draining vein(s), relationship to adjacent vessels, and overall image quality and presence of artefacts. A total of 21 intracerebral AVMs were detected. Both sequences at 7 Tesla were rated superior over 1.5 Tesla TOF MRA in the assessment of all considered AVM features. Image quality at 7 Tesla was comparable with DSA considering both sequences. Inter-observer accordance was good to excellent for the majority of ratings. This study demonstrates excellent image quality for depiction of intracerebral AVMs using non-contrast-enhanced 7 Tesla MRA, comparable with DSA. Assessment of untreated AVMs is a promising clinical application of ultra-high-field MRA. • Non-contrast-enhanced 7 Tesla MRA demonstrates excellent image quality for intracerebral AVM depiction. • Image quality at 7 Tesla was comparable with DSA considering both sequences. • Assessment of intracerebral AVMs is a promising clinical application of ultra-high-field MRA.
GPU Pro 5 advanced rendering techniques

CERN Document Server

Engel, Wolfgang

2014-01-01

In GPU Pro5: Advanced Rendering Techniques, section editors Wolfgang Engel, Christopher Oat, Carsten Dachsbacher, Michal Valient, Wessam Bahnassi, and Marius Bjorge have once again assembled a high-quality collection of cutting-edge techniques for advanced graphics processing unit (GPU) programming. Divided into six sections, the book covers rendering, lighting, effects in image space, mobile devices, 3D engine design, and compute. It explores rasterization of liquids, ray tracing of art assets that would otherwise be used in a rasterized engine, physically based area lights, volumetric light
GPU credit reduced, tie to TMI-1 cheating discounted

International Nuclear Information System (INIS)

Utroska, D.

1981-01-01

The recent reduction of credit available to General Public Utilities (GPU) Nuclear may be linked to a cheating incident involving two reactor operators at the Three Mile Island-1 (TMI-1) reactor. The incident caused the Nuclear Regulatory Commission to reopen the managerial portion of the restart hearings and may delay the restart. The delay and the lower credit line will worsen GPU's financial position. Banks claim that misgivings about TMI-1 influence them more than the cheating, although GPU had been gradually improving its financial situation since the TMI-2 accident. The new agreement gives GPU $150 million in immediate credit, but lowers the interim ceiling from $292 million to $200 million. A spokesman from the Office of Management and Budget acknowledges that administration plans to limit the federal role to research and development softened under political pressure
Efficient Synchronization Primitives for GPUs

OpenAIRE

Stuart, Jeff A.; Owens, John D.

2011-01-01

In this paper, we revisit the design of synchronization primitives---specifically barriers, mutexes, and semaphores---and how they apply to the GPU. Previous implementations are insufficient due to the discrepancies in hardware and programming model of the GPU and CPU. We create new implementations in CUDA and analyze the performance of spinning on the GPU, as well as a method of sleeping on the GPU, by running a set of memory-system benchmarks on two of the most common GPUs in use, the Tesla...
Tesla man out of time

CERN Document Server

Cheney, Margaret

1981-01-01

Called a madman by some, a genius by others, and an enigma by nearly everyone, Nikola Tesla created astonishing, world-transforming devices that were virtually without theoretical precedent. Tesla not only discovered the rotating magnetic field, the basis of most alternating current machinery, but also introduced the fundamentals of robotry, computers, and missile science and helped pave the way for such technologies as satellites, microwaves, beam weapons, and nuclear fusion. Almost supernaturally gifted, Tesla was also unusually erratic, flamboyant, and neurotic. He was J. P. Morgan's client, counted Mark Twain as a friend, and considered Thomas Edison an enemy. But above all, he was the hero and mentor to many of the last century's most famous scientists. In a meticulously researched, engagingly written biography, Margaret Cheney presents the many different dimensions of this extraordinary man, capturing his human qualities and quirks as she chronicles a lifetime of discoveries that continue to alter our ...
Emittance damping considerations for TESLA

International Nuclear Information System (INIS)

Floettmann, K.; Rossbach, J.

1993-03-01

Two schemes are considered to avoid very large damping rings for TESLA. The first (by K.F.) makes use of the linac tunnel to accommodate most of the damping 'ring' structure, which is, in fact, not a ring any more but a long linear structure with two small bends at each of its ends ('dog-bone'). The other scheme (by J.R.) is based on a positron (or electron, respectively) recycling scheme. It makes use of the specific TESLA property that the full bunch train is much longer (240 km) than the linac length. The spent beams are recycled seven times after interaction, thus reducing the number of bunches to be stored in the damping ring by a factor of eight. Ultimately, this scheme can be used to operate TESLA in a storage ring mode ('storage linac'), with no damping ring at all. Finally, a combination of both schemes is considered. (orig.)
A real-time spike sorting method based on the embedded GPU.

Science.gov (United States)

Zelan Yang; Kedi Xu; Xiang Tian; Shaomin Zhang; Xiaoxiang Zheng

2017-07-01

Microelectrode arrays with hundreds of channels have been widely used to acquire neuron population signals in neuroscience studies. Online spike sorting is becoming one of the most important challenges for high-throughput neural signal acquisition systems. The graphics processing unit (GPU), with its high parallel computing capability, might provide an alternative solution for the increasing real-time computational demands of spike sorting. This study reports a method of real-time spike sorting using compute unified device architecture (CUDA), implemented on an embedded GPU (NVIDIA JETSON Tegra K1, TK1). The sorting approach is based on principal component analysis (PCA) and K-means. By analyzing the parallelism of each process, the method was further optimized for the thread memory model of the GPU. Our results showed that the GPU-based classifier on the TK1 is 37.92 times faster than the MATLAB-based classifier on a PC, while their accuracies were the same. The high-performance computing features of the embedded GPU demonstrated in our studies suggest that the embedded GPU provides a promising platform for real-time neural signal processing.
Nikola Tesla: the man behind the magnetic field unit.

Science.gov (United States)

Roguin, Ariel

2004-03-01

The magnetic field strength of both the magnet and gradient coils used in MR imaging equipment is measured in Tesla units, which are named for Nikola Tesla. This article presents the life and achievements of this Serbian-American inventor and researcher who discovered the rotating magnetic field, the basis of most alternating-current machinery. Nikola Tesla had 700 patents in the United States and Europe that covered every aspect of science and technology. Tesla's discoveries include the Tesla coil, AC electrical conduction, improved lighting, newer forms of turbine engines, robotics, fluorescent light, wireless transmission of electrical energy, radio, remote control, discovery of cosmic radio waves, and the use of the ionosphere for scientific purposes. He was a genius whose discoveries had a pivotal role in advancing us into the modern era. Copyright 2004 Wiley-Liss, Inc.
Non-enhanced MR imaging of cerebral arteriovenous malformations at 7 Tesla

International Nuclear Information System (INIS)

Wrede, Karsten H.; Dammann, Philipp; Johst, Soeren; Maderwald, Stefan; Moenninghoff, Christoph; Forsting, Michael; Schlamann, Marc; Sandalcioglu, I.E.; Ladd, Mark E.; Sure, Ulrich; Umutlu, Lale

2016-01-01

To evaluate prospectively 7 Tesla time-of-flight (TOF) magnetic resonance angiography (MRA) and 7 Tesla non-contrast-enhanced magnetization-prepared rapid acquisition gradient-echo (MPRAGE) for delineation of intracerebral arteriovenous malformations (AVMs) in comparison to 1.5 Tesla TOF MRA and digital subtraction angiography (DSA). Twenty patients with single or multifocal AVMs were enrolled in this trial. The study protocol comprised 1.5 and 7 Tesla TOF MRA and 7 Tesla non-contrast-enhanced MPRAGE sequences. All patients underwent an additional four-vessel 3D DSA. Image analysis of the following five AVM features was performed individually by two radiologists on a five-point scale: nidus, feeder(s), draining vein(s), relationship to adjacent vessels, and overall image quality and presence of artefacts. A total of 21 intracerebral AVMs were detected. Both sequences at 7 Tesla were rated superior over 1.5 Tesla TOF MRA in the assessment of all considered AVM features. Image quality at 7 Tesla was comparable with DSA considering both sequences. Inter-observer accordance was good to excellent for the majority of ratings. This study demonstrates excellent image quality for depiction of intracerebral AVMs using non-contrast-enhanced 7 Tesla MRA, comparable with DSA. Assessment of untreated AVMs is a promising clinical application of ultra-high-field MRA. (orig.)
Non-enhanced MR imaging of cerebral arteriovenous malformations at 7 Tesla

Energy Technology Data Exchange (ETDEWEB)

Wrede, Karsten H.; Dammann, Philipp [University Duisburg-Essen, Erwin L. Hahn Institute for Magnetic Resonance Imaging, Essen (Germany); University Hospital Essen, Department of Neurosurgery, Essen (Germany); Johst, Soeren; Maderwald, Stefan [University Duisburg-Essen, Erwin L. Hahn Institute for Magnetic Resonance Imaging, Essen (Germany); Moenninghoff, Christoph; Forsting, Michael [University Hospital Essen, Department of Diagnostic and Interventional Radiology and Neuroradiology, Essen (Germany); Schlamann, Marc [University Hospital Essen, Department of Diagnostic and Interventional Radiology and Neuroradiology, Essen (Germany); University Hospital Giessen, Department of Neuroradiology, Giessen (Germany); Sandalcioglu, I.E. [University Hospital Essen, Department of Neurosurgery, Essen (Germany); Nordstadtkrankenhaus Hannover, Department of Neurosurgery, Hannover (Germany); Ladd, Mark E. [University Duisburg-Essen, Erwin L. Hahn Institute for Magnetic Resonance Imaging, Essen (Germany); University Hospital Essen, Department of Diagnostic and Interventional Radiology and Neuroradiology, Essen (Germany); German Cancer Research Center (DKFZ), Division of Medical Physics in Radiology (E020), Heidelberg (Germany); Sure, Ulrich [University Hospital Essen, Department of Neurosurgery, Essen (Germany); Umutlu, Lale [University Duisburg-Essen, Erwin L. Hahn Institute for Magnetic Resonance Imaging, Essen (Germany); University Hospital Essen, Department of Diagnostic and Interventional Radiology and Neuroradiology, Essen (Germany)

2016-03-15

To evaluate prospectively 7 Tesla time-of-flight (TOF) magnetic resonance angiography (MRA) and 7 Tesla non-contrast-enhanced magnetization-prepared rapid acquisition gradient-echo (MPRAGE) for delineation of intracerebral arteriovenous malformations (AVMs) in comparison to 1.5 Tesla TOF MRA and digital subtraction angiography (DSA). Twenty patients with single or multifocal AVMs were enrolled in this trial. The study protocol comprised 1.5 and 7 Tesla TOF MRA and 7 Tesla non-contrast-enhanced MPRAGE sequences. All patients underwent an additional four-vessel 3D DSA. Image analysis of the following five AVM features was performed individually by two radiologists on a five-point scale: nidus, feeder(s), draining vein(s), relationship to adjacent vessels, and overall image quality and presence of artefacts. A total of 21 intracerebral AVMs were detected. Both sequences at 7 Tesla were rated superior over 1.5 Tesla TOF MRA in the assessment of all considered AVM features. Image quality at 7 Tesla was comparable with DSA considering both sequences. Inter-observer accordance was good to excellent for the majority of ratings. This study demonstrates excellent image quality for depiction of intracerebral AVMs using non-contrast-enhanced 7 Tesla MRA, comparable with DSA. Assessment of untreated AVMs is a promising clinical application of ultra-high-field MRA. (orig.)
The GPU implementation of micro-Doppler period estimation

Science.gov (United States)

Yang, Liyuan; Wang, Junling; Bi, Ran

2018-03-01

To address the high computational complexity and poor real-time performance of wideband radar echo signal processing, this paper designs a program based on a CPU-GPU heterogeneous parallel structure to improve the real-time extraction of micro-motion features. Firstly, we discuss the principle of the micro-Doppler effect generated by the rolling of the scattering points on an orbiting satellite, and analyse how to use a Kalman filter to compensate for the translational motion of a tumbling satellite and how to use joint time-frequency analysis and the inverse Radon transform to extract the micro-motion features from the echo after compensation. Secondly, the advantages of the GPU in terms of real-time processing and the working principle of CPU-GPU heterogeneous parallelism are analysed, and a GPU-based program flow for extracting the micro-motion features from the radar echo signal of a rolling satellite is designed. Finally, the extraction results are given to verify the correctness of the program and algorithm.
GPU Pro 4 advanced rendering techniques

CERN Document Server

Engel, Wolfgang

2013-01-01

GPU Pro4: Advanced Rendering Techniques presents ready-to-use ideas and procedures that can help solve many of your day-to-day graphics programming challenges. Focusing on interactive media and games, the book covers up-to-date methods producing real-time graphics. Section editors Wolfgang Engel, Christopher Oat, Carsten Dachsbacher, Michal Valient, Wessam Bahnassi, and Sebastien St-Laurent have once again assembled a high-quality collection of cutting-edge techniques for advanced graphics processing unit (GPU) programming. Divided into six sections, the book begins with discussions on the abi

Application of GPU to Multi-interfaces Advection and Reconstruction Solver (MARS)

International Nuclear Information System (INIS)

Nagatake, Taku; Takase, Kazuyuki; Kunugi, Tomoaki

2010-01-01

In the nuclear engineering field, high performance computer systems are necessary to perform large scale computations. Recently, the Graphics Processing Unit (GPU) has been developed as a rendering computational system in order to reduce the Central Processing Unit (CPU) load. In graphics processing, high performance computing is needed to render high-quality 3D objects in video games. Thus the GPU has many processing units and a wide memory bandwidth. In this study, the Multi-interfaces Advection and Reconstruction Solver (MARS), which is one of the interface volume tracking methods for multi-phase flows, has been run on the GPU. The multi-phase flow computation is very important for nuclear reactors and other engineering fields. The MARS consists of two computing parts: the interface tracking part and the fluid motion computing part. For the interface tracking part, the GPU (GTX280) was 6 times faster than the CPU (Dual-Xeon 5040), and in the fluid motion computing part the Poisson solver on the GPU (GTX285) was 22 times faster than that on the CPU (Core i7). For the Dam Breaking Problem, the result of GPU-MARS differed slightly from the experimental result. Because the GPU-MARS was developed using a single-precision GPU, it is considered that round-off error might have accumulated. (author)
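The remark about accumulated single-precision round-off can be illustrated with a small, self-contained experiment. This is a generic sketch, not the MARS solver: the helper names `to_float32` and `accumulate` are hypothetical, and single precision is emulated on the CPU by rounding each partial sum to a 32-bit float with the standard `struct` module.

```python
import struct


def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(step, n, single_precision):
    """Add `step` to a running total `n` times.

    With single_precision=True the step and every partial sum are rounded
    to 32 bits, emulating a single-precision accumulator.
    """
    if single_precision:
        step = to_float32(step)
    total = 0.0
    for _ in range(n):
        total += step
        if single_precision:
            total = to_float32(total)
    return total


# Sum 0.1 one hundred thousand times; the exact answer is 10000.
err32 = abs(accumulate(0.1, 100_000, True) - 10000.0)
err64 = abs(accumulate(0.1, 100_000, False) - 10000.0)
print("single-precision error:", err32)
print("double-precision error:", err64)
```

The single-precision error is many orders of magnitude larger than the double-precision one, because each addition rounds the growing total to a coarser grid. A long time-stepping simulation repeats this pattern many times per cell.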
High-speed detection of emergent market clustering via an unsupervised parallel genetic algorithm

Directory of Open Access Journals (Sweden)

Dieter Hendricks

2016-02-01

We implement a master-slave parallel genetic algorithm with a bespoke log-likelihood fitness function to identify emergent clusters within price evolutions. We use graphics processing units (GPUs) to implement a parallel genetic algorithm and visualise the results using disjoint minimal spanning trees. We demonstrate that our GPU parallel genetic algorithm, implemented on a commercially available general purpose GPU, is able to recover stock clusters in sub-second speed, based on a subset of stocks in the South African market. This approach represents a pragmatic choice for low-cost, scalable parallel computing and is significantly faster than a prototype serial implementation in an optimised C-based fourth-generation programming language, although the results are not directly comparable because of compiler differences. Combined with fast online intraday correlation matrix estimation from high frequency data for cluster identification, the proposed implementation offers cost-effective, near-real-time risk assessment for financial practitioners.
GPU Computing in Bayesian Inference of Realized Stochastic Volatility Model

International Nuclear Information System (INIS)

Takaishi, Tetsuya

2015-01-01

The realized stochastic volatility (RSV) model, which utilizes the realized volatility as additional information, has been proposed to infer the volatility of financial time series. We consider the Bayesian inference of the RSV model by the Hybrid Monte Carlo (HMC) algorithm. The HMC algorithm can be parallelized and thus performed on the GPU for speedup. The GPU code is developed with CUDA Fortran. We compare the computational time of the HMC algorithm on a GPU (GTX 760) and a CPU (Intel i7-4770 3.4GHz) and find that the GPU can be up to 17 times faster than the CPU. We also code the program with OpenACC and find that appropriate coding can achieve a speedup similar to that of CUDA Fortran.
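For readers unfamiliar with HMC, the algorithm's core loop (momentum refresh, leapfrog integration, Metropolis accept/reject) can be sketched as below. This is a toy, serial, one-dimensional version under stated assumptions: the target is a standard normal rather than the RSV posterior, and the function name `hmc_sample`, the step size, and the number of leapfrog steps are illustrative choices, not values from the paper.

```python
import math
import random


def hmc_sample(log_prob, grad_log_prob, q0, n_samples,
               step=0.2, n_leapfrog=8, seed=0):
    """Draw samples from a 1-D density with Hybrid (Hamiltonian) Monte Carlo."""
    rng = random.Random(seed)
    q = q0
    samples = []
    for _ in range(n_samples):
        p = rng.gauss(0.0, 1.0)            # refresh the momentum
        q_new, p_new = q, p
        # Leapfrog integration of Hamilton's equations.
        p_new += 0.5 * step * grad_log_prob(q_new)
        for i in range(n_leapfrog):
            q_new += step * p_new
            if i < n_leapfrog - 1:
                p_new += step * grad_log_prob(q_new)
        p_new += 0.5 * step * grad_log_prob(q_new)
        # Metropolis test on the change in total energy H = -log p(q) + p^2/2.
        h_old = -log_prob(q) + 0.5 * p * p
        h_new = -log_prob(q_new) + 0.5 * p_new * p_new
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            q = q_new
        samples.append(q)
    return samples


# Toy target (assumption): standard normal, log p(q) = -q^2 / 2 up to a constant.
draws = hmc_sample(lambda q: -0.5 * q * q, lambda q: -q, q0=0.0, n_samples=2000)
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
```

In a model with many latent volatility variables, the gradient and the leapfrog updates act on each variable largely independently, which is what makes the method a candidate for GPU parallelization.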
1.5 versus 3 versus 7 Tesla in abdominal MRI: A comparative study.

Science.gov (United States)

Laader, Anja; Beiderwellen, Karsten; Kraff, Oliver; Maderwald, Stefan; Wrede, Karsten; Ladd, Mark E; Lauenstein, Thomas C; Forsting, Michael; Quick, Harald H; Nassenstein, Kai; Umutlu, Lale

2017-01-01

The aim of this study was to investigate and compare the feasibility as well as potential impact of altered magnetic field properties on image quality and potential artifacts of 1.5 Tesla, 3 Tesla and 7 Tesla non-enhanced abdominal MRI. Magnetic Resonance (MR) imaging of the upper abdomen was performed in 10 healthy volunteers on a 1.5 Tesla, a 3 Tesla and a 7 Tesla MR system. The study protocol comprised a (1) T1-weighted fat-saturated spoiled gradient-echo sequence (2D FLASH), (2) T1-weighted fat-saturated volumetric interpolated breath hold examination sequence (3D VIBE), (3) T1-weighted 2D in and opposed phase sequence, (4) True fast imaging with steady-state precession sequence (TrueFISP) and (5) T2-weighted turbo spin-echo (TSE) sequence. For comparison reasons field of view and acquisition times were kept comparable for each correlating sequence at all three field strengths, while trying to achieve the highest possible spatial resolution. Qualitative and quantitative analyses were tested for significant differences. While 1.5 and 3 Tesla MRI revealed comparable results in all assessed features and sequences, 7 Tesla MRI yielded considerable differences in T1 and T2 weighted imaging. Benefits of 7 Tesla MRI encompassed a higher spatial resolution and a non-enhanced hyperintense vessel signal, potentially offering a more accurate diagnosis of abdominal parenchymatous and vasculature disease. 7 Tesla MRI was also shown to be more impaired by artifacts, including residual B1 inhomogeneities, susceptibility and chemical shift artifacts, resulting in reduced overall image quality and overall image impairment ratings. While 1.5 and 3 Tesla T2w imaging showed equivalently high image quality, 7 Tesla revealed strong impairments in its diagnostic value. Our results demonstrate the feasibility and overall comparable imaging ability of T1-weighted 7 Tesla abdominal MRI towards 3 Tesla and 1.5 Tesla MRI, yielding a promising diagnostic potential for
Multi-GPU hybrid programming accelerated three-dimensional phase-field model in binary alloy

Directory of Open Access Journals (Sweden)

Changsheng Zhu

2018-03-01

In dendritic growth simulation, computational efficiency and problem scale have an extremely important influence on the simulation efficiency of the three-dimensional phase-field model. Thus, seeking a high performance calculation method to improve the computational efficiency and to expand the problem scale is of great significance to research on the microstructure of materials. A high performance calculation method based on the MPI+CUDA hybrid programming model is introduced. Multiple GPUs are used to implement quantitative numerical simulations of the three-dimensional phase-field model in a binary alloy under the condition of multi-physical processes coupling. The acceleration effect of different GPU nodes on different calculation scales is explored. On the foundation of the multi-GPU calculation model that has been introduced, two optimization schemes, non-blocking communication optimization and overlap of MPI and GPU computing optimization, are proposed. The results of the two optimization schemes and the basic multi-GPU model are compared. The calculation results show that the multi-GPU calculation model clearly improves the computational efficiency of the three-dimensional phase-field model, reaching 13 times that of a single GPU, and the problem scale has been expanded to 8193. The feasibility of the two optimization schemes is shown, and the overlap of MPI and GPU computing optimization performs better, reaching 1.7 times that of the basic multi-GPU model when 21 GPUs are used.
Tesla Coil Theoretical Model and its Experimental Verification

OpenAIRE

Voitkans Janis; Voitkans Arnis

2014-01-01

In this paper a theoretical model of Tesla coil operation is proposed. Tesla coil is described as a long line with distributed parameters in a single-wire form, where the line voltage is measured across electrically neutral space. By applying the principle of equivalence of single-wire and two-wire schemes an equivalent two-wire scheme can be found for a single-wire scheme and the already known long line theory can be applied to the Tesla coil. A new method of multiple re...
High performance GPU processing for inversion using uniform grid searches

Science.gov (United States)

Venetis, Ioannis E.; Saltogianni, Vasso; Stiros, Stathis; Gallopoulos, Efstratios

2017-04-01

Many geophysical problems are described by redundant, highly non-linear systems of ordinary equations with constant terms deriving from measurements and hence representing stochastic variables. Solution (inversion) of such problems is based on numerical optimization methods, on Monte Carlo sampling, or on exhaustive searches in cases of two or even three "free" unknown variables. Recently the TOPological INVersion (TOPINV) algorithm, a grid search-based technique in the R^n space, has been proposed. TOPINV is not based on the minimization of a certain cost function and involves only forward computations, hence avoiding computational errors. The basic concept is to transform observation equations into inequalities on the basis of an optimization parameter k and of their standard errors, and through repeated "scans" of n-dimensional search grids for decreasing values of k to identify the optimal clusters of gridpoints which satisfy observation inequalities and by definition contain the "true" solution. Stochastic optimal solutions and their variance-covariance matrices are then computed as first and second statistical moments. Such exhaustive uniform searches produce an excessive computational load and are extremely time consuming for common computers based on a CPU. An alternative is to use a computing platform based on a GPU, which nowadays is affordable to the research community and provides a much higher computing performance. Using the CUDA programming language to implement TOPINV allows the investigation of the attained speedup in execution time on such a high performance platform. Based on synthetic data we compared the execution time required for two typical geophysical problems, modeling magma sources and seismic faults, described with up to 18 unknown variables, on both CPU/FORTRAN and GPU/CUDA platforms. The same problems for several different sizes of search grids (up to 10^12 gridpoints) and numbers of unknown variables were solved on
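The scan-and-moments idea described in this abstract can be sketched in a few lines. This is a hedged toy version, not the TOPINV code: the two-unknown "source location from three distance measurements" problem, the station coordinates, the standard errors, and the names `scan` and `moments` are all assumptions made for illustration.

```python
import math


def scan(forward, obs, sigmas, grid, k):
    """Keep the gridpoints satisfying |f_i(x) - l_i| <= k * sigma_i for every i."""
    return [
        p for p in grid
        if all(abs(f - l) <= k * s for f, l, s in zip(forward(p), obs, sigmas))
    ]


def moments(points):
    """First and second statistical moments of the accepted gridpoints."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(dim)]
    cov = [[sum((p[a] - mean[a]) * (p[b] - mean[b]) for p in points) / n
            for b in range(dim)] for a in range(dim)]
    return mean, cov


# Hypothetical forward model: distances from a point to three fixed stations.
stations = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]


def forward(p):
    return [math.hypot(p[0] - sx, p[1] - sy) for sx, sy in stations]


obs = forward((3.0, 4.0))      # synthetic, noise-free observations
sigmas = [0.1, 0.1, 0.1]       # assumed standard errors
grid = [(i * 0.5, j * 0.5) for i in range(21) for j in range(21)]

accepted = []
for k in (4.0, 2.0, 1.0):      # repeated scans for decreasing k
    hits = scan(forward, obs, sigmas, grid, k)
    if hits:
        accepted = hits        # keep the tightest non-empty cluster
mean, cov = moments(accepted)
```

Each gridpoint is tested independently of all the others, so the scan is an embarrassingly parallel workload of the kind the abstract offloads to the GPU.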
Fast-GPU-PCC: A GPU-Based Technique to Compute Pairwise Pearson's Correlation Coefficients for Time Series Data-fMRI Study.

Science.gov (United States)

Eslami, Taban; Saeed, Fahad

2018-04-20

Functional magnetic resonance imaging (fMRI) is a non-invasive brain imaging technique, which has been regularly used for studying the brain's functional activities in the past few years. A widely used measure for capturing functional associations in the brain is Pearson's correlation coefficient. Pearson's correlation is widely used for constructing functional networks and studying dynamic functional connectivity of the brain. These are useful measures for understanding the effects of brain disorders on connectivities among brain regions. fMRI scanners produce a huge number of voxels, and using traditional central processing unit (CPU)-based techniques for computing pairwise correlations is very time consuming, especially when a large number of subjects are being studied. In this paper, we propose a graphics processing unit (GPU)-based algorithm called Fast-GPU-PCC for computing pairwise Pearson's correlation coefficients. Based on the symmetric property of Pearson's correlation, this approach returns the N(N − 1)/2 correlation coefficients located in the strictly upper triangular part of the correlation matrix. Storing correlations in a one-dimensional array with the order proposed in this paper is useful for further usage. Our experiments on real and synthetic fMRI data for different numbers of voxels and varying lengths of time series show that the proposed approach outperformed state-of-the-art GPU-based techniques as well as the sequential CPU-based versions. We show that Fast-GPU-PCC runs 62 times faster than the CPU-based version and about 2 to 3 times faster than two other state-of-the-art GPU-based methods.
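Storing only the strictly upper triangle of a symmetric correlation matrix in a flat array can be sketched as follows. This is a serial CPU illustration, not Fast-GPU-PCC itself: the row-major ordering and the names `pearson`, `flat_index` and `pairwise_upper` are assumptions, since the abstract does not spell out the ordering the paper uses.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def flat_index(i, j, n):
    """Position of pair (i, j), i < j, in a row-major upper-triangle array."""
    return i * n - i * (i + 1) // 2 + (j - i - 1)


def pairwise_upper(series):
    """Return the N(N-1)/2 correlations of the strictly upper triangle."""
    n = len(series)
    out = [0.0] * (n * (n - 1) // 2)
    for i in range(n):
        for j in range(i + 1, n):
            out[flat_index(i, j, n)] = pearson(series[i], series[j])
    return out


# Three toy "voxel" time series (made-up data).
voxels = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
corr = pairwise_upper(voxels)   # order: (0,1), (0,2), (1,2)
```

Skipping the diagonal and the lower triangle roughly halves both the work and the memory, which matters when N is the number of voxels in a scan.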
Incompressible SPH (ISPH) with fast Poisson solver on a GPU

Science.gov (United States)

Chow, Alex D.; Rogers, Benedict D.; Lind, Steven J.; Stansby, Peter K.

2018-05-01

This paper presents a fast incompressible SPH (ISPH) solver implemented to run entirely on a graphics processing unit (GPU) capable of simulating several millions of particles in three dimensions on a single GPU. The ISPH algorithm is implemented by converting the highly optimised open-source weakly-compressible SPH (WCSPH) code DualSPHysics to run ISPH on the GPU, combining it with the open-source linear algebra library ViennaCL for fast solutions of the pressure Poisson equation (PPE). Several challenges are addressed with this research: constructing a PPE matrix every timestep on the GPU for moving particles, optimising the limited GPU memory, and exploiting fast matrix solvers. The ISPH pressure projection algorithm is implemented as 4 separate stages, each with a particle sweep, including an algorithm for the population of the PPE matrix suitable for the GPU, and mixed precision storage methods. An accurate and robust ISPH boundary condition ideal for parallel processing is also established by adapting an existing WCSPH boundary condition for ISPH. A variety of validation cases are presented: an impulsively started plate, incompressible flow around a moving square in a box, and dambreaks (2-D and 3-D) which demonstrate the accuracy, flexibility, and speed of the methodology. Fragmentation of the free surface is shown to influence the performance of matrix preconditioners and therefore the PPE matrix solution time. The Jacobi preconditioner demonstrates robustness and reliability in the presence of fragmented flows. For a dambreak simulation, GPU speed-ups of 10-18 times and 1.1-4.5 times are demonstrated compared with single-threaded and 16-threaded CPU run times, respectively.
[Studies of three-dimensional cardiac late gadolinium enhancement MRI at 3.0 Tesla].

Science.gov (United States)

Ishimoto, Takeshi; Ishihara, Masaru; Ikeda, Takayuki; Kawakami, Momoe

2008-12-20

Cardiac late gadolinium enhancement MR imaging has been shown to allow assessment of myocardial viability in patients with ischemic heart disease. The current standard approach is a 3D inversion recovery sequence at 1.5 Tesla. The aims of this study were to evaluate the technical feasibility and clinical utility of MR viability imaging at 3.0 Tesla in patients with myocardial infarction and cardiomyopathy. In phantom and volunteer studies, the inversion time required to suppress the signal of the tissues of interest was prolonged at 3.0 Tesla. In the clinical study, the average inversion time required to suppress the myocardial signal, measured 15 min after administration of the contrast agent, was longer at 3.0 Tesla than at 1.5 Tesla (304.0±29.2 at 3.0 Tesla vs. 283.9±20.9 at 1.5 Tesla). The contrast between infarction and viable myocardium was equal at both field strengths (4.06±1.30 at 3.0 Tesla vs. 4.42±1.85 at 1.5 Tesla). Even at this early stage, MR viability imaging at 3.0 Tesla provides high quality images in patients with myocardial infarction. The inversion time is significantly prolonged at 3.0 Tesla. The contrast between infarction and viable myocardium at 3.0 Tesla is equal to that at 1.5 Tesla. Further investigation is needed for technical improvement, for clinical evaluation, and to establish limitations.
GAMUT: GPU accelerated microRNA analysis to uncover target genes through CUDA-miRanda

Science.gov (United States)

2014-01-01

Background: Non-coding sequences such as microRNAs have important roles in disease processes. Computational microRNA target identification (CMTI) is becoming increasingly important since traditional experimental methods for target identification pose many difficulties. These methods are time-consuming, costly, and often need guidance from computational methods to narrow down candidate genes anyway. However, most CMTI methods are computationally demanding, since they need to handle not only several million query microRNA and reference RNA pairs, but also several million nucleotide comparisons within each given pair. Thus, the need to perform microRNA identification at such large scale has increased the demand for parallel computing. Methods: Although most CMTI programs (e.g., the miRanda algorithm) are based on a modified Smith-Waterman (SW) algorithm, the existing parallel SW implementations (e.g., CUDASW++ 2.0/3.0, SWIPE) are unable to meet this demand in CMTI tasks. We present CUDA-miRanda, a fast microRNA target identification algorithm that takes advantage of massively parallel computing on Graphics Processing Units (GPU) using NVIDIA's Compute Unified Device Architecture (CUDA). CUDA-miRanda specifically focuses on the local alignment of short (i.e., ≤ 32 nucleotides) sequences against longer reference sequences (e.g., 20K nucleotides). Moreover, the proposed algorithm is able to report multiple alignments (up to 191 top scores) and the corresponding traceback sequences for any given (query sequence, reference sequence) pair. Results: Speeds over 5.36 Giga Cell Updates Per Second (GCUPs) are achieved on a server with 4 NVIDIA Tesla M2090 GPUs. Compared to the original miRanda algorithm, which is evaluated on an Intel Xeon E5620@2.4 GHz CPU, the experimental results show up to 166 times performance gains in terms of execution time. In addition, we have verified that the exact same targets were predicted in both CUDA-miRanda and the original mi
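The Smith-Waterman local alignment that this record builds on can be sketched as a score-only dynamic program. This is the textbook algorithm with a linear gap penalty, not miRanda's modified scoring (which the abstract does not detail) and not the CUDA kernel; the function name `smith_waterman_score` and the scoring values are illustrative assumptions.

```python
def smith_waterman_score(query, ref, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of `query` against `ref`.

    Only two rows of the dynamic-programming table are kept, so memory use
    is proportional to len(ref). Each inner-loop iteration is one "cell
    update", the unit counted by the GCUPS throughput metric.
    """
    prev = [0] * (len(ref) + 1)
    best = 0
    for q in query:
        curr = [0]
        for j, r in enumerate(ref, start=1):
            diag = prev[j - 1] + (match if q == r else mismatch)
            score = max(0,                  # start a new local alignment
                        diag,               # align q with r
                        prev[j] + gap,      # gap in the reference
                        curr[j - 1] + gap)  # gap in the query
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best
```

For example, `smith_waterman_score("ACGU", "GGACGUCC")` aligns the four-letter query exactly inside the reference. The table for one pair has len(query) × len(ref) cells, which is where the "several million nucleotide comparisons within each given pair" in the abstract come from.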
Nikola Tesla, the Ether and his Telautomaton

Science.gov (United States)

Milar, Kendall

2014-03-01

In the nineteenth century physicists' understanding of the ether changed dramatically. New developments in thermodynamics, energy physics, and electricity and magnetism dictated new properties of the ether. These have traditionally been examined from the perspective of the scientists re-conceptualizing the ether. However Nikola Tesla, a prolific inventor and writer, presents a different picture of nineteenth century physics. Alongside the displays that showcased his inventions he presented alternative interpretations of physical, physiological and even psychical research. This is particularly evident in his telautomaton, a radio remote controlled boat. This invention and Tesla's descriptions of it showcase some of his novel interpretations of physical theories. He offered a perspective on nineteenth century physics that focused on practical application instead of experiment. Sometimes the understanding of physical theories that Tesla reached was counterproductive to his own inventive work; other times he offered new insights. Tesla's utilitarian interpretation of physical theories suggests a more scientifically curious and invested inventor than previously described and a connection between the scientific and inventive communities.
GPU-based Parallel Application Design for Emerging Mobile Devices

Science.gov (United States)

Gupta, Kshitij

A revolution is underway in the computing world that is causing a fundamental paradigm shift in device capabilities and form-factor, with a move from well-established legacy desktop/laptop computers to mobile devices in varying sizes and shapes. Amongst all the tasks these devices must support, graphics has emerged as the 'killer app' for providing a fluid user interface and high-fidelity game rendering, effectively making the graphics processor (GPU) one of the key components in (present and future) mobile systems. By utilizing the GPU as a general-purpose parallel processor, this dissertation explores the GPU computing design space from an applications standpoint, in the mobile context, by focusing on key challenges presented by these devices---limited compute, memory bandwidth, and stringent power consumption requirements---while improving the overall application efficiency of the increasingly important speech recognition workload for mobile user interaction. We broadly partition trends in GPU computing into four major categories. We analyze hardware and programming model limitations in current-generation GPUs and detail an alternate programming style called Persistent Threads, identify four use case patterns, and propose minimal modifications that would be required for extending native support. We show how by manually extracting data locality and altering the speech recognition pipeline, we are able to achieve significant savings in memory bandwidth while simultaneously reducing the compute burden on GPU-like parallel processors. As we foresee GPU computing to evolve from its current 'co-processor' model into an independent 'applications processor' that is capable of executing complex work independently, we create an alternate application framework that enables the GPU to handle all control-flow dependencies autonomously at run-time while minimizing host involvement to just issuing commands, that facilitates an efficient application implementation. Finally, as
Accelerating Spaceborne SAR Imaging Using Multiple CPU/GPU Deep Collaborative Computing

Directory of Open Access Journals (Sweden)

Fan Zhang

2016-04-01

With the development of synthetic aperture radar (SAR) technologies in recent years, the huge amount of remote sensing data brings challenges for real-time imaging processing. Therefore, high performance computing (HPC) methods have been presented to accelerate SAR imaging, especially the GPU based methods. In the classical GPU based imaging algorithm, GPU is employed to accelerate image processing by massive parallel computing, and CPU is only used to perform the auxiliary work such as data input/output (IO). However, the computing capability of CPU is ignored and underestimated. In this work, a new deep collaborative SAR imaging method based on multiple CPU/GPU is proposed to achieve real-time SAR imaging. Through the proposed tasks partitioning and scheduling strategy, the whole image can be generated with deep collaborative multiple CPU/GPU computing. In the part of CPU parallel imaging, the advanced vector extension (AVX) method is firstly introduced into the multi-core CPU parallel method for higher efficiency. As for the GPU parallel imaging, not only the bottlenecks of memory limitation and frequent data transferring are broken, but also kinds of optimized strategies are applied, such as streaming, parallel pipeline and so on. Experimental results demonstrate that the deep CPU/GPU collaborative imaging method enhances the efficiency of SAR imaging on single-core CPU by 270 times and realizes the real-time imaging in that the imaging rate outperforms the raw data generation rate.
Accelerating Spaceborne SAR Imaging Using Multiple CPU/GPU Deep Collaborative Computing.

Science.gov (United States)

Zhang, Fan; Li, Guojun; Li, Wei; Hu, Wei; Hu, Yuxin

2016-04-07

With the development of synthetic aperture radar (SAR) technologies in recent years, the huge amount of remote sensing data brings challenges for real-time imaging processing. Therefore, high performance computing (HPC) methods have been presented to accelerate SAR imaging, especially the GPU based methods. In the classical GPU based imaging algorithm, GPU is employed to accelerate image processing by massive parallel computing, and CPU is only used to perform the auxiliary work such as data input/output (IO). However, the computing capability of CPU is ignored and underestimated. In this work, a new deep collaborative SAR imaging method based on multiple CPU/GPU is proposed to achieve real-time SAR imaging. Through the proposed tasks partitioning and scheduling strategy, the whole image can be generated with deep collaborative multiple CPU/GPU computing. In the part of CPU parallel imaging, the advanced vector extension (AVX) method is firstly introduced into the multi-core CPU parallel method for higher efficiency. As for the GPU parallel imaging, not only the bottlenecks of memory limitation and frequent data transferring are broken, but also kinds of optimized strategies are applied, such as streaming, parallel pipeline and so on. Experimental results demonstrate that the deep CPU/GPU collaborative imaging method enhances the efficiency of SAR imaging on single-core CPU by 270 times and realizes the real-time imaging in that the imaging rate outperforms the raw data generation rate.
TESLA-N electron scattering with polarized targets at TESLA

International Nuclear Information System (INIS)

Korotokov, V.

2001-01-01

Measurements of polarized eN scattering can be realized at the TESLA linear collider facility at DESY with luminosities that are about two orders of magnitude higher than those expected for other experiments at comparable energies. A large variety of polarized parton distribution and fragmentation functions can be determined with unprecedented accuracy, many of them for the first time
Computing OpenSURF on OpenCL and General Purpose GPU

Directory of Open Access Journals (Sweden)

Wanglong Yan

2013-10-01

The Speeded-Up Robust Feature (SURF) algorithm is widely used for image feature detection and matching in computer vision. Open Computing Language (OpenCL) is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors. This paper describes how to implement an open-source SURF program, namely OpenSURF, on a general purpose GPU with OpenCL, and discusses the optimizations in terms of the thread architectures and memory models in detail. Our final OpenCL implementation of OpenSURF is on average 37% and 64% faster than the OpenCV SURF v2.4.5 CUDA implementation on NVidia's GTX660 and GTX460SE GPUs, respectively. Our OpenCL program achieved real-time performance (>25 Frames Per Second) for almost all the input images with different sizes from 320*240 to 1024*768 on NVidia's GTX660 GPU, NVidia's GTX460SE GPU and AMD's Radeon HD 6850 GPU. Our OpenCL approach on NVidia's GTX660 GPU is on average more than 22.8 times faster than its original CPU version on Intel's Dual-Core E5400 2.7G.
Tesla-transformer-type electron beam accelerator

International Nuclear Information System (INIS)

Liu Jinliang; Zhong Huihuang; Tan Qimei; Li Chuanlu; Zhang Jiande

2002-01-01

An electron-beam Tesla-transformer accelerator is described. It consists of the primary energy storage system, a Tesla transformer, an oil Blumlein pulse forming line, and a vacuum diode. Initial experiments showed that the diode voltage rises to about 500 kV with an input of 20 kV, the maximum electron-beam current is about 9 kA, and the pulse width is about 50 ns. This device can operate stably and be set up easily
Hardware Acceleration of Sparse Cognitive Algorithms

Science.gov (United States)

2016-05-01

Figure 2: Kernel Run Times on an NVidia GTX750 Using the NVidia Profiler ... Figure 3: 2D Implementation ... Demonstrated Results Notes: (1) Server class GPU, NVidia Tesla K20c. (2) Consumer grade GPU, GeForce GT 640. Factor GPU Baseline SIMD with PiM ASIC ... times on an NVidia GTX750 using the NVidia profiler.
The performances of R GPU implementations of the GMRES method

Directory of Open Access Journals (Sweden)

Bogdan Oancea

2018-03-01

Although the performance of commodity computers has improved drastically with the introduction of multicore processors and GPU computing, the standard R distribution is still based on a single-threaded model of computation, using only a small fraction of the computational power now available in most desktops and laptops. Modern statistical software packages rely on high performance implementations of the linear algebra routines that are at the core of several important leading-edge statistical methods. In this paper we present a GPU implementation of the GMRES iterative method for solving linear systems. We compare the performance of this implementation with a pure single-threaded CPU version. We also investigate the performance of our implementation using different GPU packages now available for R, such as gmatrix, gputools or gpuR, which are based on the CUDA or OpenCL frameworks.

gPGA: GPU Accelerated Population Genetics Analyses.

Directory of Open Access Journals (Sweden)

Chunbao Zhou

The isolation with migration (IM) model is important for studies in population genetics and phylogeography. The IM program applies the IM model to genetic data drawn from a pair of closely related populations or species based on Markov chain Monte Carlo (MCMC) simulations of gene genealogies. However, the computational burden of the IM program has placed limits on its application. With their strong computational power, Graphics Processing Units (GPUs) have been widely used in many fields. In this article, we present an effective implementation of the IM program on one GPU based on the Compute Unified Device Architecture (CUDA), which we call gPGA. Compared with the IM program, gPGA can achieve up to 52.30X speedup on one GPU. The evaluation results demonstrate that it allows datasets to be analyzed effectively and rapidly for research on divergence population genetics. The software is freely available with source code at https://github.com/chunbaozhou/gPGA.
GPU TECHNOLOGIES EMBODIED IN PARALLEL SOLVERS OF LINEAR ALGEBRAIC EQUATION SYSTEMS

Directory of Open Access Journals (Sweden)

Sidorov Alexander Vladimirovich

2012-10-01

The author reviews existing shareware solvers that run on graphics processing devices. The purpose of this review is to explore the opportunities and limitations of these parallel solvers for the linear algebraic problems that arise at the Research and Educational Centre of Computer Modeling at MSUCE and the Research and Engineering Centre STADYO. The author has explored new applications of the GPU in the PETSc suite and compared them with the results generated without the GPU. The research is performed within the CUSP library, developed to solve linear algebra problems on the GPU. The author has also reviewed the new MAGMA project, which is analogous to LAPACK for the GPU.
Optimizing a mobile robot control system using GPU acceleration

Science.gov (United States)

Tuck, Nat; McGuinness, Michael; Martin, Fred

2012-01-01

This paper describes our attempt to optimize a robot control program for the Intelligent Ground Vehicle Competition (IGVC) by running computationally intensive portions of the system on a commodity graphics processing unit (GPU). The IGVC Autonomous Challenge requires a control program that performs a number of different computationally intensive tasks ranging from computer vision to path planning. For the 2011 competition our Robot Operating System (ROS) based control system would not run comfortably on the multicore CPU on our custom robot platform. The process of profiling the ROS control program and selecting appropriate modules for porting to run on a GPU is described. A GPU-targeting compiler, Bacon, is used to speed up development and help optimize the ported modules. The impact of the ported modules on overall performance is discussed. We conclude that GPU optimization can free a significant amount of CPU resources with minimal effort for expensive user-written code, but that replacing heavily-optimized library functions is more difficult, and a much less efficient use of time.
GPU's for event reconstruction in the FairRoot framework

International Nuclear Information System (INIS)

Al-Turany, M; Uhlig, F; Karabowicz, R

2010-01-01

FairRoot is the simulation and analysis framework used by the CBM and PANDA experiments at FAIR/GSI. The use of graphics processing units (GPUs) for event reconstruction in FairRoot will be presented. The fact that the CUDA (Nvidia's Compute Unified Device Architecture) development tools work alongside the conventional C/C++ compiler makes it possible to mix GPU code with general-purpose code for the host CPU. Based on this, some of the reconstruction tasks can be sent to the graphics cards. Moreover, tasks that run on the GPUs can also run in emulation mode on the host CPU, which has the advantage that the same code is used on both CPU and GPU.
GPU-Accelerated Foreground Segmentation and Labeling for Real-Time Video Surveillance

Directory of Open Access Journals (Sweden)

Wei Song

2016-09-01

Real-time and accurate background modeling is an important research topic in the fields of remote monitoring and video surveillance. Meanwhile, effective foreground detection is a preliminary requirement and decision-making basis for sustainable energy management, especially in smart meters. The environment monitoring results provide a decision-making basis for energy-saving strategies. For real-time moving object detection in video, this paper applies a parallel computing technology to develop a feedback foreground–background segmentation method and a parallel connected component labeling (PCCL) algorithm. In the background modeling method, pixel-wise color histograms in graphics processing unit (GPU) memory are generated from sequential images. If a pixel color in the current image does not lie near the peaks of its histogram, it is segmented as a foreground pixel. From the foreground segmentation results, a PCCL algorithm is proposed to cluster the foreground pixels into several groups in order to distinguish separate blobs. Because noisy spots and sparkles in the foreground segmentation results always contain a small number of pixels, the small blobs are removed as noise in order to refine the segmentation results. The proposed GPU-based image processing algorithms are implemented using the compute unified device architecture (CUDA) toolkit. The testing results show a significant enhancement in both speed and accuracy.
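The pipeline described in this abstract (per-pixel histogram background model, foreground test against the histogram peak, blob labeling, small-blob removal) can be illustrated with a minimal sequential sketch. This is not the authors' CUDA code or their PCCL algorithm: it runs on the CPU, uses grayscale intensities instead of color, and the function names, the `bins`, `peak_ratio` and `min_size` parameters, and the breadth-first labeling are all assumptions made for illustration.

```python
from collections import deque


def build_histograms(frames, bins=8, levels=256):
    """Accumulate one intensity histogram per pixel over the training frames."""
    h, w = len(frames[0]), len(frames[0][0])
    hist = [[[0] * bins for _ in range(w)] for _ in range(h)]
    for frame in frames:
        for y in range(h):
            for x in range(w):
                hist[y][x][frame[y][x] * bins // levels] += 1
    return hist


def segment_foreground(frame, hist, bins=8, levels=256, peak_ratio=0.5):
    """Mark a pixel as foreground when its bin count is well below the peak
    of that pixel's histogram (peak_ratio is an illustrative threshold)."""
    h, w = len(frame), len(frame[0])
    mask = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            counts = hist[y][x]
            b = frame[y][x] * bins // levels
            if counts[b] < peak_ratio * max(counts):
                mask[y][x] = 1
    return mask


def label_blobs(mask, min_size=2):
    """Sequential 4-connected labeling; blobs below min_size are dropped as noise."""
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    next_label = 1
    for sy in range(h):
        for sx in range(w):
            if mask[sy][sx] == 0 or labels[sy][sx] != 0:
                continue
            queue = deque([(sy, sx)])
            labels[sy][sx] = next_label
            blob = [(sy, sx)]
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and labels[ny][nx] == 0:
                        labels[ny][nx] = next_label
                        queue.append((ny, nx))
                        blob.append((ny, nx))
            if len(blob) < min_size:
                for y, x in blob:
                    labels[y][x] = -1  # visited, but treated as noise
            else:
                next_label += 1
    cleaned = [[max(v, 0) for v in row] for row in labels]
    return cleaned, next_label - 1
```

On a GPU the first two functions map naturally to one thread per pixel, since each pixel is processed independently; the labeling step is the part that needs a genuinely parallel algorithm such as the PCCL method the abstract refers to.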
GPU based numerical simulation of core shooting process

Directory of Open Access Journals (Sweden)

Yi-zhong Zhang

2017-11-01

The core shooting process is the most widely used technique to make sand cores and it plays an important role in the quality of sand cores. Although numerical simulation can hopefully optimize the core shooting process, research on numerical simulation of the core shooting process is very limited. Based on a two-fluid model (TFM) and a kinetic-friction constitutive correlation, a program for 3D numerical simulation of the core shooting process has been developed and achieved good agreement with in-situ experiments. To match the needs of engineering applications, a graphics processing unit (GPU) has also been used to improve the calculation efficiency. The parallel algorithm based on the Compute Unified Device Architecture (CUDA) platform can significantly decrease computing time through multi-threaded GPU execution. In this work, the program accelerated by the CUDA parallelization method was developed and the accuracy of the calculations was ensured by comparing with in-situ experimental results photographed by a high-speed camera. The design and optimization of the parallel algorithm were discussed. The simulation result of a sand core test-piece indicated the improvement of the calculation efficiency by GPU. The developed program has also been validated by in-situ experiments with a transparent core-box, a high-speed camera, and a pressure measuring system. The computing time of the parallel program was reduced by nearly 95% while the simulation result was still quite consistent with experimental data. The GPU parallelization method can successfully solve the problem of low computational efficiency of the 3D sand shooting simulation program, and thus the developed GPU program is appropriate for engineering applications.
GPU based contouring method on grid DEM data

Science.gov (United States)

Tan, Liheng; Wan, Gang; Li, Feng; Chen, Xiaohui; Du, Wenlong

2017-08-01

This paper presents a novel method to generate contour lines from grid DEM data based on the programmable GPU pipeline. The previous contouring approaches often use CPU to construct a finite element mesh from the raw DEM data, and then extract contour segments from the elements. They also need a tracing or sorting strategy to generate the final continuous contours. These approaches can be heavily CPU-costing and time-consuming. Meanwhile the generated contours would be unsmooth if the raw data is sparsely distributed. Unlike the CPU approaches, we employ the GPU's vertex shader to generate a triangular mesh with arbitrary user-defined density, in which the height of each vertex is calculated through a third-order Cardinal spline function. Then in the same frame, segments are extracted from the triangles by the geometry shader, and translated to the CPU-side with an internal order in the GPU's transform feedback stage. Finally we propose a "Grid Sorting" algorithm to achieve the continuous contour lines by travelling the segments only once. Our method makes use of multiple stages of GPU pipeline for computation, which can generate smooth contour lines, and is significantly faster than the previous CPU approaches. The algorithm can be easily implemented with OpenGL 3.3 API or higher on consumer-level PCs.
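The segment-extraction step, which the abstract assigns to the geometry shader, amounts to intersecting each triangle with a horizontal plane at the contour height. The sketch below shows that step alone in plain Python as an illustration; it is not the authors' shader code, it omits the Cardinal spline densification and the "Grid Sorting" stage, and the names `interpolate` and `triangle_segment` are hypothetical.

```python
def interpolate(p, q, level):
    """Point on edge p-q (each an (x, y, z) tuple) where the height equals level."""
    (x0, y0, z0), (x1, y1, z1) = p, q
    t = (level - z0) / (z1 - z0)
    return (x0 + t * (x1 - x0), y0 + t * (y1 - y0))


def triangle_segment(tri, level):
    """Return the contour segment crossing one triangle, or None if it is not crossed.

    A vertex counts as 'below' when z < level; an edge is crossed when its two
    endpoints fall on different sides, so a triangle yields 0 or 2 crossings.
    """
    points = []
    for i in range(3):
        p, q = tri[i], tri[(i + 1) % 3]
        if (p[2] < level) != (q[2] < level):
            points.append(interpolate(p, q, level))
    return tuple(points) if len(points) == 2 else None


# Example: a triangle rising from height 0 to height 2, cut at height 1.
segment = triangle_segment(((0, 0, 0), (2, 0, 2), (0, 2, 2)), 1.0)
```

Because every triangle is handled independently, this test is well suited to a per-primitive shader stage; joining the resulting segments into continuous lines is the separate ordering problem the abstract addresses.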
GPU-based prompt gamma ray imaging from boron neutron capture therapy

International Nuclear Information System (INIS)

Yoon, Do-Kun; Jung, Joo-Young; Suk Suh, Tae; Jo Hong, Key; Sil Lee, Keum

2015-01-01

Purpose: The purpose of this research is to perform the fast reconstruction of a prompt gamma ray image using a graphics processing unit (GPU) computation from boron neutron capture therapy (BNCT) simulations. Methods: To evaluate the accuracy of the reconstructed image, a phantom including four boron uptake regions (BURs) was used in the simulation. After the Monte Carlo simulation of the BNCT, the modified ordered subset expectation maximization reconstruction algorithm using the GPU computation was used to reconstruct the images with fewer projections. The computation times for image reconstruction were compared between the GPU and the central processing unit (CPU). Also, the accuracy of the reconstructed image was evaluated by a receiver operating characteristic (ROC) curve analysis. Results: The image reconstruction time using the GPU was 196 times faster than the conventional reconstruction time using the CPU. For the four BURs, the area under curve values from the ROC curve were 0.6726 (A-region), 0.6890 (B-region), 0.7384 (C-region), and 0.8009 (D-region). Conclusions: The tomographic image using the prompt gamma ray event from the BNCT simulation was acquired using the GPU computation in order to perform a fast reconstruction during treatment. The authors verified the feasibility of the prompt gamma ray image reconstruction using the GPU computation for BNCT simulations
Simulating spin models on GPU

Science.gov (United States)

Weigel, Martin

2011-09-01

Over the last couple of years it has been realized that the vast computational power of graphics processing units (GPUs) could be harvested for purposes other than the video game industry. This power, which at least nominally exceeds that of current CPUs by large factors, results from the relative simplicity of the GPU architectures as compared to CPUs, combined with a large number of parallel processing units on a single chip. To benefit from this setup for general computing purposes, the problems at hand need to be prepared in a way to profit from the inherent parallelism and hierarchical structure of memory accesses. In this contribution I discuss the performance potential for simulating spin models, such as the Ising model, on GPU as compared to conventional simulations on CPU.
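One common way to expose the parallelism mentioned here for the nearest-neighbour Ising model is a checkerboard decomposition: spins on one sublattice have neighbours only on the other, so all spins of one colour can be updated at the same time. The abstract does not state which scheme the author uses, so the following is only an assumed illustration, written sequentially in Python for a 2D square lattice with periodic boundaries and even side length.

```python
import math
import random


def checkerboard_sweep(spins, beta, rng):
    """One Metropolis sweep over an n x n periodic lattice (n even), one colour at a time.

    Within a colour, no two sites are neighbours, so the inner loops could be
    executed by independent GPU threads; here they simply run in sequence.
    """
    n = len(spins)
    for color in (0, 1):
        for i in range(n):
            for j in range(n):
                if (i + j) % 2 != color:
                    continue
                neighbours = (spins[(i - 1) % n][j] + spins[(i + 1) % n][j]
                              + spins[i][(j - 1) % n] + spins[i][(j + 1) % n])
                delta_e = 2 * spins[i][j] * neighbours  # energy change of a flip, J = 1
                if delta_e <= 0 or rng.random() < math.exp(-beta * delta_e):
                    spins[i][j] = -spins[i][j]


def magnetization(spins):
    """Average spin per site."""
    n = len(spins)
    return sum(map(sum, spins)) / (n * n)


# Example: a few sweeps at inverse temperature 0.3 from a random start.
rng = random.Random(1)
lattice = [[rng.choice((-1, 1)) for _ in range(8)] for _ in range(8)]
for _ in range(10):
    checkerboard_sweep(lattice, 0.3, rng)
```

A GPU version would additionally stage tiles of the lattice in fast on-chip memory, which is the "hierarchical structure of memory accesses" the abstract alludes to; that part is not shown.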
Parallel Optimization of 3D Cardiac Electrophysiological Model Using GPU

Directory of Open Access Journals (Sweden)

Yong Xia

2015-01-01

Large-scale 3D virtual heart model simulations are highly demanding in computational resources. This imposes a big challenge on traditional CPU-based computation resources, which either cannot meet the demands of the whole computation or are not easily available due to their high cost. The GPU as a parallel computing environment therefore provides an alternative for solving the large-scale computational problems of whole heart modeling. In this study, using a 3D sheep atrial model as a test bed, we developed a GPU-based simulation algorithm to simulate the conduction of electrical excitation waves in the 3D atria. In the GPU algorithm, a multicellular tissue model was split into two components: one is the single cell model (ordinary differential equation) and the other is the diffusion term of the monodomain model (partial differential equation). Such a decoupling enabled realization of the GPU parallel algorithm. Furthermore, several optimization strategies were proposed based on the features of the virtual heart model, which enabled a 200-fold speedup as compared to a CPU implementation. In conclusion, an optimized GPU algorithm has been developed that provides an economic and powerful platform for 3D whole heart simulations.
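The decoupling described in this abstract is a form of operator splitting: in each time step the cell kinetics are advanced independently at every node, then the diffusion term couples neighbouring nodes. The sketch below shows the idea on a 1D cable with FitzHugh-Nagumo-like kinetics. Everything specific in it is an assumption for illustration: the kinetics are a stand-in for the sheep atrial cell model, the geometry is 1D rather than 3D, the parameter values are arbitrary, and the code runs sequentially on the CPU.

```python
def reaction_step(v, w, dt, a=0.1, eps=0.01, gamma=0.5):
    """ODE part: advance each cell independently with forward Euler."""
    for i in range(len(v)):
        dv = v[i] * (1.0 - v[i]) * (v[i] - a) - w[i]
        dw = eps * (v[i] - gamma * w[i])
        v[i] += dt * dv
        w[i] += dt * dw


def diffusion_step(v, dt, dx, d=1.0):
    """PDE part: explicit finite-difference diffusion with zero-gradient ends.

    Stable only when dt * d / dx**2 <= 0.5.
    """
    n = len(v)
    old = v[:]
    for i in range(n):
        left = old[i - 1] if i > 0 else old[i]
        right = old[i + 1] if i < n - 1 else old[i]
        v[i] = old[i] + dt * d * (left - 2.0 * old[i] + right) / (dx * dx)


def split_step(v, w, dt, dx):
    """One time step: cell kinetics first, then diffusion."""
    reaction_step(v, w, dt)
    diffusion_step(v, dt, dx)


# Example: stimulate the left end of a 50-node cable and advance 100 steps.
voltage = [1.0 if i < 5 else 0.0 for i in range(50)]
recovery = [0.0] * 50
for _ in range(100):
    split_step(voltage, recovery, dt=0.1, dx=1.0)
```

The reaction step has no dependence between nodes, and the diffusion step reads only the previous values of immediate neighbours, so both loops can be assigned one GPU thread per node.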
GPU Acceleration of DSP for Communication Receivers.

Science.gov (United States)

Gunther, Jake; Gunther, Hyrum; Moon, Todd

2017-09-01

Graphics processing unit (GPU) implementations of signal processing algorithms can outperform CPU-based implementations. This paper describes the GPU implementation of several algorithms encountered in a wide range of high-data rate communication receivers including filters, multirate filters, numerically controlled oscillators, and multi-stage digital down converters. These structures are tested by processing the 20 MHz wide FM radio band (88-108 MHz). Two receiver structures are explored: a single channel receiver and a filter bank channelizer. Both run in real time on NVIDIA GeForce GTX 1080 graphics card.
Quick plasma equilibrium reconstruction based on GPU

International Nuclear Information System (INIS)

Xiao Bingjia; Huang, Y.; Luo, Z.P.; Yuan, Q.P.; Lao, L.

2014-01-01

A parallel code named P-EFIT, which can complete an equilibrium reconstruction iteration in 250 μs, is described. It is built on the CUDA™ architecture using a Graphical Processing Unit (GPU). The optimization of middle-scale matrix multiplication on the GPU and an algorithm that efficiently solves block tri-diagonal linear systems in parallel are described. A benchmark test is conducted. A static test proves the accuracy of P-EFIT and a simulation test proves the feasibility of using P-EFIT for real-time reconstruction on 65x65 computation grids. (author)
Proposed applications with implementation techniques of the upcoming renewable energy resource, The Tesla Turbine

International Nuclear Information System (INIS)

Khan, M Usman Saeed; Maqsood, M Irfan; Ali, Ehsan; Jamal, Shah; Javed, M

2013-01-01

Recent research has shown that the Tesla turbine can be one of the future efficient sources of renewable energy. Modern techniques used for designing the Tesla turbine have given optimum results regarding efficiency and applications. In this paper we suggest fully coordinated applications of the Tesla turbine in different fields, particularly in power generation at both low and high generation levels. In energy-deficient countries the Tesla turbine has a wide range of applications and can play an important role in energy management systems. Our proposed applications include the use of the Tesla turbine as a renewable energy resource, in distributed generation systems, at home for power generation, in irrigation channels, and in hybrid electric vehicles. All applications are explained with the help of flow charts and block diagrams, and their implementation techniques are also explained in detail. The results of physical experiments and simulations are also included for some applications.
gWEGA: GPU-accelerated WEGA for molecular superposition and shape comparison.

Science.gov (United States)

Yan, Xin; Li, Jiabo; Gu, Qiong; Xu, Jun

2014-06-05

Virtual screening of a large chemical library for drug lead identification requires searching/superimposing a large number of three-dimensional (3D) chemical structures. This article reports a graphic processing unit (GPU)-accelerated weighted Gaussian algorithm (gWEGA) that expedites shape or shape-feature similarity score-based virtual screening. With 86 GPU nodes (each node has one GPU card), gWEGA can screen 110 million conformations derived from an entire ZINC drug-like database with diverse antidiabetic agents as query structures within 2 s (i.e., screening more than 55 million conformations per second). The rapid screening speed was accomplished through the massive parallelization on multiple GPU nodes and rapid prescreening of 3D structures (based on their shape descriptors and pharmacophore feature compositions). Copyright © 2014 Wiley Periodicals, Inc.
Advantages of GPU technology in DFT calculations of intercalated graphene

Science.gov (United States)

Pešić, J.; Gajić, R.

2014-09-01

Over the past few years, the expansion of general-purpose graphic-processing unit (GPGPU) technology has had a great impact on computational science. GPGPU is the utilization of a graphics-processing unit (GPU) to perform calculations in applications usually handled by the central processing unit (CPU). Use of GPGPUs as a way to increase computational power in the material sciences has significantly decreased computational costs in already highly demanding calculations. The level of acceleration and parallelization depends on the problem itself. Some problems can benefit from GPU acceleration and parallelization, such as the finite-difference time-domain algorithm (FDTD) and density-functional theory (DFT), while others cannot take advantage of these modern technologies. A number of GPU-supported applications have emerged in the past several years (www.nvidia.com/object/gpu-applications.html). Quantum Espresso (QE) is reported as an integrated suite of open source computer codes for electronic-structure calculations and materials modeling at the nano-scale. It is based on DFT, the use of a plane-waves basis and a pseudopotential approach. Since QE version 5.0, GPU support has been implemented as a plug-in component for standard QE packages that allows exploiting the capabilities of Nvidia GPU graphic cards (www.qe-forge.org/gf/proj). In this study, we have examined the impact of the usage of GPU acceleration and parallelization on the numerical performance of DFT calculations. Graphene has been attracting attention worldwide and has already shown some remarkable properties. We have studied an intercalated graphene, using the QE package PHonon, which employs GPU. The term 'intercalation' refers to a process whereby foreign adatoms are inserted onto a graphene lattice. In addition, by intercalating different atoms between graphene layers, it is possible to tune their physical properties. Our experiments have shown there are benefits from using GPUs, and we reached an
Advantages of GPU technology in DFT calculations of intercalated graphene

International Nuclear Information System (INIS)

Pešić, J; Gajić, R

2014-01-01

Over the past few years, the expansion of general-purpose graphic-processing unit (GPGPU) technology has had a great impact on computational science. GPGPU is the utilization of a graphics-processing unit (GPU) to perform calculations in applications usually handled by the central processing unit (CPU). Use of GPGPUs as a way to increase computational power in the material sciences has significantly decreased computational costs in already highly demanding calculations. The level of acceleration and parallelization depends on the problem itself. Some problems can benefit from GPU acceleration and parallelization, such as the finite-difference time-domain algorithm (FDTD) and density-functional theory (DFT), while others cannot take advantage of these modern technologies. A number of GPU-supported applications have emerged in the past several years (www.nvidia.com/object/gpu-applications.html). Quantum Espresso (QE) is reported as an integrated suite of open source computer codes for electronic-structure calculations and materials modeling at the nano-scale. It is based on DFT, the use of a plane-waves basis and a pseudopotential approach. Since QE version 5.0, GPU support has been implemented as a plug-in component for standard QE packages that allows exploiting the capabilities of Nvidia GPU graphic cards (www.qe-forge.org/gf/proj). In this study, we have examined the impact of the usage of GPU acceleration and parallelization on the numerical performance of DFT calculations. Graphene has been attracting attention worldwide and has already shown some remarkable properties. We have studied an intercalated graphene, using the QE package PHonon, which employs GPU. The term 'intercalation' refers to a process whereby foreign adatoms are inserted onto a graphene lattice. In addition, by intercalating different atoms between graphene layers, it is possible to tune their physical properties. Our experiments have shown there are benefits from using GPUs, and we reached an
High performance MRI simulations of motion on multi-GPU systems.

Science.gov (United States)

Xanthis, Christos G; Venetis, Ioannis E; Aletras, Anthony H

2014-07-04

MRI physics simulators have been developed in the past for optimizing imaging protocols and for training purposes. However, these simulators have only addressed motion within a limited scope. The purpose of this study was the incorporation of realistic motion, such as cardiac motion, respiratory motion and flow, within MRI simulations in a high performance multi-GPU environment. Three different motion models were introduced in the Magnetic Resonance Imaging SIMULator (MRISIMUL) of this study: cardiac motion, respiratory motion and flow. Simulation of a simple Gradient Echo pulse sequence and a CINE pulse sequence on the corresponding anatomical model was performed. Myocardial tagging was also investigated. In pulse sequence design, software crushers were introduced to accommodate the long execution times in order to avoid spurious echoes formation. The displacement of the anatomical model isochromats was calculated within the Graphics Processing Unit (GPU) kernel for every timestep of the pulse sequence. Experiments that would allow simulation of custom anatomical and motion models were also performed. Last, simulations of motion with MRISIMUL on single-node and multi-node multi-GPU systems were examined. Gradient Echo and CINE images of the three motion models were produced and motion-related artifacts were demonstrated. The temporal evolution of the contractility of the heart was presented through the application of myocardial tagging. Better simulation performance and image quality were presented through the introduction of software crushers without the need to further increase the computational load and GPU resources. Last, MRISIMUL demonstrated an almost linear scalable performance with the increasing number of available GPU cards, in both single-node and multi-node multi-GPU computer systems. MRISIMUL is the first MR physics simulator to have implemented motion with a 3D large computational load on a single computer multi-GPU configuration. The incorporation
A new interlock design for the TESLA RF system

International Nuclear Information System (INIS)

Leich, H.; Kahl, J.; Choroba, S.; Grevsmuehl, T.; Heidbrook, N.

2001-01-01

The RF system for TESLA requires a comprehensive interlock system. Usually interlock systems are organized in a hierarchical way. In order to react to different fault conditions in a fast and flexible manner a nonhierarchical organization seems to be the better solution. At the TESLA Test Facility (TTF) at DESY the authors will install a nonhierarchical interlock system that is based on user designed reprogrammable gate-arrays (FPGA's) which incorporate an embedded microcontroller system. This system could be used later for the TESLA linear collider replacing a strictly hierarchical design
Living laboratory for Nikola Tesla. Living laboratories, Tesla, Second Life, sustainable construction technologies and renewable energy sources; Wohnlabor fuer Nikola Tesla. Ueber Wohnlabors, Tesla, Second Life, nachhaltige Bautechnologien und erneuerbare Energie

Energy Technology Data Exchange (ETDEWEB)

Redi, Ivan; Redi, Andrea; Jovanovic, Branimir (and others)

2008-07-01

Adventure is the opposite of conventional teaching. Adventure is the moment when experience alone is not enough. Sometimes, courageous people challenge the nature of things, helping us to get new insights and achieve a new viewpoint. The experience-oriented "work in progress" university is an adventure of this kind. The book looks into the Tesla laboratory and the Wardenclyffe Tower, both of which could not be completed for financial reasons, and addresses them from the standpoint of today's technology. The conceptual section is based on the "Tesla doctrine", which comprises fundamental philosophical statements on the progress of civilisation. The book presents the results of the investigation. The 16 architectural projects presented here were developed live on the online platform Second Life, ORTLOS Sim. (orig.)
An efficient implementation of 3D high-resolution imaging for large-scale seismic data with GPU/CPU heterogeneous parallel computing

Science.gov (United States)

Xu, Jincheng; Liu, Wei; Wang, Jin; Liu, Linong; Zhang, Jianfeng

2018-02-01

De-absorption pre-stack time migration (QPSTM) compensates for the absorption and dispersion of seismic waves by introducing an effective Q parameter, thereby making it an effective tool for 3D, high-resolution imaging of seismic data. Although the optimal aperture obtained via stationary-phase migration reduces the computational cost of 3D QPSTM and yields 3D stationary-phase QPSTM, the associated computational efficiency is still the main problem in the processing of 3D, high-resolution images for real large-scale seismic data. In the current paper, we proposed a division method for large-scale, 3D seismic data to optimize the performance of stationary-phase QPSTM on clusters of graphics processing units (GPU). Then, we designed an imaging point parallel strategy to achieve an optimal parallel computing performance. Afterward, we adopted an asynchronous double buffering scheme with multiple streams to perform the GPU/CPU parallel computing. Moreover, several key optimization strategies of computation and storage based on the compute unified device architecture (CUDA) were adopted to accelerate the 3D stationary-phase QPSTM algorithm. Compared with the initial GPU code, the implementation of the key optimization steps, including thread optimization, shared memory optimization, register optimization and use of special function units (SFU), greatly improved the efficiency. A numerical example employing real large-scale, 3D seismic data showed that our scheme is nearly 80 times faster than the CPU-QPSTM algorithm. Our GPU/CPU heterogeneous parallel computing framework significantly reduces the computational cost and facilitates 3D high-resolution imaging for large-scale seismic data.

Tesla inventor of the electrical age

CERN Document Server

Carlson, W Bernard

2013-01-01

Nikola Tesla was a major contributor to the electrical revolution that transformed daily life at the turn of the twentieth century. His inventions, patents, and theoretical work formed the basis of modern AC electricity, and contributed to the development of radio and television. Like his competitor Thomas Edison, Tesla was one of America's first celebrity scientists, enjoying the company of New York high society and dazzling the likes of Mark Twain with his electrical demonstrations. An astute self-promoter and gifted showman, he cultivated a public image of the eccentric genius. Even at t
ALICE HLT high speed tracking on GPU

CERN Document Server

Gorbunov, Sergey; Aamodt, Kenneth; Alt, Torsten; Appelshauser, Harald; Arend, Andreas; Bach, Matthias; Becker, Bruce; Bottger, Stefan; Breitner, Timo; Busching, Henner; Chattopadhyay, Sukalyan; Cleymans, Jean; Cicalo, Corrado; Das, Indranil; Djuvsland, Oystein; Engel, Heiko; Erdal, Hege Austrheim; Fearick, Roger; Haaland, Oystein Senneset; Hille, Per Thomas; Kalcher, Sebastian; Kanaki, Kalliopi; Kebschull, Udo Wolfgang; Kisel, Ivan; Kretz, Matthias; Lara, Camillo; Lindal, Sven; Lindenstruth, Volker; Masoodi, Arshad Ahmad; Ovrebekk, Gaute; Panse, Ralf; Peschek, Jorg; Ploskon, Mateusz; Pocheptsov, Timur; Ram, Dinesh; Rascanu, Theodor; Richter, Matthias; Rohrich, Dieter; Ronchetti, Federico; Skaali, Bernhard; Smorholm, Olav; Stokkevag, Camilla; Steinbeck, Timm Morten; Szostak, Artur; Thader, Jochen; Tveter, Trine; Ullaland, Kjetil; Vilakazi, Zeblon; Weis, Robert; Yin, Zhong-Bao; Zelnicek, Pierre

2011-01-01

The on-line event reconstruction in ALICE is performed by the High Level Trigger, which should process up to 2000 events per second in proton-proton collisions and up to 300 central events per second in heavy-ion collisions, corresponding to an input data stream of 30 GB/s. In order to fulfill the time requirements, a fast on-line tracker has been developed. The algorithm combines a Cellular Automaton method used for fast pattern recognition and the Kalman Filter method for fitting of found trajectories and for the final track selection. The tracker was adapted to run on Graphics Processing Units (GPU) using the NVIDIA Compute Unified Device Architecture (CUDA) framework. The implementation of the algorithm had to be adjusted at many points to allow for an efficient usage of the graphics cards. In particular, achieving a good overall workload for many processor cores, efficient transfer to and from the GPU, as well as optimized utilization of the different memories the GPU offers turned out to be cri...
Spiking neural networks on high performance computer clusters

Science.gov (United States)

Chen, Chong; Taha, Tarek M.

2011-09-01

In this paper we examine the acceleration of two spiking neural network models on three clusters of multicore processors representing three categories of processors: x86, STI Cell, and NVIDIA GPGPUs. The x86 cluster utilized consists of 352 dual-core AMD Opterons, the Cell cluster consists of 320 Sony Playstation 3s, while the GPGPU cluster contains 32 NVIDIA Tesla S1070 systems. The results indicate that the GPGPU platform can dominate in performance compared to the Cell and x86 platforms examined. From a cost perspective, the GPGPU is more expensive in terms of neuron/s throughput. If the cost of GPGPUs goes down in the future, this platform will become very cost effective for these models.
Simulation of isothermal multi-phase fuel-coolant interaction using MPS method with GPU acceleration

Energy Technology Data Exchange (ETDEWEB)

Gou, W.; Zhang, S.; Zheng, Y. [Zhejiang Univ., Hangzhou (China). Center for Engineering and Scientific Computation

2016-07-15

The energetic fuel-coolant interaction (FCI) has been one of the primary safety concerns in nuclear power plants. Graphical processing unit (GPU) implementation of the moving particle semi-implicit (MPS) method is presented and used to simulate the fuel coolant interaction problem. The governing equations are discretized with the particle interaction model of MPS. Detailed implementation on single-GPU is introduced. The three-dimensional broken dam is simulated to verify the developed GPU acceleration MPS method. The proposed GPU acceleration algorithm and developed code are then used to simulate the FCI problem. As a summary of results, the developed GPU-MPS method showed a good agreement with the experimental observation and theoretical prediction.
GPU-based relative fuzzy connectedness image segmentation

International Nuclear Information System (INIS)

Zhuge Ying; Ciesielski, Krzysztof C.; Udupa, Jayaram K.; Miller, Robert W.

2013-01-01

Purpose: Recently, clinical radiological research and practice are becoming increasingly quantitative. Further, images continue to increase in size and volume. For quantitative radiology to become practical, it is crucial that image segmentation algorithms and their implementations are rapid and yield practical run time on very large data sets. The purpose of this paper is to present a parallel version of an algorithm that belongs to the family of fuzzy connectedness (FC) algorithms, to achieve an interactive speed for segmenting large medical image data sets. Methods: The most common FC segmentations, optimizing an ℓ∞-based energy, are known as relative fuzzy connectedness (RFC) and iterative relative fuzzy connectedness (IRFC). Both RFC and IRFC objects (of which IRFC contains RFC) can be found via linear time algorithms, linear with respect to the image size. The new algorithm, P-ORFC (for parallel optimal RFC), which is implemented by using NVIDIA’s Compute Unified Device Architecture (CUDA) platform, considerably improves the computational speed of the above mentioned CPU based IRFC algorithm. Results: Experiments based on four data sets of small, medium, large, and super data size, achieved speedup factors of 32.8×, 22.9×, 20.9×, and 17.5×, correspondingly, on the NVIDIA Tesla C1060 platform. Although the output of P-ORFC need not precisely match that of IRFC output, it is very close to it and, as the authors prove, always lies between the RFC and IRFC objects. Conclusions: A parallel version of a top-of-the-line algorithm in the family of FC has been developed on the NVIDIA GPUs. An interactive speed of segmentation has been achieved, even for the largest medical image data set. Such GPU implementations may play a crucial role in automatic anatomy recognition in clinical radiology.
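As a rough illustration of the relative fuzzy connectedness idea mentioned in this abstract, the sketch below uses the standard max-min formulation: the strength of a path is its weakest link, and the connectedness of two elements is the strength of the best path between them. This is a sequential, Dijkstra-style toy in Python, not the authors' P-ORFC CUDA algorithm. The graph representation and the names `connectedness_map`, `rfc_object` and `chain` are assumptions made for illustration only.

```python
import heapq

def connectedness_map(affinity, seed):
    """Max-min path strength from `seed` to every node.

    `affinity` maps node -> list of (neighbor, link strength in [0, 1]).
    This adjacency-list layout is a hypothetical choice for the sketch.
    """
    best = {node: 0.0 for node in affinity}
    best[seed] = 1.0
    heap = [(-1.0, seed)]  # max-heap emulated with negated strengths
    while heap:
        neg_strength, node = heapq.heappop(heap)
        strength = -neg_strength
        if strength < best[node]:
            continue  # stale heap entry, a stronger path was already found
        for neighbor, link in affinity[node]:
            candidate = min(strength, link)  # a path is as strong as its weakest link
            if candidate > best[neighbor]:
                best[neighbor] = candidate
                heapq.heappush(heap, (-candidate, neighbor))
    return best

def rfc_object(affinity, object_seed, background_seed):
    """Nodes more strongly connected to the object seed than to the background seed."""
    obj = connectedness_map(affinity, object_seed)
    bkg = connectedness_map(affinity, background_seed)
    return {node for node in affinity if obj[node] > bkg[node]}

# Toy 1-D "image": five pixels in a chain with a weak link between pixels 2 and 3.
chain = {
    0: [(1, 0.9)],
    1: [(0, 0.9), (2, 0.8)],
    2: [(1, 0.8), (3, 0.2)],
    3: [(2, 0.2), (4, 0.9)],
    4: [(3, 0.9)],
}
segment = rfc_object(chain, object_seed=0, background_seed=4)
print(sorted(segment))  # [0, 1, 2]
```

The weak 0.2 link caps every path that crosses it, so the object seeded at pixel 0 stops at pixel 2. A GPU version would presumably replace the priority queue with a data-parallel relaxation scheme, but that design is not described in the abstract.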
GPU-based relative fuzzy connectedness image segmentation

Energy Technology Data Exchange (ETDEWEB)

Zhuge Ying; Ciesielski, Krzysztof C.; Udupa, Jayaram K.; Miller, Robert W. [Radiation Oncology Branch, National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892 (United States); Department of Mathematics, West Virginia University, Morgantown, West Virginia 26506 (United States) and Medical Image Processing Group, Department of Radiology, University of Pennsylvania, Philadelphia, Pennsylvania 19104 (United States); Medical Image Processing Group, Department of Radiology, University of Pennsylvania, Philadelphia, Pennsylvania 19104 (United States); Radiation Oncology Branch, National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892 (United States)

2013-01-15

Purpose: Recently, clinical radiological research and practice are becoming increasingly quantitative. Further, images continue to increase in size and volume. For quantitative radiology to become practical, it is crucial that image segmentation algorithms and their implementations are rapid and yield practical run time on very large data sets. The purpose of this paper is to present a parallel version of an algorithm that belongs to the family of fuzzy connectedness (FC) algorithms, to achieve an interactive speed for segmenting large medical image data sets. Methods: The most common FC segmentations, optimizing an ℓ∞-based energy, are known as relative fuzzy connectedness (RFC) and iterative relative fuzzy connectedness (IRFC). Both RFC and IRFC objects (of which IRFC contains RFC) can be found via linear time algorithms, linear with respect to the image size. The new algorithm, P-ORFC (for parallel optimal RFC), which is implemented by using NVIDIA's Compute Unified Device Architecture (CUDA) platform, considerably improves the computational speed of the above mentioned CPU based IRFC algorithm. Results: Experiments based on four data sets of small, medium, large, and super data size, achieved speedup factors of 32.8×, 22.9×, 20.9×, and 17.5×, correspondingly, on the NVIDIA Tesla C1060 platform. Although the output of P-ORFC need not precisely match that of IRFC output, it is very close to it and, as the authors prove, always lies between the RFC and IRFC objects. Conclusions: A parallel version of a top-of-the-line algorithm in the family of FC has been developed on the NVIDIA GPUs. An interactive speed of segmentation has been achieved, even for the largest medical image data set. Such GPU implementations may play a crucial role in automatic anatomy recognition in clinical radiology.
GPU-based relative fuzzy connectedness image segmentation.

Science.gov (United States)

Zhuge, Ying; Ciesielski, Krzysztof C; Udupa, Jayaram K; Miller, Robert W

2013-01-01

Recently, clinical radiological research and practice are becoming increasingly quantitative. Further, images continue to increase in size and volume. For quantitative radiology to become practical, it is crucial that image segmentation algorithms and their implementations are rapid and yield practical run time on very large data sets. The purpose of this paper is to present a parallel version of an algorithm that belongs to the family of fuzzy connectedness (FC) algorithms, to achieve an interactive speed for segmenting large medical image data sets. The most common FC segmentations, optimizing an ℓ∞-based energy, are known as relative fuzzy connectedness (RFC) and iterative relative fuzzy connectedness (IRFC). Both RFC and IRFC objects (of which IRFC contains RFC) can be found via linear time algorithms, linear with respect to the image size. The new algorithm, P-ORFC (for parallel optimal RFC), which is implemented by using NVIDIA's Compute Unified Device Architecture (CUDA) platform, considerably improves the computational speed of the above mentioned CPU based IRFC algorithm. Experiments based on four data sets of small, medium, large, and super data size, achieved speedup factors of 32.8×, 22.9×, 20.9×, and 17.5×, correspondingly, on the NVIDIA Tesla C1060 platform. Although the output of P-ORFC need not precisely match that of IRFC output, it is very close to it and, as the authors prove, always lies between the RFC and IRFC objects. A parallel version of a top-of-the-line algorithm in the family of FC has been developed on the NVIDIA GPUs. An interactive speed of segmentation has been achieved, even for the largest medical image data set. Such GPU implementations may play a crucial role in automatic anatomy recognition in clinical radiology.
GPU-based relative fuzzy connectedness image segmentation

Science.gov (United States)

Zhuge, Ying; Ciesielski, Krzysztof C.; Udupa, Jayaram K.; Miller, Robert W.

2013-01-01

Purpose: Recently, clinical radiological research and practice are becoming increasingly quantitative. Further, images continue to increase in size and volume. For quantitative radiology to become practical, it is crucial that image segmentation algorithms and their implementations are rapid and yield practical run time on very large data sets. The purpose of this paper is to present a parallel version of an algorithm that belongs to the family of fuzzy connectedness (FC) algorithms, to achieve an interactive speed for segmenting large medical image data sets. Methods: The most common FC segmentations, optimizing an ℓ∞-based energy, are known as relative fuzzy connectedness (RFC) and iterative relative fuzzy connectedness (IRFC). Both RFC and IRFC objects (of which IRFC contains RFC) can be found via linear time algorithms, linear with respect to the image size. The new algorithm, P-ORFC (for parallel optimal RFC), which is implemented by using NVIDIA’s Compute Unified Device Architecture (CUDA) platform, considerably improves the computational speed of the above mentioned CPU based IRFC algorithm. Results: Experiments based on four data sets of small, medium, large, and super data size, achieved speedup factors of 32.8×, 22.9×, 20.9×, and 17.5×, correspondingly, on the NVIDIA Tesla C1060 platform. Although the output of P-ORFC need not precisely match that of IRFC output, it is very close to it and, as the authors prove, always lies between the RFC and IRFC objects. Conclusions: A parallel version of a top-of-the-line algorithm in the family of FC has been developed on the NVIDIA GPUs. An interactive speed of segmentation has been achieved, even for the largest medical image data set. Such GPU implementations may play a crucial role in automatic anatomy recognition in clinical radiology. PMID:23298094
A Hybrid CPU/GPU Pattern-Matching Algorithm for Deep Packet Inspection.

Directory of Open Access Journals (Sweden)

Chun-Liang Lee

The large quantities of data now being transferred via high-speed networks have made deep packet inspection indispensable for security purposes. Scalable and low-cost signature-based network intrusion detection systems have been developed for deep packet inspection for various software platforms. Traditional approaches that only involve central processing units (CPUs) are now considered inadequate in terms of inspection speed. Graphic processing units (GPUs) have superior parallel processing power, but transmission bottlenecks can reduce optimal GPU efficiency. In this paper we describe our proposal for a hybrid CPU/GPU pattern-matching algorithm (HPMA) that divides and distributes the packet-inspecting workload between a CPU and GPU. All packets are initially inspected by the CPU and filtered using a simple pre-filtering algorithm, and packets that might contain malicious content are sent to the GPU for further inspection. Test results indicate that in terms of random payload traffic, the matching speed of our proposed algorithm was 3.4 times and 2.7 times faster than those of the AC-CPU and AC-GPU algorithms, respectively. Further, HPMA achieved higher energy efficiency than the other tested algorithms.
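The two-stage division of work described in this abstract can be sketched as follows. This is a minimal Python toy under stated assumptions, not the HPMA implementation: the pre-filter here (checking whether any `k`-byte signature prefix occurs in the payload) is a hypothetical stand-in for the paper's pre-filtering algorithm, and `full_match` is a plain substring search standing in for the GPU matching stage. All function names are illustrative.

```python
def build_prefilter(signatures, k=2):
    """Collect the first k bytes of every signature (assumes len(sig) >= k)."""
    return {sig[:k] for sig in signatures}

def maybe_malicious(payload, prefixes, k=2):
    """Cheap CPU-side check: does any k-byte window match a signature prefix?

    It can yield false positives but never false negatives, because a payload
    that contains a signature also contains that signature's first k bytes.
    """
    return any(payload[i:i + k] in prefixes for i in range(len(payload) - k + 1))

def full_match(payload, signatures):
    """Stand-in for the expensive second stage (the GPU matcher in the abstract)."""
    return [sig for sig in signatures if sig in payload]

def inspect(packets, signatures, k=2):
    """Return (alerts, number of packets forwarded to the second stage)."""
    prefixes = build_prefilter(signatures, k)
    alerts = {}
    forwarded = 0
    for index, payload in enumerate(packets):
        if not maybe_malicious(payload, prefixes, k):
            continue  # clean packets never reach the second stage
        forwarded += 1
        hits = full_match(payload, signatures)
        if hits:
            alerts[index] = hits
    return alerts, forwarded

signatures = [b"evil", b"worm"]
packets = [b"hello world", b"an evil payload", b"every day", b"plain text"]
alerts, forwarded = inspect(packets, signatures)
print(alerts, forwarded)  # {1: [b'evil']} 3
```

In this toy run three of the four packets pass the pre-filter but only one raises an alert. The benefit of such a scheme depends on how many packets the first stage can discard before they incur the transfer cost to the second stage.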
Functionality of veterinary identification microchips following low- (0.5 tesla) and high-field (3 tesla) magnetic resonance imaging.

Science.gov (United States)

Piesnack, Susann; Frame, Mairi E; Oechtering, Gerhard; Ludewig, Eberhard

2013-01-01

The ability to read patient identification microchips relies on the use of radiofrequency pulses. Since radiofrequency pulses also form an integral part of the magnetic resonance imaging (MRI) process, the possibility of loss of microchip function during MRI scanning is of concern. Previous clinical trials have shown microchip function to be unaffected by MR imaging using field strengths of 1 and 1.5 Tesla. As veterinary MRI scanners range widely in field strength, this study was devised to determine whether exposure to lower or higher field strengths than 1 Tesla would affect the function of different types of microchip. In a phantom study, a total of 300 International Standards Organisation (ISO)-approved microchips (100 each of three different types: ISO FDX-B 1.4 × 9 mm, ISO FDX-B 2.12 × 12 mm, ISO HDX 3.8 × 23 mm) were tested in a low field (0.5 Tesla) and a high field scanner (3.0 Tesla). A total of 50 microchips of each type were tested in each scanner. The phantom was composed of a fluid-filled freezer pack onto which a plastic pillow and a cardboard strip with affixed microchips were positioned. Following an MRI scan protocol simulating a head study, all of the microchips were accurately readable. Neither 0.5 nor 3 Tesla imaging affected microchip function in this study. © 2013 Veterinary Radiology & Ultrasound.
GPU Parallel Bundle Block Adjustment

Directory of Open Access Journals (Sweden)

ZHENG Maoteng

2017-09-01

To deal with massive data in photogrammetry, we introduce GPU parallel computing technology. The preconditioned conjugate gradient and inexact Newton methods are also applied to reduce the number of iterations needed to solve the normal equation. A brand new workflow of bundle adjustment is developed to utilize GPU parallel computing technology. Our method avoids the storage and inversion of the big normal matrix and computes the normal matrix in real time. The proposed method not only greatly decreases the memory requirement of the normal matrix, but also greatly improves the efficiency of bundle adjustment. It also achieves the same accuracy as the conventional method. Preliminary experimental results show that the bundle adjustment of a dataset with about 4500 images and 9 million image points can be done in only 1.5 minutes while achieving sub-pixel accuracy.
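The idea of solving the normal equation without storing or inverting the normal matrix can be illustrated with a matrix-free preconditioned conjugate gradient loop. The Python sketch below is a generic textbook scheme under stated assumptions, not the authors' GPU workflow: it solves (JᵀJ)x = Jᵀr by applying J and Jᵀ in sequence, and uses a Jacobi (diagonal) preconditioner chosen here purely for illustration. The names `J`, `r` and `pcg_normal_equations` are hypothetical, and every column of `J` is assumed to be nonzero.

```python
def matvec(J, v):
    """Compute J @ v for a dense row-list matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in J]

def rmatvec(J, u):
    """Compute J^T @ u without forming the transpose."""
    out = [0.0] * len(J[0])
    for row, ui in zip(J, u):
        for j, a in enumerate(row):
            out[j] += a * ui
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pcg_normal_equations(J, r, tol=1e-12, max_iter=100):
    """Solve (J^T J) x = J^T r with Jacobi-preconditioned CG.

    J^T J is never formed: each iteration applies J, then J^T.
    """
    n = len(J[0])
    # The diagonal of J^T J is the squared norm of each column of J.
    diag = [sum(row[j] ** 2 for row in J) for j in range(n)]
    x = [0.0] * n
    res = rmatvec(J, r)                      # residual for the start x = 0
    z = [ri / d for ri, d in zip(res, diag)] # preconditioned residual
    p = z[:]
    rz = dot(res, z)
    for _ in range(max_iter):
        if dot(res, res) ** 0.5 < tol:
            break
        Ap = rmatvec(J, matvec(J, p))        # (J^T J) p, matrix-free
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        res = [ri - alpha * ai for ri, ai in zip(res, Ap)]
        z = [ri / d for ri, d in zip(res, diag)]
        rz_new = dot(res, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x

# Tiny consistent system: r was generated from x = [1, 2].
J = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
r = [1.0, 3.0, 4.0]
x = pcg_normal_equations(J, r)
```

On this example `x` should come out close to `[1.0, 2.0]`. In a bundle adjustment setting `J` would be the large sparse Jacobian of the reprojection residuals, and the two products per iteration are the natural candidates for GPU parallelization.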
A comparative study of history-based versus vectorized Monte Carlo methods in the GPU/CUDA environment for a simple neutron eigenvalue problem

International Nuclear Information System (INIS)

Liu, T.; Du, X.; Ji, W.; Xu, G.; Brown, F.B.

2013-01-01

For nuclear reactor analysis such as the neutron eigenvalue calculations, the time consuming Monte Carlo (MC) simulations can be accelerated by using graphics processing units (GPUs). However, traditional MC methods are often history-based, and their performance on GPUs is affected significantly by the thread divergence problem. In this paper we describe the development of a newly designed event-based vectorized MC algorithm for solving the neutron eigenvalue problem. The code was implemented using NVIDIA's Compute Unified Device Architecture (CUDA), and tested on an NVIDIA Tesla M2090 GPU card. We found that although the vectorized MC algorithm greatly reduces the occurrence of thread divergence, thus enhancing the warp execution efficiency, the overall simulation speed is roughly ten times slower than the history-based MC code on GPUs. Profiling results suggest that the slow speed is probably due to the memory access latency caused by the large amount of global memory transactions. Possible solutions to improve the code efficiency are discussed. (authors)
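The difference between the two loop structures compared in this abstract can be shown with a deliberately simplified Python toy. It is not the authors' eigenvalue code: there is no geometry or fission, each particle simply collides until it is absorbed with probability `p_absorb` (which must be greater than zero for the loops to terminate), and each particle gets its own random stream so that both orderings produce identical results. All names are illustrative assumptions.

```python
import random

def history_based(n_particles, p_absorb, seed=0):
    """Follow each particle from birth to absorption before starting the next."""
    collisions = []
    for i in range(n_particles):
        rng = random.Random(seed + i)  # one independent stream per particle
        count = 0
        while True:
            count += 1
            if rng.random() < p_absorb:
                break  # absorbed: this history ends
        collisions.append(count)
    return collisions

def event_based(n_particles, p_absorb, seed=0):
    """Advance all live particles by one collision event per sweep."""
    rngs = [random.Random(seed + i) for i in range(n_particles)]
    collisions = [0] * n_particles
    alive = list(range(n_particles))
    while alive:
        survivors = []
        for i in alive:  # every particle in the bank runs the same event kernel
            collisions[i] += 1
            if rngs[i].random() >= p_absorb:
                survivors.append(i)
        alive = survivors  # compact the bank for the next sweep
    return collisions

same = history_based(100, 0.3) == event_based(100, 0.3)
print(same)  # True
```

In the history-based loop, particles with long histories keep running while others have finished, which is the source of thread divergence on a GPU. The event-based loop keeps all live particles on the same step, but it has to read and write the particle bank on every sweep, which is consistent with the global memory traffic that the abstract identifies as the likely bottleneck.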
A comparative study of history-based versus vectorized Monte Carlo methods in the GPU/CUDA environment for a simple neutron eigenvalue problem

Science.gov (United States)

Liu, Tianyu; Du, Xining; Ji, Wei; Xu, X. George; Brown, Forrest B.

2014-06-01

For nuclear reactor analysis such as the neutron eigenvalue calculations, the time consuming Monte Carlo (MC) simulations can be accelerated by using graphics processing units (GPUs). However, traditional MC methods are often history-based, and their performance on GPUs is affected significantly by the thread divergence problem. In this paper we describe the development of a newly designed event-based vectorized MC algorithm for solving the neutron eigenvalue problem. The code was implemented using NVIDIA's Compute Unified Device Architecture (CUDA), and tested on an NVIDIA Tesla M2090 GPU card. We found that although the vectorized MC algorithm greatly reduces the occurrence of thread divergence, thus enhancing the warp execution efficiency, the overall simulation speed is roughly ten times slower than the history-based MC code on GPUs. Profiling results suggest that the slow speed is probably due to the memory access latency caused by the large amount of global memory transactions. Possible solutions to improve the code efficiency are discussed.
Instrumentation of the forward region of the TESLA detector

International Nuclear Information System (INIS)

Buesser, Karsten

2004-01-01

The expected beam-beam interaction at the proposed TESLA electron-positron linear collider has a significant impact on the design of the TESLA detector. Especially the instrumentation of the very forward region down to polar angles below 5 mrad will have to handle an immense background of electrons and positrons adding up to TeVs of energy deposition per bunch crossing. Instrumentation down to small angles is crucial not only for the measurement of the luminosity through Bhabha scattering, but also to maximize the hermeticity of the detector. Additionally these charged particles from beamstrahlung have to be measured as part of the feedback system of the TESLA accelerator and could also be used for beam diagnostics. The present design of the TESLA detector foresees two calorimeters in the forward region whose technologies have to meet the requirements regarding detector resolutions and radiation hardness. (orig.)
Overview of implementation of DARPA GPU program in SAIC

Science.gov (United States)

Braunreiter, Dennis; Furtek, Jeremy; Chen, Hai-Wen; Healy, Dennis

2008-04-01

This paper reviews the implementation of DARPA MTO STAP-BOY program for both Phase I and II conducted at Science Applications International Corporation (SAIC). The STAP-BOY program conducts fast covariance factorization and tuning techniques for space-time adaptive process (STAP) Algorithm Implementation on Graphics Processor unit (GPU) Architectures for Embedded Systems. The first part of our presentation on the DARPA STAP-BOY program will focus on GPU implementation and algorithm innovations for a prototype radar STAP algorithm. The STAP algorithm will be implemented on the GPU, using stream programming (from companies such as PeakStream, ATI Technologies' CTM, and NVIDIA) and traditional graphics APIs. This algorithm will include fast range adaptive STAP weight updates and beamforming applications, each of which has been modified to exploit the parallel nature of graphics architectures.
High performance technique for database applications using a hybrid GPU/CPU platform

KAUST Repository

Zidan, Mohammed A.; Bonny, Talal; Salama, Khaled N.

2012-01-01

Hybrid GPU/CPU platform. In particular, our technique solves the problem of the low efficiency resulting from running short-length sequences in a database on a GPU. To verify our technique, we applied it to the widely used Smith-Waterman algorithm
Implementation and Optimization of GPU-Based Static State Security Analysis in Power Systems

Directory of Open Access Journals (Sweden)

Yong Chen

2017-01-01

Static state security analysis (SSSA) is one of the most important computations to check whether a power system is in a normal and secure operating state. It is a challenge to satisfy real-time requirements with CPU-based concurrent methods due to the intensive computations. A sensitivity analysis-based method using a graphics processing unit (GPU) is proposed for power systems, which can reduce calculation time by 40% compared to the execution on a 4-core CPU. The proposed method involves load flow analysis and sensitivity analysis. In load flow analysis, a multifrontal method for sparse LU factorization is explored on GPU through dynamic frontal task scheduling between CPU and GPU. The varying matrix operations during sensitivity analysis on GPU are highly optimized in this study. The results of performance evaluations show that the proposed GPU-based SSSA with optimized matrix operations can achieve a significant reduction in computation time.
Fast 3D elastic micro-seismic source location using new GPU features

Science.gov (United States)

Xue, Qingfeng; Wang, Yibo; Chang, Xu

2016-12-01

In this paper, we describe new GPU features and their applications in passive seismic - micro-seismic location. Locating micro-seismic events is quite important in seismic exploration, especially when searching for unconventional oil and gas resources. Different from the traditional ray-based methods, the wave equation method, such as the method we use in our paper, has a remarkable advantage in adapting to low signal-to-noise ratio conditions and does not need a person to select the data. However, because of their high computation cost, these methods are not widely used in industrial fields. To make the method useful, we implement imaging-like wave equation micro-seismic location in 3D elastic media and use GPU to accelerate our algorithm. We also introduce some new GPU features into the implementation to solve the data transfer and GPU utilization problems. Numerical and field data experiments show that our method can achieve a more than 30% performance improvement in GPU implementation just by using these new features.
Development of a 1.0 MV 100 Hz compact tesla transformer with PFL

International Nuclear Information System (INIS)

Kang Qiang; Chang Anbi; Li Mingjia; Meng Fanbao; Su Youbin

2006-01-01

The theory and characteristics of a compact Tesla transformer are introduced, and a unitized configuration design is performed for a 1.0 MV, 100 Hz Tesla transformer and a 40 Ω, 40 ns pulse forming line (PFL). Two coaxial open cores in the Tesla transformer serve as the inner and outer conductors of the PFL, so that a traditional PFL is combined with the Tesla transformer and the pulse generator can be smaller, more efficient, and more stable. The designed compact Tesla transformer, employed in the electron beam accelerator CHP01, can charge a 600 pF PFL to 1.3 MV in a single shot and maintain 1.15 MV at a 100 Hz repetition rate. Furthermore, a continuous run of 5 seconds is achieved by the Tesla transformer at rated voltage and frequency. (authors)
TH-A-19A-12: A GPU-Accelerated and Monte Carlo-Based Intensity Modulated Proton Therapy Optimization System

Energy Technology Data Exchange (ETDEWEB)

Ma, J; Wan Chan Tseung, H; Beltran, C [Mayo Clinic, Rochester, MN (United States)

2014-06-15

Purpose: To develop a clinically applicable intensity modulated proton therapy (IMPT) optimization system that utilizes more accurate Monte Carlo (MC) dose calculation, rather than analytical dose calculation. Methods: A very fast in-house graphics processing unit (GPU) based MC dose calculation engine was employed to generate the dose influence map for each proton spot. With the MC generated influence map, a modified gradient based optimization method was used to achieve the desired dose volume histograms (DVH). The intrinsic CT image resolution was adopted for voxelization in simulation and optimization to preserve the spatial resolution. The optimizations were computed on a multi-GPU framework to mitigate the memory limitation issues for the large dose influence maps that result from maintaining the intrinsic CT resolution and large number of proton spots. The dose effects were studied particularly in cases with heterogeneous materials in comparison with the commercial treatment planning system (TPS). Results: For a relatively large and complex three-field bi-lateral head and neck case (i.e. >100K spots with a target volume of ∼1000 cc and multiple surrounding critical structures), the optimization together with the initial MC dose influence map calculation can be done in a clinically viable time frame (i.e. less than 15 minutes) on a GPU cluster consisting of 24 Nvidia GeForce GTX Titan cards. The DVHs of the MC TPS plan compare favorably with those of a commercial treatment planning system. Conclusion: A GPU accelerated and MC-based IMPT optimization system was developed. The dose calculation and plan optimization can be performed in less than 15 minutes on a hardware system costing less than 45,000 dollars. The fast calculation and optimization makes the system easily expandable to robust and multi-criteria optimization. This work was funded in part by a grant from Varian Medical Systems, Inc.

TH-A-19A-12: A GPU-Accelerated and Monte Carlo-Based Intensity Modulated Proton Therapy Optimization System

International Nuclear Information System (INIS)

Ma, J; Wan Chan Tseung, H; Beltran, C

2014-01-01

Purpose: To develop a clinically applicable intensity modulated proton therapy (IMPT) optimization system that utilizes more accurate Monte Carlo (MC) dose calculation, rather than analytical dose calculation. Methods: A very fast in-house graphics processing unit (GPU) based MC dose calculation engine was employed to generate the dose influence map for each proton spot. With the MC generated influence map, a modified gradient based optimization method was used to achieve the desired dose volume histograms (DVH). The intrinsic CT image resolution was adopted for voxelization in simulation and optimization to preserve the spatial resolution. The optimizations were computed on a multi-GPU framework to mitigate the memory limitation issues for the large dose influence maps that result from maintaining the intrinsic CT resolution and large number of proton spots. The dose effects were studied particularly in cases with heterogeneous materials in comparison with the commercial treatment planning system (TPS). Results: For a relatively large and complex three-field bi-lateral head and neck case (i.e. >100K spots with a target volume of ∼1000 cc and multiple surrounding critical structures), the optimization together with the initial MC dose influence map calculation can be done in a clinically viable time frame (i.e. less than 15 minutes) on a GPU cluster consisting of 24 Nvidia GeForce GTX Titan cards. The DVHs of the MC TPS plan compare favorably with those of a commercial treatment planning system. Conclusion: A GPU accelerated and MC-based IMPT optimization system was developed. The dose calculation and plan optimization can be performed in less than 15 minutes on a hardware system costing less than 45,000 dollars. The fast calculation and optimization makes the system easily expandable to robust and multi-criteria optimization. This work was funded in part by a grant from Varian Medical Systems, Inc.
Nikola Tesla and robotics

Directory of Open Access Journals (Sweden)

Vukobratović Miomir

2006-01-01

The paper analyzes some of Tesla's works and his most remarkable views concerning the problem of formulating theoretical bases of automatic control. As a tribute to Tesla's work on remote control of automated systems, as well as to his (at the time far-seeing) visions, special attention is paid to solving the complex problem of control and feedback application. A more detailed discussion of the way and origin of formulating theoretical bases of automatic control is given. In addition, the related pioneering works of Professor Nicholas Bernstein, the great Russian physiologist who formulated the basic rules of the self-regulating movements of man, are presented in more detail. Bernstein's achievements are of the highest scientific significance, and they bear directly on identifying and proving the priority of his pioneering contributions in the domain of feedback, i.e. control and principles of cybernetics.
Collision detection of convex polyhedra on the NVIDIA GPU architecture for the discrete element method

CSIR Research Space (South Africa)

Govender, Nicolin

2015-09-01

consideration due to the architectural differences between CPU and GPU platforms. This paper describes the DEM algorithms and heuristics that are optimized for the parallel NVIDIA Kepler GPU architecture in detail. This includes a GPU optimized collision...
SU-E-J-60: Efficient Monte Carlo Dose Calculation On CPU-GPU Heterogeneous Systems

Energy Technology Data Exchange (ETDEWEB)

Xiao, K; Chen, D. Z; Hu, X. S [University of Notre Dame, Notre Dame, IN (United States); Zhou, B [Altera Corp., San Jose, CA (United States)

2014-06-01

Purpose: It is well-known that the performance of GPU-based Monte Carlo dose calculation implementations is bounded by memory bandwidth. One major cause of this bottleneck is the random memory writing patterns in dose deposition, which leads to several memory efficiency issues on GPU such as un-coalesced writing and atomic operations. We propose a new method to alleviate such issues on CPU-GPU heterogeneous systems, which achieves overall performance improvement for Monte Carlo dose calculation. Methods: Dose deposition is to accumulate dose into the voxels of a dose volume along the trajectories of radiation rays. Our idea is to partition this procedure into the following three steps, which are fine-tuned for CPU or GPU: (1) each GPU thread writes dose results with location information to a buffer on GPU memory, which achieves fully-coalesced and atomic-free memory transactions; (2) the dose results in the buffer are transferred to CPU memory; (3) the dose volume is constructed from the dose buffer on CPU. We organize the processing of all radiation rays into streams. Since the steps within a stream use different hardware resources (i.e., GPU, DMA, CPU), we can overlap the execution of these steps for different streams by pipelining. Results: We evaluated our method using a Monte Carlo Convolution Superposition (MCCS) program and tested our implementation for various clinical cases on a heterogeneous system containing an Intel i7 quad-core CPU and an NVIDIA TITAN GPU. Comparing with a straightforward MCCS implementation on the same system (using both CPU and GPU for radiation ray tracing), our method gained 2-5X speedup without losing dose calculation accuracy. Conclusion: The results show that our new method improves the effective memory bandwidth and overall performance for MCCS on the CPU-GPU systems. Our proposed method can also be applied to accelerate other Monte Carlo dose calculation approaches. This research was supported in part by NSF under Grants CCF
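The three-step partition described in this abstract (buffered writes, transfer, CPU-side accumulation) can be sketched in plain Python. This is an illustrative toy under stated assumptions, not the MCCS implementation: real GPU threads, DMA transfers and pipelined overlap are replaced by sequential function calls, and the names `deposit_to_buffer`, `transfer_to_host`, `accumulate_on_cpu`, `dose_volume` and the `stream_size` parameter are all hypothetical.

```python
def deposit_to_buffer(rays):
    """Step 1 (GPU side in the abstract): each ray appends its own
    (voxel index, dose) records to a buffer. No two rays write to the same
    slot, so no atomic operations would be needed."""
    buffer = []
    for ray in rays:
        buffer.extend(ray)
    return buffer

def transfer_to_host(buffer):
    """Step 2: stand-in for the device-to-host copy."""
    return list(buffer)

def accumulate_on_cpu(buffer, volume):
    """Step 3 (CPU side): scatter-add the buffered records into the dose volume."""
    for voxel, dose in buffer:
        volume[voxel] += dose

def dose_volume(rays, n_voxels, stream_size=2):
    """Process rays in batches ("streams"). In the abstract the three steps of
    different streams overlap by pipelining; here they simply run in order."""
    volume = [0.0] * n_voxels
    for start in range(0, len(rays), stream_size):
        stream = rays[start:start + stream_size]
        host_buffer = transfer_to_host(deposit_to_buffer(stream))
        accumulate_on_cpu(host_buffer, volume)
    return volume

# Three rays depositing dose along their paths through a 3-voxel volume.
rays = [
    [(0, 0.5), (1, 0.25)],
    [(1, 0.25), (2, 1.0)],
    [(0, 0.5)],
]
volume = dose_volume(rays, n_voxels=3)
print(volume)  # [1.0, 0.5, 1.0]
```

The random-access accumulation is confined to step 3, where the abstract assigns it to the CPU, while step 1 only performs sequential writes. The price is the extra buffer and the transfer in step 2, which the abstract describes hiding by overlapping streams.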
A CFD Heterogeneous Parallel Solver Based on Collaborating CPU and GPU

Science.gov (United States)

Lai, Jianqi; Tian, Zhengyu; Li, Hua; Pan, Sha

2018-03-01

Since the Graphic Processing Unit (GPU) offers strong floating-point performance and memory bandwidth for data parallelism, it has been widely used in general-purpose computing areas such as molecular dynamics (MD), computational fluid dynamics (CFD) and so on. The emergence of compute unified device architecture (CUDA), which reduces the complexity of programming, brings great opportunities to CFD. There are three different modes for parallel solution of NS equations: parallel solver based on CPU, parallel solver based on GPU and heterogeneous parallel solver based on collaborating CPU and GPU. GPUs are relatively rich in compute capacity but poor in memory capacity, while CPUs are the opposite. To make full use of both GPUs and CPUs, a CFD heterogeneous parallel solver based on collaborating CPU and GPU has been established. Three cases are presented to analyse the solver’s computational accuracy and heterogeneous parallel efficiency. The numerical results agree well with experiment results, which demonstrates that the heterogeneous parallel solver has high computational precision. The speedup on a single GPU is more than 40 for laminar flow; it decreases for turbulent flow, but can still reach more than 20. Moreover, the speedup increases as the grid size becomes larger.
Construction of 0.15 Tesla Overhauser Enhanced MRI.

Science.gov (United States)

Tokunaga, Yuumi; Nakao, Motonao; Naganuma, Tatsuya; Ichikawa, Kazuhiro

2017-01-01

Overhauser enhanced MRI (OMRI) is one of the free radical imaging technologies and has been used in biomedical research such as for partial oxygen measurements in tumor, and redox status in acute oxidative diseases. The external magnetic field of OMRI is frequently in the range of 5-10 mTesla to ensure microwave penetration into small animals, and the S/N ratio is limited. In this study, a 0.15 Tesla OMRI was constructed and tested to improve the S/N ratio for a small sample, or skin measurement. Specification of the main magnet was as follows: 0.15 Tesla permanent magnet; gap size 160 mm; homogenous spherical volume of 80 mm in diameter. The OMRI resonator was designed based on TE 101 cavity mode and machined from a phosphorus deoxidized copper block for electron spin resonance (ESR) excitation and a solenoid transmission/receive resonator for NMR detection. The resonant frequencies and Q values were 6.38 MHz/150 and 4.31-4.41 GHz/120 for NMR and ESR, respectively. The Q values were comparable to those of conventional low field OMRI resonators at 15 mTesla. As expected, the MRI S/N ratio was improved by a factor of 30. Triplet dynamic nuclear polarization spectra were observed for 14 N carboxy-PROXYL, along the excitation microwave sweep. In the current setup, the enhancement factor was ca. 0.5. In conclusion, the results of this preliminary evaluation indicate that the 0.15 Tesla OMRI could be useful for free radical measurement for small samples.
A 4 Tesla/1 meter superferric MRI magnet

International Nuclear Information System (INIS)

Schmidt, W.M.; Huson, F.R.; Mackay, W.W.; Rocha, R.M.

1991-01-01

Superferric technology was first applied to Magnetic Resonance Imaging (MRI) magnets by the Texas Accelerator Center (TAC) in 1986 with the design and construction of a 4 Tesla/30 cm magnet. In an evolutionary step, this technology is now being applied to the development of a whole body 4 Tesla/1 meter superconducting magnet. The design of such a magnet is presented in this paper.
How General-Purpose can a GPU be?

Directory of Open Access Journals (Sweden)

Philip Machanick

2015-12-01

The use of graphics processing units (GPUs) in general-purpose computation (GPGPU) is a growing field. GPU instruction sets, while implementing a graphics pipeline, draw from a range of single instruction, multiple data stream (SIMD) architectures characteristic of the heyday of supercomputers. Yet only one of these SIMD instruction sets has been applicable to a wide enough range of problems to survive the era when the full range of supercomputer design variants was being explored: vector instructions. This paper proposes a reconceptualization of the GPU as a multicore design with minimal exotic modes of parallelism so as to make GPGPU truly general.
76 FR 60124 - Tesla Motors, Inc.; Grant of Petition for Temporary Exemption From the Electronic Stability...

Science.gov (United States)

2011-09-28

...-0110] Tesla Motors, Inc.; Grant of Petition for Temporary Exemption From the Electronic Stability... notice grants the petition of Tesla Motors, Inc. (Tesla) for the temporary exemption of its Roadster... procedures in 49 CFR Part 555, Tesla Motors, Inc. (Tesla) submitted a petition dated June 7, 2011 asking the...
TeSLA e-assessment workshop

NARCIS (Netherlands)

Janssen, José

2016-01-01

Presentation for the e-assessment workshop for lecturers of the Open Universiteit Nederland involved in the first TeSLA pilot. Topics: exam fraud, test design, and technology for authentication and verification of authorship.
Technical challenges of superconductivity and cryogenics in pursuing TESLA-TTF

International Nuclear Information System (INIS)

Shu, Quan-Sheng

1996-01-01

The TESLA (TeV Energy Superconducting Linear Accelerator) Collaboration is an international R&D effort towards the development of an e+e- linear collider with 500 GeV center-of-mass energy by means of 20 km of active superconducting accelerating structures at a frequency of 1.3 GHz. The ultimate challenges faced by the TESLA project are (1) to raise operational accelerating gradients to 25 MV/m from the current world level of 5-10 MV/m, and (2) to reduce construction costs (cryomodules, klystrons, etc.) to $2,000/MV from about $40,000/MV at present. The TESLA Collaboration is building a prototype TESLA test facility (TTF), a 500 MeV superconducting linear accelerator, to establish the technical basis. TTF is presently under construction and will be commissioned at DESY in 1997, through the joint efforts of 24 laboratories from 8 countries. Significant progress has been made in reaching the high accelerating gradient of 25 MV/m in superconducting cavities, developing cryomodules and constructing TTF infrastructure, etc. This paper will briefly discuss the challenges being faced and review the progress achieved in the technical area of superconductivity and cryogenics by the TESLA Collaboration.
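The figures quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch below is a simple scaling under the assumption that the total accelerating voltage equals the 500 GeV center-of-mass energy for singly charged particles and that the cost per MV applies to that total; it ignores every overhead and is not taken from the paper.

```python
# Figures quoted in the abstract.
total_voltage_mv = 500e3         # 500 GeV expressed as 500,000 MV of acceleration
gradient_mv_per_m = 25.0         # target accelerating gradient
target_cost_per_mv = 2_000.0     # USD per MV (goal)
current_cost_per_mv = 40_000.0   # USD per MV (quoted present level)

# Active length needed at the target gradient.
active_length_km = total_voltage_mv / gradient_mv_per_m / 1000.0

# Construction cost under the two quoted rates (simple proportional scaling).
target_cost_usd = total_voltage_mv * target_cost_per_mv
current_cost_usd = total_voltage_mv * current_cost_per_mv

print(f"active length: {active_length_km:.0f} km")
print(f"cost at target rate: {target_cost_usd:.1e} USD")
print(f"cost at current rate: {current_cost_usd:.1e} USD")
```

At 25 MV/m the active length comes out at 20 km, which matches the length stated in the abstract, and the two cost rates differ by a factor of 20.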
Acceleration for 2D time-domain elastic full waveform inversion using a single GPU card

Science.gov (United States)

Jiang, Jinpeng; Zhu, Peimin

2018-05-01

Full waveform inversion (FWI) is a challenging procedure due to the high computational cost related to the modeling, especially for the elastic case. The graphics processing unit (GPU) has become a popular device for the high-performance computing (HPC). To reduce the long computation time, we design and implement the GPU-based 2D elastic FWI (EFWI) in time domain using a single GPU card. We parallelize the forward modeling and gradient calculations using the CUDA programming language. To overcome the limitation of relatively small global memory on GPU, the boundary saving strategy is exploited to reconstruct the forward wavefield. Moreover, the L-BFGS optimization method used in the inversion increases the convergence of the misfit function. A multiscale inversion strategy is performed in the workflow to obtain the accurate inversion results. In our tests, the GPU-based implementations using a single GPU device achieve >15 times speedup in forward modeling, and about 12 times speedup in gradient calculation, compared with the eight-core CPU implementations optimized by OpenMP. The test results from the GPU implementations are verified to have enough accuracy by comparing the results obtained from the CPU implementations.
Implementation and optimization of ultrasound signal processing algorithms on mobile GPU

Science.gov (United States)

Kong, Woo Kyu; Lee, Wooyoul; Kim, Kyu Cheol; Yoo, Yangmo; Song, Tai-Kyong

2014-03-01

A general-purpose graphics processing unit (GPGPU) has been used for improving computing power in medical ultrasound imaging systems. Recently, mobile GPUs have become powerful enough to handle 3D games and videos at high frame rates on Full HD or HD resolution displays. This paper proposes a method to implement ultrasound signal processing on a mobile GPU available in a high-end smartphone (Galaxy S4, Samsung Electronics, Seoul, Korea) with programmable shaders on the OpenGL ES 2.0 platform. To maximize the performance of the mobile GPU, the shader design and the load sharing between vertex and fragment shaders were optimized. The beamformed data were captured from a tissue mimicking phantom (Model 539 Multipurpose Phantom, ATS Laboratories, Inc., Bridgeport, CT, USA) by using a commercial ultrasound imaging system equipped with a research package (Ultrasonix Touch, Ultrasonix, Richmond, BC, Canada). The real-time performance is evaluated by frame rates while varying the range of signal processing blocks. The implementation of ultrasound signal processing on OpenGL ES 2.0 was verified by analyzing PSNR against a MATLAB gold standard that has the same signal path. CNR was also analyzed to verify the method. From the evaluations, the proposed mobile GPU-based processing method has no significant difference with the processing using MATLAB (i.e., PSNRe., 11.31). From the mobile GPU implementation, frame rates of 57.6 Hz were achieved. The total execution time was 17.4 ms, which was faster than the acquisition time (i.e., 34.4 ms). These results indicate that the mobile GPU-based processing method can support real-time ultrasound B-mode processing on the smartphone.
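The two checks used in this abstract, PSNR against a reference and processing time against acquisition time, can be sketched as follows. This is a generic textbook PSNR on flattened pixel lists, assuming an 8-bit peak value of 255; the function names are hypothetical and the code is not the authors' evaluation script.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], test: Sequence[float], peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    if len(reference) != len(test) or not reference:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


def is_real_time(processing_ms: float, acquisition_ms: float) -> bool:
    """Processing keeps up if a frame is processed faster than it is acquired."""
    return processing_ms < acquisition_ms


# Timing figures quoted in the abstract.
keeps_up = is_real_time(processing_ms=17.4, acquisition_ms=34.4)
```

A higher PSNR means the GPU output is closer to the reference; identical outputs give an infinite value.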
MODERN ELECTRIC CARS OF TESLA MOTORS COMPANY

OpenAIRE

O. F. Vynakov; E. V. Savolova; A. I. Skrynnyk

2016-01-01

This overview article shows the advantages of a modern electric car compared with internal combustion cars, using the electric vehicles of Tesla Motors Company as an example. It describes the history of this firm and provides technical and tactical characteristics of three modifications of electric vehicles produced by Tesla Motors. Modern electric cars are no less powerful than cars with combustion engines in both speed and acceleration. They are reliable, economical ...
Multi-GPU configuration of 4D intensity modulated radiation therapy inverse planning using global optimization

Science.gov (United States)

Hagan, Aaron; Sawant, Amit; Folkerts, Michael; Modiri, Arezoo

2018-01-01

We report on the design, implementation and characterization of a multi-graphic processing unit (GPU) computational platform for higher-order optimization in radiotherapy treatment planning. In collaboration with a commercial vendor (Varian Medical Systems, Palo Alto, CA), a research prototype GPU-enabled Eclipse (V13.6) workstation was configured. The hardware consisted of dual 8-core Xeon processors, 256 GB RAM and four NVIDIA Tesla K80 general purpose GPUs. We demonstrate the utility of this platform for large radiotherapy optimization problems through the development and characterization of a parallelized particle swarm optimization (PSO) four dimensional (4D) intensity modulated radiation therapy (IMRT) technique. The PSO engine was coupled to the Eclipse treatment planning system via a vendor-provided scripting interface. Specific challenges addressed in this implementation were (i) data management and (ii) non-uniform memory access (NUMA). For the former, we alternated between parameters over which the computation process was parallelized. For the latter, we reduced the amount of data required to be transferred over the NUMA bridge. The datasets examined in this study were approximately 300 GB in size, including 4D computed tomography images, anatomical structure contours and dose deposition matrices. For evaluation, we created a 4D-IMRT treatment plan for one lung cancer patient and analyzed computation speed while varying several parameters (number of respiratory phases, GPUs, PSO particles, and data matrix sizes). The optimized 4D-IMRT plan enhanced sparing of organs at risk by an average reduction of 26% in maximum dose, compared to the clinical optimized IMRT plan, where the internal target volume was used. We validated our computation time analyses in two additional cases. The computation speed in our implementation did not monotonically increase with the number of GPUs. The optimal number of GPUs (five, in our study) is directly related to the
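For readers unfamiliar with particle swarm optimization, the following is a minimal textbook global-best PSO on a toy objective. It is not the 4D-IMRT optimizer described above: the inertia and acceleration coefficients are conventional illustrative choices, the `sphere` objective is a stand-in, and all names are assumptions. In a multi-GPU setting, the per-particle objective evaluations are the natural unit of parallel work.

```python
import random
from typing import Callable, List, Tuple


def pso(objective: Callable[[List[float]], float], dim: int, n_particles: int = 30,
        iterations: int = 200, bounds: Tuple[float, float] = (-5.0, 5.0),
        inertia: float = 0.7, c_personal: float = 1.5, c_global: float = 1.5,
        seed: int = 0) -> Tuple[List[float], float]:
    """Minimise `objective` with a basic global-best particle swarm."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]               # personal best positions
    best_val = [objective(p) for p in pos]       # personal best values
    g = min(range(n_particles), key=best_val.__getitem__)
    g_pos, g_val = best_pos[g][:], best_val[g]   # swarm-wide best

    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (best_pos[i][d] - pos[i][d])
                             + c_global * r2 * (g_pos[d] - pos[i][d]))
                # Move and clamp to the search bounds.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], val
                if val < g_val:
                    g_pos, g_val = pos[i][:], val
    return g_pos, g_val


def sphere(x: List[float]) -> float:
    """Toy objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


best_x, best_f = pso(sphere, dim=2)
```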
GPU-based parallel algorithm for blind image restoration using midfrequency-based methods

Science.gov (United States)

Xie, Lang; Luo, Yi-han; Bao, Qi-liang

2013-08-01

GPU-based general-purpose computing is a new branch of modern parallel computing, so the study of parallel algorithms specially designed for GPU hardware architecture is of great significance. In order to solve the problem of high computational complexity and poor real-time performance in blind image restoration, the midfrequency-based algorithm for blind image restoration was analyzed and improved in this paper. Furthermore, a midfrequency-based filtering method is also used to restore the image with hardly any recursion or iteration. Combining the algorithm with data intensiveness, data parallel computing and the GPU execution model of single instruction, multiple threads, a new parallel midfrequency-based algorithm for blind image restoration is proposed in this paper, which is suitable for stream computing on the GPU. In this algorithm, the GPU is utilized to accelerate the estimation of class-G point spread functions and midfrequency-based filtering. To manage the GPU threads better, the threads in a grid are scheduled according to the decomposition of the filtering data in the frequency domain after the optimization of data access and the communication between the host and the device. The kernel parallelism structure is determined by the decomposition of the filtering data to ensure the transmission rate and to get around the memory bandwidth limitation. The results show that, with the new algorithm, the operational speed is significantly increased and the real-time performance of image restoration is effectively improved, especially for high-resolution images.
Improving GPU-accelerated adaptive IDW interpolation algorithm using fast kNN search.

Science.gov (United States)

Mei, Gang; Xu, Nengxiong; Xu, Liangliang

2016-01-01

This paper presents an efficient parallel Adaptive Inverse Distance Weighting (AIDW) interpolation algorithm on modern Graphics Processing Unit (GPU). The presented algorithm is an improvement of our previous GPU-accelerated AIDW algorithm by adopting fast k-nearest neighbors (kNN) search. In AIDW, it needs to find several nearest neighboring data points for each interpolated point to adaptively determine the power parameter; and then the desired prediction value of the interpolated point is obtained by weighted interpolating using the power parameter. In this work, we develop a fast kNN search approach based on the space-partitioning data structure, even grid, to improve the previous GPU-accelerated AIDW algorithm. The improved algorithm is composed of the stages of kNN search and weighted interpolating. To evaluate the performance of the improved algorithm, we perform five groups of experimental tests. The experimental results indicate: (1) the improved algorithm can achieve a speedup of up to 1017 over the corresponding serial algorithm; (2) the improved algorithm is at least two times faster than our previous GPU-accelerated AIDW algorithm; and (3) the utilization of fast kNN search can significantly improve the computational efficiency of the entire GPU-accelerated AIDW algorithm.
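The two stages named in the abstract, kNN search followed by weighted interpolation, can be sketched serially in Python. This sketch makes two simplifications that should be read as assumptions: the neighbour search is brute force rather than the even-grid search of the paper, and the power parameter is passed in as a fixed value rather than being determined adaptively from the neighbours.

```python
import heapq
import math
from typing import List, Tuple

Point = Tuple[float, float, float]  # (x, y, value)


def knn(points: List[Point], x: float, y: float, k: int) -> List[Tuple[float, float]]:
    """Return (distance, value) for the k nearest data points (brute force)."""
    dists = ((math.hypot(px - x, py - y), v) for px, py, v in points)
    return heapq.nsmallest(k, dists)


def idw(points: List[Point], x: float, y: float, k: int = 3, power: float = 2.0) -> float:
    """Inverse-distance-weighted prediction from the k nearest neighbours."""
    neighbours = knn(points, x, y, k)
    for d, v in neighbours:
        if d == 0.0:
            return v  # query coincides with a data point
    weights = [1.0 / d ** power for d, _ in neighbours]
    return sum(w * v for w, (_, v) in zip(weights, neighbours)) / sum(weights)


# Hypothetical data: three nearby samples and one distant sample.
data = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0), (0.0, 2.0, 30.0), (9.0, 9.0, 99.0)]
centre_value = idw(data, 1.0, 1.0, k=3)
```

Each interpolated point is independent of the others, which is what makes the method a good fit for one-thread-per-point GPU execution.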
GPU-Based Point Cloud Superpositioning for Structural Comparisons of Protein Binding Sites.

Science.gov (United States)

Leinweber, Matthias; Fober, Thomas; Freisleben, Bernd

2018-01-01

In this paper, we present a novel approach to solve the labeled point cloud superpositioning problem for performing structural comparisons of protein binding sites. The solution is based on a parallel evolution strategy that operates on large populations and runs on GPU hardware. The proposed evolution strategy reduces the likelihood of getting stuck in a local optimum of the multimodal real-valued optimization problem represented by labeled point cloud superpositioning. The performance of the GPU-based parallel evolution strategy is compared to a previously proposed CPU-based sequential approach for labeled point cloud superpositioning, indicating that the GPU-based parallel evolution strategy leads to qualitatively better results and significantly shorter runtimes, with speed improvements of up to a factor of 1,500 for large populations. Binary classification tests based on the ATP, NADH, and FAD protein subsets of CavBase, a database containing putative binding sites, show average classification rate improvements from about 92 percent (CPU) to 96 percent (GPU). Further experiments indicate that the proposed GPU-based labeled point cloud superpositioning approach can be superior to traditional protein comparison approaches based on sequence alignments.
Nikola Tesla: the Moon's rotation.

Science.gov (United States)

Tomić, A.; Jovanović, B. S.

1993-09-01

The review of three articles by N. Tesla, published in the year 1919 in the journal "Electrical experimenter" is given, with special reference to the astronomical contents and to circumstances in which they appeared.
GPU accelerated CT reconstruction for clinical use: quality driven performance

Science.gov (United States)

Vaz, Michael S.; Sneyders, Yuri; McLin, Matthew; Ricker, Alan; Kimpe, Tom

2007-03-01

We present performance and quality analysis of GPU accelerated FDK filtered backprojection for cone beam computed tomography (CBCT) reconstruction. Our implementation of the FDK CT reconstruction algorithm does not compromise fidelity at any stage and yields a result that is within 1 HU of a reference C++ implementation. Our streaming implementation is able to perform reconstruction as the images are acquired; it addresses low latency as well as fast throughput, which are key considerations for a "real-time" design. Further, it is scalable to multiple GPUs for increased performance. The implementation does not place any constraints on image acquisition; it works effectively for arbitrary angular coverage with arbitrary angular spacing. As such, this GPU accelerated CT reconstruction solution may easily be used with scanners that are already deployed. We are able to reconstruct a 512 x 512 x 340 volume from 625 projections, each sized 1024 x 768, in less than 50 seconds. The quoted 50 second timing encompasses the entire reconstruction using bilinear interpolation and includes filtering on the CPU, uploading the filtered projections to the GPU, and also downloading the reconstructed volume from GPU memory to system RAM.
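The quoted timing can be turned into a rough throughput figure. The calculation below assumes that every projection contributes to every voxel, which is the usual cost model for backprojection but is an assumption here; because the 50 seconds is an upper bound that also covers CPU filtering and transfers, the resulting rates are lower bounds.

```python
# Figures quoted in the abstract.
voxels = 512 * 512 * 340        # reconstructed volume
projections = 625               # each 1024 x 768 pixels
seconds = 50.0                  # quoted upper bound on total time

# One voxel update per (voxel, projection) pair under the stated assumption.
voxel_updates = voxels * projections
updates_per_second = voxel_updates / seconds
projections_per_second = projections / seconds

print(f"voxel updates: {voxel_updates:,}")
print(f"updates per second (lower bound): {updates_per_second:.3e}")
print(f"projections per second (lower bound): {projections_per_second}")
```

This works out to roughly 1.1 billion voxel updates per second and 12.5 projections per second.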

GPU-based Scalable Volumetric Reconstruction for Multi-view Stereo

Energy Technology Data Exchange (ETDEWEB)

Kim, H; Duchaineau, M; Max, N

2011-09-21

We present a new scalable volumetric reconstruction algorithm for multi-view stereo using a graphics processing unit (GPU). It is an effectively parallelized GPU algorithm that simultaneously uses a large number of GPU threads, each of which performs voxel carving, in order to integrate depth maps with images from multiple views. Each depth map, triangulated from pair-wise semi-dense correspondences, represents a view-dependent surface of the scene. This algorithm also provides scalability for large-scale scene reconstruction in a high resolution voxel grid by utilizing streaming and parallel computation. The output is a photo-realistic 3D scene model in a volumetric or point-based representation. We demonstrate the effectiveness and the speed of our algorithm with a synthetic scene and real urban/outdoor scenes. Our method can also be integrated with existing multi-view stereo algorithms such as PMVS2 to fill holes or gaps in textureless regions.
Fast Simulation of Dynamic Ultrasound Images Using the GPU.

Science.gov (United States)

Storve, Sigurd; Torp, Hans

2017-10-01

Simulated ultrasound data is a valuable tool for development and validation of quantitative image analysis methods in echocardiography. Unfortunately, simulation time can become prohibitive for phantoms consisting of a large number of point scatterers. The COLE algorithm by Gao et al. is a fast convolution-based simulator that trades simulation accuracy for improved speed. We present highly efficient parallelized CPU and GPU implementations of the COLE algorithm with an emphasis on dynamic simulations involving moving point scatterers. We argue that it is crucial to minimize the amount of data transfers from the CPU to achieve good performance on the GPU. We achieve this by storing the complete trajectories of the dynamic point scatterers as spline curves in the GPU memory. This leads to good efficiency when simulating sequences consisting of a large number of frames, such as B-mode and tissue Doppler data for a full cardiac cycle. In addition, we propose a phase-based subsample delay technique that efficiently eliminates flickering artifacts seen in B-mode sequences when COLE is used without enough temporal oversampling. To assess the performance, we used a laptop computer and a desktop computer, each equipped with a multicore Intel CPU and an NVIDIA GPU. Running the simulator on a high-end TITAN X GPU, we observed two orders of magnitude speedup compared to the parallel CPU version, three orders of magnitude speedup compared to simulation times reported by Gao et al. in their paper on COLE, and a speedup of 27000 times compared to the multithreaded version of Field II, using numbers reported in a paper by Jensen. We hope that by releasing the simulator as an open-source project we will encourage its use and further development.
Multicore and GPU algorithms for Nussinov RNA folding

Science.gov (United States)

2014-01-01

Background One segment of a RNA sequence might be paired with another segment of the same RNA sequence due to the force of hydrogen bonds. This two-dimensional structure is called the RNA sequence's secondary structure. Several algorithms have been proposed to predict an RNA sequence's secondary structure. These algorithms are referred to as RNA folding algorithms. Results We develop cache efficient, multicore, and GPU algorithms for RNA folding using Nussinov's algorithm. Conclusions Our cache efficient algorithm provides a speedup between 1.6 and 3.0 relative to a naive straightforward single core code. The multicore version of the cache efficient single core algorithm provides a speedup, relative to the naive single core algorithm, between 7.5 and 14.0 on a 6 core hyperthreaded CPU. Our GPU algorithm for the NVIDIA C2050 is up to 1582 times as fast as the naive single core algorithm and between 5.1 and 11.2 times as fast as the fastest previously known GPU algorithm for Nussinov RNA folding. PMID:25082539
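For reference, the baseline that the abstract calls the naive single core algorithm is a dynamic program over subsequences. The sketch below is a plain Python version of the standard Nussinov recurrence restricted to Watson-Crick pairs; the `min_loop` parameter and the pair set are illustrative choices, and none of the cache, multicore, or GPU optimizations of the paper are shown.

```python
from typing import List

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def nussinov(seq: str, min_loop: int = 0) -> int:
    """Maximum number of nested base pairs in `seq` (Nussinov recurrence).

    A pair (i, j) is allowed only if j - i > min_loop.
    """
    n = len(seq)
    if n < 2:
        return 0
    score: List[List[int]] = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):           # fill diagonal by diagonal
        for i in range(n - span):
            j = i + span
            best = max(score[i + 1][j], score[i][j - 1])    # i or j unpaired
            if (seq[i], seq[j]) in PAIRS:
                inner = score[i + 1][j - 1] if span > 1 else 0
                best = max(best, inner + 1)                 # i pairs with j
            for k in range(i + 1, j):                       # bifurcation
                best = max(best, score[i][k] + score[k + 1][j])
            score[i][j] = best
    return score[0][n - 1]


pairs_found = nussinov("GGGAAAUCC")
```

Cells on the same diagonal depend only on shorter spans, so they can be computed independently of one another; this is the property that parallel versions of the algorithm typically exploit.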
Development of High-speed Visualization System of Hypocenter Data Using CUDA-based GPU computing

Science.gov (United States)

Kumagai, T.; Okubo, K.; Uchida, N.; Matsuzawa, T.; Kawada, N.; Takeuchi, N.

2014-12-01

Since the Great East Japan Earthquake on March 11, 2011, intelligent visualization of seismic information has become important for understanding earthquake phenomena. At the same time, the quantity of seismic data has become enormous with the progress of high-accuracy observation networks; many parameters (e.g., positional information, origin time, magnitude, etc.) must be handled to display the seismic information efficiently. Therefore, high-speed processing of data and image information is necessary to handle enormous amounts of seismic data. Recently, the GPU (Graphic Processing Unit) has been used as an acceleration tool for data processing and calculation in various fields of study. This movement is called GPGPU (General Purpose computing on GPUs). In the last few years the performance of GPUs has kept improving rapidly. GPU computing provides a high-performance computing environment at a lower cost than before. Moreover, using a GPU is advantageous for visualizing processed data, because the GPU was originally designed for graphics processing. In GPU computing, the processed data is always stored in the video memory. Therefore, we can directly write drawing information to the VRAM on the video card by combining CUDA and the graphics API. In this study, we employ CUDA and OpenGL and/or DirectX to realize a full-GPU implementation. This method makes it possible to write drawing information to the VRAM on the video card without PCIe bus data transfer, which enables high-speed processing of seismic data. The present study examines GPU computing-based high-speed visualization and the feasibility of a high-speed visualization system for hypocenter data.
Cortical microinfarcts detected in vivo on 3 tesla MRI : Clinical and radiological correlates

NARCIS (Netherlands)

Van Dalen, Jan Willem; Scuric, Eva E M; Van Veluw, Susanne J.; Caan, Matthan W A; Nederveen, Aart J.; Biessels, Geert Jan; Van Gool, Willem A.; Richard, Edo

2015-01-01

Background and Purpose-Cortical microinfarcts (CMIs) are a common postmortem finding associated with vascular risk factors, cognitive decline, and dementia. Recently, CMIs identified in vivo on 7 Tesla MRI also proved retraceable on 3 Tesla MRI. Methods-We evaluated CMIs on 3 Tesla MRI in a
Cortical microinfarcts detected in vivo on 3 Tesla MRI: clinical and radiological correlates

NARCIS (Netherlands)

Dalen, J.W. van; Scuric, E.E.; Veluw, S.J. van; Caan, M.W.; Nederveen, A.J.; Biessels, G.J.; Gool, W.A. van; Richard, E.

2015-01-01

BACKGROUND AND PURPOSE: Cortical microinfarcts (CMIs) are a common postmortem finding associated with vascular risk factors, cognitive decline, and dementia. Recently, CMIs identified in vivo on 7 Tesla MRI also proved retraceable on 3 Tesla MRI. METHODS: We evaluated CMIs on 3 Tesla MRI in a
GPU-accelerated back-projection revisited. Squeezing performance by careful tuning

Energy Technology Data Exchange (ETDEWEB)

Papenhausen, Eric; Zheng, Ziyi; Mueller, Klaus [Stony Brook Univ., NY (United States). Computer Science Dept.

2011-07-01

In recent years, GPUs have become an increasingly popular tool in computed tomography (CT) reconstruction. In this paper, we discuss performance optimization techniques for a GPU-based filtered-backprojection reconstruction implementation. We explore the different optimization techniques we used and explain how those techniques affected performance. Our results show a nearly 50% increase in performance when compared to the current top ranked GPU implementation. (orig.)
Undulator systems for the TESLA X-FEL

International Nuclear Information System (INIS)

Pflueger, J.; Tischer, M.

2002-01-01

A large X-ray FEL lab is under consideration within the TESLA project and is supposed to be operated in parallel with the TESLA linear collider. There will be five SASE FELs and five conventional spontaneous undulators. A conceptual design study has been made for the undulator systems for these X-FELs. It includes segmentation into 6.1 m long undulator 'cells'. Each consists of a 5 m long undulator 'segment', a separate quadrupole, one horizontal and one vertical corrector, and a phase shifter. These items are presented and discussed
Energy Spread Sources in TESLA and TTF

International Nuclear Information System (INIS)

Mosnier, A.; Tessier, J.M.

1995-03-01

The beam energy spread in the TESLA linac must be small enough to limit the emittance dilution due to the dispersive effects. This report summarizes the major sources of energy spread both for the TESLA linac and the TTF linac, where these estimations will be carefully checked with beam experiments. The first part recalls the intra-bunch energy spread while the second part looks into the bunch-to-bunch energy spread induced by rf field fluctuations within the bunch train and from pulse-to-pulse. (author). 3 refs., 4 figs
Radiation protection systems on the TESLA Accelerator Installation

International Nuclear Information System (INIS)

Pavlovic, R.

1996-01-01

In the Institute of Nuclear Sciences VINCA, the Accelerator Installation TESLA, a medium energy ion accelerator facility consisting of an isochronous cyclotron VINCY, a heavy ion source, a D/H ion source, three low energy and five high energy experimental channels, is now under construction. Some problems in defining the radiation protection and safety programme, particularly problems in constructing appropriate shielding barriers at the Accelerator Installation TESLA, are discussed in this paper. (author)
Superconducting magnet package for the TESLA test facility

International Nuclear Information System (INIS)

Koski, A.; Bandelmann, R.; Wolff, S.

1996-01-01

The magnetic lattice of the TeV electron superconducting linear accelerator (TESLA) will consist of superconducting quadrupoles for beam focusing and superconducting correction dipoles for beam steering, incorporated in the cryostats containing the superconducting cavities. This report describes the design of these magnets, presenting details of the magnetic as well as the mechanical design. The measured characteristics of the TESLA Test Facility (TTF) quadrupoles and dipoles are compared to the results obtained from numerical computations
Ramses-GPU: Second order MUSCL-Handcock finite volume fluid solver

Science.gov (United States)

Kestener, Pierre

2017-10-01

RamsesGPU is a reimplementation of RAMSES (ascl:1011.007) which drops the adaptive mesh refinement (AMR) features to optimize 3D uniform grid algorithms for modern graphics processor units (GPU) to provide an efficient software package for astrophysics applications that do not need AMR features but do require a very large number of integration time steps. RamsesGPU provides an very efficient C++/CUDA/MPI software implementation of a second order MUSCL-Handcock finite volume fluid solver for compressible hydrodynamics as a magnetohydrodynamics solver based on the constraint transport technique. Other useful modules includes static gravity, dissipative terms (viscosity, resistivity), and forcing source term for turbulence studies, and special care was taken to enhance parallel input/output performance by using state-of-the-art libraries such as HDF5 and parallel-netcdf.
A GPU-paralleled implementation of an enhanced face recognition algorithm

Science.gov (United States)

Chen, Hao; Liu, Xiyang; Shao, Shuai; Zan, Jiguo

2013-03-01

Face recognition based on compressed sensing and sparse representation has been hotly debated in recent years. This scheme increases the recognition rate as well as the anti-noise capability. However, its computational cost is high and has become a main restricting factor for real-world applications. In this paper, we introduce a GPU-accelerated hybrid variant of the face recognition algorithm named parallel face recognition algorithm (pFRA). We describe how to carry out a parallel optimization design that takes full advantage of the many-core structure of a GPU. The pFRA is tested and compared with several other implementations under different data sample sizes. Finally, our pFRA, implemented with an NVIDIA GPU and the Compute Unified Device Architecture (CUDA) programming model, achieves a significant speedup over the traditional CPU implementations.
Graphics processing unit (GPU) real-time infrared scene generation

Science.gov (United States)

Christie, Chad L.; Gouthas, Efthimios (Themie); Williams, Owen M.

2007-04-01

VIRSuite, the GPU-based suite of software tools developed at DSTO for real-time infrared scene generation, is described. The tools include the painting of scene objects with radiometrically-associated colours, translucent object generation, polar plot validation and versatile scene generation. Special features include radiometric scaling within the GPU and the presence of zoom anti-aliasing at the core of VIRSuite. Extension of the zoom anti-aliasing construct to cover target embedding and the treatment of translucent objects is described.
SU-E-T-395: Multi-GPU-Based VMAT Treatment Plan Optimization Using a Column-Generation Approach

International Nuclear Information System (INIS)

Tian, Z; Shi, F; Jia, X; Jiang, S; Peng, F

2014-01-01

Purpose: GPU has been employed to speed up VMAT optimizations from hours to minutes. However, its limited memory capacity makes it difficult to handle cases with a huge dose-deposition-coefficient (DDC) matrix, e.g. those with a large target size, multiple arcs, small beam angle intervals and/or small beamlet size. We propose multi-GPU-based VMAT optimization to solve this memory issue to make GPU-based VMAT more practical for clinical use. Methods: Our column-generation-based method generates apertures sequentially by iteratively searching for an optimal feasible aperture (referred to as the pricing problem, PP) and optimizing aperture intensities (referred to as the master problem, MP). The PP requires access to the large DDC matrix, which is implemented on a multi-GPU system. Each GPU stores a DDC sub-matrix corresponding to one fraction of beam angles and is only responsible for calculation related to those angles. Broadcast and parallel reduction schemes are adopted for inter-GPU data transfer. MP is a relatively small-scale problem and is implemented on one GPU. One head-and-neck cancer case was used for testing. Three different strategies for VMAT optimization on a single GPU were also implemented for comparison: (S1) truncating the DDC matrix to ignore its small value entries for optimization; (S2) transferring the DDC matrix part by part to the GPU during optimization whenever needed; (S3) moving DDC matrix related calculation onto the CPU. Results: Our multi-GPU-based implementation reaches a good plan within 1 minute. Although S1 was 10 seconds faster than our method, the obtained plan quality is worse. Both S2 and S3 handle the full DDC matrix and hence yield the same plan as in our method. However, the computation time is longer, namely 4 minutes and 30 minutes, respectively. Conclusion: Our multi-GPU-based VMAT optimization can effectively solve the limited memory issue with good plan quality and high efficiency, making GPU-based ultra-fast VMAT planning practical for real clinical use.
SU-E-T-395: Multi-GPU-Based VMAT Treatment Plan Optimization Using a Column-Generation Approach

Energy Technology Data Exchange (ETDEWEB)

Tian, Z; Shi, F; Jia, X; Jiang, S [UT Southwestern Medical Ctr at Dallas, Dallas, TX (United States); Peng, F [Carnegie Mellon University, Pittsburgh, PA (United States)

2014-06-01

Purpose: GPU has been employed to speed up VMAT optimizations from hours to minutes. However, its limited memory capacity makes it difficult to handle cases with a huge dose-deposition-coefficient (DDC) matrix, e.g. those with a large target size, multiple arcs, small beam angle intervals and/or small beamlet size. We propose multi-GPU-based VMAT optimization to solve this memory issue and make GPU-based VMAT more practical for clinical use. Methods: Our column-generation-based method generates apertures sequentially by iteratively searching for an optimal feasible aperture (referred to as the pricing problem, PP) and optimizing aperture intensities (referred to as the master problem, MP). The PP requires access to the large DDC matrix and is implemented on a multi-GPU system. Each GPU stores a DDC sub-matrix corresponding to one fraction of the beam angles and is responsible only for calculations related to those angles. Broadcast and parallel reduction schemes are adopted for inter-GPU data transfer. The MP is a relatively small-scale problem and is implemented on one GPU. One head-and-neck cancer case was used for testing. Three different strategies for VMAT optimization on a single GPU were also implemented for comparison: (S1) truncating the DDC matrix to ignore its small-value entries; (S2) transferring the DDC matrix part by part to the GPU during optimization whenever needed; (S3) moving DDC-matrix-related calculations onto the CPU. Results: Our multi-GPU-based implementation reaches a good plan within 1 minute. Although S1 was 10 seconds faster than our method, the obtained plan quality is worse. Both S2 and S3 handle the full DDC matrix and hence yield the same plan as our method. However, their computation times are longer, namely 4 minutes and 30 minutes, respectively. Conclusion: Our multi-GPU-based VMAT optimization can effectively solve the limited memory issue with good plan quality and high efficiency, making GPU-based ultra-fast VMAT planning practical for real clinical use.
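The partitioning described in the Methods above (one DDC sub-matrix per device, followed by a reduction) can be illustrated with a small CPU-only sketch. This is not the authors' implementation: the "devices" are simulated as column ranges of a nested list, and the names split_by_angle, partial_dose, reduce_sum and multi_device_dose are hypothetical.

```python
# Sketch only: each simulated device owns the DDC columns for its share of
# beam angles, computes a partial dose vector, and a reduction sums them.

def split_by_angle(n_columns, n_devices):
    """Return contiguous (start, stop) column ranges, one per device."""
    base, extra = divmod(n_columns, n_devices)
    ranges, start = [], 0
    for d in range(n_devices):
        stop = start + base + (1 if d < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges

def partial_dose(ddc, intensities, col_range):
    """Dose contribution of one device's sub-matrix (rows are voxels)."""
    lo, hi = col_range
    return [sum(row[j] * intensities[j] for j in range(lo, hi)) for row in ddc]

def reduce_sum(partials):
    """Element-wise sum of the per-device partial dose vectors."""
    return [sum(values) for values in zip(*partials)]

def multi_device_dose(ddc, intensities, n_devices):
    """Full dose vector assembled from the per-device partial results."""
    ranges = split_by_angle(len(intensities), n_devices)
    return reduce_sum([partial_dose(ddc, intensities, r) for r in ranges])
```

On real hardware the per-device loop would run concurrently and the final sum would be a parallel reduction. The sketch only shows that splitting the matrix by columns leaves the result unchanged.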
FastGCN: a GPU accelerated tool for fast gene co-expression networks.

Directory of Open Access Journals (Sweden)

Meimei Liang

Gene co-expression networks comprise one type of valuable biological networks. Many methods and tools have been published to construct gene co-expression networks; however, most of these tools and methods are inconvenient and time consuming for large datasets. We have developed a user-friendly, accelerated and optimized tool for constructing gene co-expression networks that can fully harness the parallel nature of GPU (Graphic Processing Unit) architectures. Genetic entropies were exploited to filter out genes with no or small expression changes in the raw data preprocessing step. Pearson correlation coefficients were then calculated. After that, we normalized these coefficients and employed the False Discovery Rate to control for multiple testing. Finally, module identification was conducted to construct the co-expression networks. All of these calculations were implemented on a GPU. We also compressed the coefficient matrix to save space. We compared the performance of the GPU implementation with those of a multi-core CPU implementation with 16 CPU threads, a single-thread C/C++ implementation and a single-thread R implementation. Our results show that the GPU implementation largely outperforms the single-thread C/C++ and single-thread R implementations, and that it outperforms the multi-core CPU implementation as the number of genes increases. With the test dataset containing 16,000 genes and 590 individuals, the GPU implementation is more than 63 times faster than the single-thread R implementation when 50 percent of genes were filtered out, and about 80 times faster when no genes were filtered out.
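Two of the steps named in this abstract, the Pearson correlation and the False Discovery Rate control, can be sketched in plain Python. The abstract does not say which FDR procedure the tool uses, so the Benjamini-Hochberg step-up rule below is an assumption. The function names are hypothetical and the p-values are taken as given.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_pairs(expr):
    """expr maps gene name -> expression values; returns {(g, h): r} for g < h."""
    genes = sorted(expr)
    return {(g, h): pearson(expr[g], expr[h])
            for i, g in enumerate(genes) for h in genes[i + 1:]}

def benjamini_hochberg(pvalues, alpha=0.05):
    """Indices of hypotheses rejected by the Benjamini-Hochberg rule (assumed here)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            cutoff = rank          # largest rank that passes its threshold
    return sorted(order[:cutoff])
```

Each gene pair is independent of every other pair, which is what makes the correlation step a natural fit for one GPU thread per pair.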
Computing treewidth on the GPU

NARCIS (Netherlands)

van der Zanden, T.C.; Bodlaender, Hans L.

2017-01-01

We present a parallel algorithm for computing the treewidth of a graph on a GPU. We implement this algorithm in OpenCL, and experimentally evaluate its performance. Our algorithm is based on an $O^*(2^{n})$-time algorithm that explores the elimination orderings of the graph using a Held-Karp like
High performance technique for database applications using a hybrid GPU/CPU platform

KAUST Repository

Zidan, Mohammed A.

2012-07-28

Many database applications, such as sequence comparing, sequence searching, and sequence matching, process large database sequences. We introduce a novel and efficient technique to improve the performance of database applications by using a hybrid GPU/CPU platform. In particular, our technique solves the problem of the low efficiency resulting from running short-length sequences in a database on a GPU. To verify our technique, we applied it to the widely used Smith-Waterman algorithm. The experimental results show that our hybrid GPU/CPU technique improves the average performance by a factor of 2.2, and improves the peak performance by a factor of 2.8 when compared to earlier implementations. Copyright © 2011 by ASME.
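As a rough illustration of the idea in this abstract, the sketch below pairs a textbook Smith-Waterman local alignment score with a length-based split of the database. The abstract does not give the paper's actual dispatch rule, so the threshold and the function partition_by_length are assumptions. The scoring parameters are also illustrative defaults.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best

def partition_by_length(sequences, threshold=64):
    """Hypothetical dispatch: short sequences go to the CPU, long ones to the GPU."""
    cpu = [s for s in sequences if len(s) < threshold]
    gpu = [s for s in sequences if len(s) >= threshold]
    return cpu, gpu
```

In a real hybrid system the threshold would be tuned so that both processors finish at about the same time. The value 64 here is only a placeholder.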
Research on GPU acceleration for Monte Carlo criticality calculation

International Nuclear Information System (INIS)

Xu, Q.; Yu, G.; Wang, K.

2013-01-01

The Monte Carlo (MC) neutron transport method can be naturally parallelized on multi-core architectures because particle histories are independent of one another during the simulation. The GPU+CPU heterogeneous parallel mode has become an increasingly popular form of parallelism in scientific supercomputing. Thus, this work focuses on the GPU acceleration method for Monte Carlo criticality simulation, as well as the computational efficiency that GPUs can bring. The 'neutron transport step' is introduced to increase GPU thread occupancy. In order to test sensitivity to the MC code's complexity, a 1D one-group code and a 3D multi-group general purpose code were each ported to GPUs, and the acceleration effects are compared. The numerical experiments show a considerable acceleration effect from the 'neutron transport step' strategy. However, the performance comparison between the 1D code and the 3D code indicates the poor scalability of MC codes on GPUs. (authors)

Ultra-Fast Image Reconstruction of Tomosynthesis Mammography Using GPU.

Science.gov (United States)

Arefan, D; Talebpour, A; Ahmadinejhad, N; Kamali Asl, A

2015-06-01

Digital Breast Tomosynthesis (DBT) is a technology that creates three-dimensional (3D) images of breast tissue. Tomosynthesis mammography detects lesions that are not detectable with other imaging systems. If image reconstruction time is on the order of seconds, Tomosynthesis systems can be used to perform Tomosynthesis-guided interventional procedures. This research was designed to study an ultra-fast image reconstruction technique for Tomosynthesis Mammography systems using a Graphics Processing Unit (GPU). First, projections of Tomosynthesis mammography were simulated. To produce the Tomosynthesis projections, a 3D breast phantom was designed from empirical data, based on MRI data in its natural form. Projections were then created from the 3D breast phantom. The image reconstruction algorithm, based on FBP, was programmed in C++ in two versions, one using the central processing unit (CPU) and one using the Graphics Processing Unit (GPU). The image reconstruction time was measured for both versions (CPU and GPU).
Ultra-Fast Image Reconstruction of Tomosynthesis Mammography Using GPU

Directory of Open Access Journals (Sweden)

Arefan D

2015-06-01

Digital Breast Tomosynthesis (DBT) is a technology that creates three-dimensional (3D) images of breast tissue. Tomosynthesis mammography detects lesions that are not detectable with other imaging systems. If image reconstruction time is on the order of seconds, Tomosynthesis systems can be used to perform Tomosynthesis-guided interventional procedures. This research was designed to study an ultra-fast image reconstruction technique for Tomosynthesis Mammography systems using a Graphics Processing Unit (GPU). First, projections of Tomosynthesis mammography were simulated. To produce the Tomosynthesis projections, a 3D breast phantom was designed from empirical data, based on MRI data in its natural form. Projections were then created from the 3D breast phantom. The image reconstruction algorithm, based on FBP, was programmed in C++ in two versions, one using the central processing unit (CPU) and one using the Graphics Processing Unit (GPU). The image reconstruction time was measured for both versions (CPU and GPU).
Fast plane wave density functional theory molecular dynamics calculations on multi-GPU machines

International Nuclear Information System (INIS)

Jia, Weile; Fu, Jiyun; Cao, Zongyan; Wang, Long; Chi, Xuebin; Gao, Weiguo; Wang, Lin-Wang

2013-01-01

Plane wave pseudopotential (PWP) density functional theory (DFT) calculation is the most widely used method for material simulations, but its absolute speed stagnated due to the inability to use large scale CPU based computers. Through a drastic redesign of the algorithm, and by moving all the major computation parts onto the GPU, we have reached a speed of 12 s per molecular dynamics (MD) step for a 512 atom system using 256 GPU cards. This is about 20 times faster than the CPU version of the code regardless of the number of CPU cores used. Our tests and analysis on different GPU platforms and configurations shed light on the optimal GPU deployments for PWP-DFT calculations. An 1800 step MD simulation is used to study the liquid phase properties of GaInP
77 FR 22383 - Petition for Exemption From the Federal Motor Vehicle Motor Theft Prevention Standard; TESLA

Science.gov (United States)

2012-04-13

... From the Federal Motor Vehicle Motor Theft Prevention Standard; TESLA AGENCY: National Highway Traffic... exemption. SUMMARY: This document grants in full the petition of Tesla Motors Inc's. (Tesla) for an... 49 CFR Part 541, Federal Motor Vehicle Theft Prevention Standard. Tesla requested confidential...
Simulation of reaction diffusion processes over biologically relevant size and time scales using multi-GPU workstations.

Science.gov (United States)

Hallock, Michael J; Stone, John E; Roberts, Elijah; Fry, Corey; Luthey-Schulten, Zaida

2014-05-01

Simulation of in vivo cellular processes with the reaction-diffusion master equation (RDME) is a computationally expensive task. Our previous software enabled simulation of inhomogeneous biochemical systems for small bacteria over long time scales using the MPD-RDME method on a single GPU. Simulations of larger eukaryotic systems exceed the on-board memory capacity of individual GPUs, and long time simulations of modest-sized cells such as yeast are impractical on a single GPU. We present a new multi-GPU parallel implementation of the MPD-RDME method based on a spatial decomposition approach that supports dynamic load balancing for workstations containing GPUs of varying performance and memory capacity. We take advantage of high-performance features of CUDA for peer-to-peer GPU memory transfers and evaluate the performance of our algorithms on state-of-the-art GPU devices. We present parallel efficiency and performance results for simulations using multiple GPUs as system size, particle counts, and number of reactions grow. We also demonstrate multi-GPU performance in simulations of the Min protein system in E. coli. Moreover, our multi-GPU decomposition and load balancing approach can be generalized to other lattice-based problems.
Design of an 18 Tesla, tandem mirror, fusion reactor, hybrid choke coil

International Nuclear Information System (INIS)

Parmer, J.F.; Agarwal, K.; Gurol, H.; Mancuso, A.; Michels, P.H.; Peck, S.D.; Burgeson, J.; Dalder, E.N.

1987-01-01

A hybrid, part normal part superconducting 18-Tesla solenoid choke coil is designed for a tandem mirror fusion reactor. The present state of the art is represented by the 12-Tesla, superconducting NbSn coil. Future applications other than tandem mirror fusion devices needing high field solenoids might require hybrid magnets of the type described herein. The hybrid design was generated because of critical field performance limitations on present, practical superconducting wires. A hybrid design might be required (due to structural limits) even if the critical field were higher. Also, hybrids could be a cost-effective way of getting very high fields for certain applications. The 18-Tesla solenoid described is composed of an inner coil made of water-cooled, high-strength zirconium copper which generates 3 Tesla. A superconducting NbSn background coil contributes the remaining 15 Tesla. The focus of the design study was on the inner coil. Demonstration fabrication and testing was performed
State-of-the-Art in GPU-Based Large-Scale Volume Visualization

KAUST Repository

Beyer, Johanna

2015-05-01

This survey gives an overview of the current state of the art in GPU techniques for interactive large-scale volume visualization. Modern techniques in this field have brought about a sea change in how interactive visualization and analysis of giga-, tera- and petabytes of volume data can be enabled on GPUs. In addition to combining the parallel processing power of GPUs with out-of-core methods and data streaming, a major enabler for interactivity is making both the computational and the visualization effort proportional to the amount and resolution of data that is actually visible on screen, i.e. 'output-sensitive' algorithms and system designs. This leads to recent output-sensitive approaches that are 'ray-guided', 'visualization-driven' or 'display-aware'. In this survey, we focus on these characteristics and propose a new categorization of GPU-based large-scale volume visualization techniques based on the notions of actual output-resolution visibility and the current working set of volume bricks, i.e. the current subset of data that is minimally required to produce an output image of the desired display resolution. Furthermore, we discuss the differences and similarities of different rendering and data traversal strategies in volume rendering by putting them into a common context: the notion of address translation. For our purposes here, we view parallel (distributed) visualization using clusters as an orthogonal set of techniques that we do not discuss in detail but that can be used in conjunction with what we present in this survey. © 2015 The Eurographics Association and John Wiley & Sons Ltd.
State-of-the-Art in GPU-Based Large-Scale Volume Visualization

KAUST Repository

Beyer, Johanna; Hadwiger, Markus; Pfister, Hanspeter

2015-01-01

This survey gives an overview of the current state of the art in GPU techniques for interactive large-scale volume visualization. Modern techniques in this field have brought about a sea change in how interactive visualization and analysis of giga-, tera- and petabytes of volume data can be enabled on GPUs. In addition to combining the parallel processing power of GPUs with out-of-core methods and data streaming, a major enabler for interactivity is making both the computational and the visualization effort proportional to the amount and resolution of data that is actually visible on screen, i.e. 'output-sensitive' algorithms and system designs. This leads to recent output-sensitive approaches that are 'ray-guided', 'visualization-driven' or 'display-aware'. In this survey, we focus on these characteristics and propose a new categorization of GPU-based large-scale volume visualization techniques based on the notions of actual output-resolution visibility and the current working set of volume bricks, i.e. the current subset of data that is minimally required to produce an output image of the desired display resolution. Furthermore, we discuss the differences and similarities of different rendering and data traversal strategies in volume rendering by putting them into a common context: the notion of address translation. For our purposes here, we view parallel (distributed) visualization using clusters as an orthogonal set of techniques that we do not discuss in detail but that can be used in conjunction with what we present in this survey. © 2015 The Eurographics Association and John Wiley & Sons Ltd.
Computing treewidth on the GPU

NARCIS (Netherlands)

Van Der Zanden, Tom C.; Bodlaender, Hans L.

2018-01-01

We present a parallel algorithm for computing the treewidth of a graph on a GPU. We implement this algorithm in OpenCL, and experimentally evaluate its performance. Our algorithm is based on an $O^*(2^{n})$-time algorithm that explores the elimination orderings of the graph using a Held-Karp like dynamic
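For readers unfamiliar with this family of algorithms, the following is a sequential Python sketch of a well-known subset dynamic program over elimination-order prefixes that runs in O*(2^n) time. It is offered only as an illustration of the technique the abstract alludes to. It is not the authors' OpenCL code, and the function names and bitmask representation are assumptions. For a vertex set S and a vertex v, q_size(S, v) counts the vertices outside S and v that can be reached from v along a path whose interior lies in S.

```python
def treewidth(n, edges):
    """Exact treewidth of a graph on vertices 0..n-1 via dynamic programming over subsets."""
    adj = [0] * n
    for u, v in edges:
        adj[u] |= 1 << v
        adj[v] |= 1 << u

    def q_size(s, v):
        # Breadth-first search from v that only expands vertices inside s.
        seen = 1 << v
        frontier = 1 << v
        reach = 0
        while frontier:
            u = frontier.bit_length() - 1
            frontier &= ~(1 << u)
            nbrs = adj[u] & ~seen
            seen |= nbrs
            reach |= nbrs & ~s       # vertices outside s are counted, not expanded
            frontier |= nbrs & s     # vertices inside s are expanded further
        return bin(reach).count("1")

    tw = [0] * (1 << n)              # tw[s]: best width when s is eliminated first
    for s in range(1, 1 << n):
        best = n
        rest = s
        while rest:
            v = rest.bit_length() - 1
            rest &= ~(1 << v)
            prev = s & ~(1 << v)     # v is the last vertex of s to be eliminated
            best = min(best, max(tw[prev], q_size(prev, v)))
        tw[s] = best
    return tw[(1 << n) - 1]
```

All subsets of the same size depend only on smaller subsets, so each size class could be processed in parallel. That is the kind of structure a GPU implementation can exploit.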
Transportable GPU (General Processor Units) chip set technology for standard computer architectures

Science.gov (United States)

Fosdick, R. E.; Denison, H. C.

1982-11-01

The USAFR-developed GPU Chip Set has been utilized by Tracor to implement both USAF and Navy Standard 16-Bit Airborne Computer Architectures. Both configurations are currently being delivered into DOD full-scale development programs. Leadless Hermetic Chip Carrier packaging has facilitated implementation of both architectures on single 4 1/2 x 5 substrates. The CMOS and CMOS/SOS implementations of the GPU Chip Set have allowed both CPU implementations to use less than 3 watts of power each. Recent efforts by Tracor for USAF have included the definition of a next-generation GPU Chip Set that will retain the application-proven architecture of the current chip set while offering the added cost advantages of transportability across ISO-CMOS and CMOS/SOS processes and across numerous semiconductor manufacturers using a newly-defined set of common design rules. The Enhanced GPU Chip Set will increase speed by an approximate factor of 3 while significantly reducing chip counts and costs of standard CPU implementations.
Numerical simulation of lava flow using a GPU SPH model

Directory of Open Access Journals (Sweden)

Eugenio Rustico

2011-12-01

A smoothed particle hydrodynamics (SPH) method for lava-flow modeling was implemented on a graphical processing unit (GPU) using the compute unified device architecture (CUDA) developed by NVIDIA. This resulted in speed-ups of up to two orders of magnitude. The three-dimensional model can simulate lava flow on a real topography with free-surface, non-Newtonian fluids, and with phase change. The entire SPH code has three main components, neighbor list construction, force computation, and integration of the equation of motion, and it is computed on the GPU, fully exploiting the computational power. The simulation speed achieved is one to two orders of magnitude faster than the equivalent central processing unit (CPU) code. This GPU implementation of SPH allows high resolution SPH modeling in hours and days, rather than in weeks and months, on inexpensive and readily available hardware.
Imaging of patients with hippocampal sclerosis at 7 Tesla: initial results.

Science.gov (United States)

Breyer, Tobias; Wanke, Isabel; Maderwald, Stefan; Woermann, Friedrich G; Kraff, Oliver; Theysohn, Jens M; Ebner, Alois; Forsting, Michael; Ladd, Mark E; Schlamann, Marc

2010-04-01

Focal epilepsies potentially can be cured by neurosurgery; other treatment options usually remain symptomatic. High-resolution magnetic resonance (MR) imaging is the central imaging strategy in the evaluation of focal epilepsy. The most common substrate of temporal epilepsies is hippocampal sclerosis (HS), which cannot always be sufficiently characterized with current MR field strengths. Therefore, the purpose of our study was to demonstrate the feasibility of high-resolution MR imaging at 7 Tesla in patients with focal epilepsy resulting from a HS and to improve image resolution at 7 Tesla in patients with HS. Six patients with known HS were investigated with T1-, T2-, T2*-, and fluid-attenuated inversion recovery-weighted sequences at 7 Tesla with an eight-channel transmit-receive head coil. Total imaging time did not exceed 90 minutes per patient. High-resolution imaging at 7 Tesla is feasible and reveals high resolution of intrahippocampal structures in vivo. HS was confirmed in all patients. The maximum non-interpolated in-plane resolution reached 0.2 x 0.2 mm² in T2*-weighted images. The increased susceptibility effects at 7 Tesla revealed identification of intrahippocampal structures in more detail than at 1.5 Tesla, but otherwise led to stronger artifacts. Imaging revealed regional differences in hippocampal atrophy between patients. The scan volume was limited because of specific absorption rate restrictions, scanning time was reasonable. High-resolution imaging at 7 Tesla is promising in presurgical epilepsy imaging. "New" contrasts may further improve detection of even very small intrahippocampal structural changes. Therefore, further investigations will be necessary to demonstrate the potential benefit for presurgical selection of patients with various lesion patterns in mesial temporal epilepsies resulting from a unilateral HS. Copyright 2010 AUR. Published by Elsevier Inc. All rights reserved.
Parallel GPU implementation of iterative PCA algorithms.

Science.gov (United States)

Andrecut, M

2009-11-01

Principal component analysis (PCA) is a key statistical technique for multivariate data analysis. For large data sets, the common approach to PCA computation is based on the standard NIPALS-PCA algorithm, which unfortunately suffers from loss of orthogonality, and therefore its applicability is usually limited to the estimation of the first few components. Here we present an algorithm based on Gram-Schmidt orthogonalization (called GS-PCA), which eliminates this shortcoming of NIPALS-PCA. Also, we discuss the GPU (Graphics Processing Unit) parallel implementation of both NIPALS-PCA and GS-PCA algorithms. The numerical results show that the GPU parallel optimized versions, based on CUBLAS (NVIDIA), are substantially faster (up to 12 times) than the CPU optimized versions based on CBLAS (GNU Scientific Library).
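The role of Gram-Schmidt orthogonalization in this setting can be shown with a small pure-Python sketch. It is a NIPALS-style power iteration in which each new loading vector is re-orthogonalized against the earlier ones at every step. It is not claimed to be the author's GS-PCA algorithm, and the function names and the deterministic start vector are assumptions made for illustration. The input rows are assumed to be mean-centred already.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormalize(v, basis):
    """Gram-Schmidt step: remove components along basis, then normalise."""
    for b in basis:
        c = dot(v, b)
        v = [x - c * y for x, y in zip(v, b)]
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def leading_components(X, k, iters=200):
    """Return k orthonormal loading vectors for the centred data rows X."""
    n_vars = len(X[0])
    loadings = []
    for c in range(k):
        # Deterministic start vector (an arbitrary choice for this sketch).
        p = orthonormalize([1.0 + j + c for j in range(n_vars)], loadings)
        for _ in range(iters):
            t = [dot(row, p) for row in X]                       # scores t = X p
            p_new = [sum(t[i] * X[i][j] for i in range(len(X)))
                     for j in range(n_vars)]                     # X^T t
            p = orthonormalize(p_new, loadings)                  # keep p orthogonal
        loadings.append(p)
    return loadings
```

The two matrix-vector products inside the loop are the operations that a BLAS library such as CUBLAS would carry out on the GPU.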
GPU seeks new funding for TMI cleanup

International Nuclear Information System (INIS)

Utroska, D.

1982-01-01

General Public Utilities (GPU) wants approval for annual transfer of money from base rate increases in other accounts to pay for the cleanup at Three Mile Island (TMI) until TMI-1 returns to service or the public utility commission takes further action. This proposal confirms fears of a delay in TMI-1 startup and demonstrates that the January negotiated settlement will produce little funding for TMI-2 cleanup. A review of the settlement terms outlines the three-step process for base rate increases and revenue adjustments after the startup of TMI-1, and points out where controversy and delays due to psychological stress make a new source of money essential. GPU thinks customer funding will motivate other parties to a broad-based cost-sharing agreement
GPU-Monte Carlo based fast IMRT plan optimization

Directory of Open Access Journals (Sweden)

Yongbao Li

2014-03-01

Purpose: Intensity-modulated radiation treatment (IMRT) plan optimization needs pre-calculated beamlet dose distributions. Pencil-beam or superposition/convolution type algorithms are typically used because of their high computation speed. However, inaccurate beamlet dose distributions, particularly in cases with high levels of inhomogeneity, may mislead optimization, hindering the resulting plan quality. It is desirable to use Monte Carlo (MC) methods for beamlet dose calculations. Yet, the long computational time from repeated dose calculations for a number of beamlets prevents this application. It is our objective to integrate a GPU-based MC dose engine in lung IMRT optimization using a novel two-step workflow. Methods: A GPU-based MC code, gDPM, is used. Each particle is tagged with the index of the beamlet from which the source particle originates. Deposited doses are stored separately for each beamlet based on the index. Due to limited GPU memory size, a pyramid space is allocated for each beamlet, and dose outside the space is neglected. A two-step optimization workflow is proposed for fast MC-based optimization. In the first step, a rough dose calculation is conducted with only a small number of particles per beamlet. Plan optimization follows to obtain an approximate fluence map. In the second step, more accurate beamlet doses are calculated, where the number of particles sampled for a beamlet is proportional to the intensity determined previously. A second-round optimization is conducted, yielding the final result. Results: For a lung case with 5317 beamlets, 10^5 particles per beamlet in the first round and 10^8 particles per beam in the second round are enough to get a good plan quality. The total simulation time is 96.4 sec. Conclusion: A fast GPU-based MC dose calculation method along with a novel two-step optimization workflow are developed. The high efficiency allows the use of MC for IMRT optimizations.
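The second step of the workflow, in which the particle count per beamlet follows the fluence found in the first round, amounts to a proportional allocation. The sketch below shows that arithmetic only. The function name allocate_particles and the rounding choice are assumptions and are not taken from the gDPM code.

```python
def allocate_particles(intensities, total_particles):
    """Split a particle budget across beamlets in proportion to their intensity.

    Counts are rounded independently, so they may not sum exactly to the budget.
    """
    total = sum(intensities)
    if total <= 0:
        raise ValueError("at least one beamlet must have positive intensity")
    return [round(total_particles * w / total) for w in intensities]
```

For example, allocate_particles([1, 3, 0], 1000) gives the third beamlet no particles, so no simulation time is spent on a beamlet that the first-round plan switched off.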
The Superconducting TESLA Cavities

CERN Document Server

Aune, B.; Bloess, D.; Bonin, B.; Bosotti, A.; Champion, M.; Crawford, C.; Deppe, G.; Dwersteg, B.; Edwards, D.A.; Edwards, H.T.; Ferrario, M.; Fouaidy, M.; Gall, P-D.; Gamp, A.; Gössel, A.; Graber, J.; Hubert, D.; Hüning, M.; Juillard, M.; Junquera, T.; Kaiser, H.; Kreps, G.; Kuchnir, M.; Lange, R.; Leenen, M.; Liepe, M.; Lilje, L.; Matheisen, A.; Möller, W-D.; Mosnier, A.; Padamsee, H.; Pagani, C.; Pekeler, M.; Peters, H-B.; Peters, O.; Proch, D.; Rehlich, K.; Reschke, D.; Safa, H.; Schilcher, T.; Schmüser, P.; Sekutowicz, J.; Simrock, S.; Singer, W.; Tigner, M.; Trines, D.; Twarowski, K.; Weichert, G.; Weisend, J.; Wojtkiewicz, J.; Wolff, S.; Zapfe, K.

2000-01-01

The conceptional design of the proposed linear electron-positron collider TESLA is based on 9-cell 1.3 GHz superconducting niobium cavities with an accelerating gradient of Eacc >= 25 MV/m at a quality factor Q0 > 5E+9. The design goal for the cavities of the TESLA Test Facility (TTF) linac was set to the more moderate value of Eacc >= 15 MV/m. In a first series of 27 industrially produced TTF cavities the average gradient at Q0 = 5E+9 was measured to be 20.1 ± 6.2 MV/m, excluding a few cavities suffering from serious fabrication or material defects. In the second production of 24 TTF cavities additional quality control measures were introduced, in particular an eddy-current scan to eliminate niobium sheets with foreign material inclusions and stringent prescriptions for carrying out the electron-beam welds. The average gradient of these cavities at Q0 = 5E+9 amounts to 25.0 ± 3.2 MV/m with the exception of one cavity suffering from a weld defect. Hence only a moderate improvement in production and preparation technique...
GPU-accelerated automatic identification of robust beam setups for proton and carbon-ion radiotherapy

International Nuclear Information System (INIS)

Ammazzalorso, F; Jelen, U; Bednarz, T

2014-01-01

We demonstrate acceleration on graphic processing units (GPU) of automatic identification of robust particle therapy beam setups, minimizing negative dosimetric effects of Bragg peak displacement caused by treatment-time patient positioning errors. Our particle therapy research toolkit, RobuR, was extended with OpenCL support and used to implement calculation on GPU of the Port Homogeneity Index, a metric scoring irradiation port robustness through analysis of tissue density patterns prior to dose optimization and computation. Results were benchmarked against an independent native CPU implementation. Numerical results were in agreement between the GPU implementation and native CPU implementation. For 10 skull base cases, the GPU-accelerated implementation was employed to select beam setups for proton and carbon ion treatment plans, which proved to be dosimetrically robust, when recomputed in presence of various simulated positioning errors. From the point of view of performance, average running time on the GPU decreased by at least one order of magnitude compared to the CPU, rendering the GPU-accelerated analysis a feasible step in a clinical treatment planning interactive session. In conclusion, selection of robust particle therapy beam setups can be effectively accelerated on a GPU and become an unintrusive part of the particle therapy treatment planning workflow. Additionally, the speed gain opens new usage scenarios, like interactive analysis manipulation (e.g. constraining of some setup) and re-execution. Finally, through OpenCL portable parallelism, the new implementation is suitable also for CPU-only use, taking advantage of multiple cores, and can potentially exploit types of accelerators other than GPUs.
GPU-accelerated automatic identification of robust beam setups for proton and carbon-ion radiotherapy

Science.gov (United States)

Ammazzalorso, F.; Bednarz, T.; Jelen, U.

2014-03-01

We demonstrate acceleration on graphic processing units (GPU) of automatic identification of robust particle therapy beam setups, minimizing negative dosimetric effects of Bragg peak displacement caused by treatment-time patient positioning errors. Our particle therapy research toolkit, RobuR, was extended with OpenCL support and used to implement calculation on GPU of the Port Homogeneity Index, a metric scoring irradiation port robustness through analysis of tissue density patterns prior to dose optimization and computation. Results were benchmarked against an independent native CPU implementation. Numerical results were in agreement between the GPU implementation and native CPU implementation. For 10 skull base cases, the GPU-accelerated implementation was employed to select beam setups for proton and carbon ion treatment plans, which proved to be dosimetrically robust, when recomputed in presence of various simulated positioning errors. From the point of view of performance, average running time on the GPU decreased by at least one order of magnitude compared to the CPU, rendering the GPU-accelerated analysis a feasible step in a clinical treatment planning interactive session. In conclusion, selection of robust particle therapy beam setups can be effectively accelerated on a GPU and become an unintrusive part of the particle therapy treatment planning workflow. Additionally, the speed gain opens new usage scenarios, like interactive analysis manipulation (e.g. constraining of some setup) and re-execution. Finally, through OpenCL portable parallelism, the new implementation is suitable also for CPU-only use, taking advantage of multiple cores, and can potentially exploit types of accelerators other than GPUs.
CUDA/GPU Technology : Parallel Programming For High Performance Scientific Computing

OpenAIRE

YUHENDRA; KUZE, Hiroaki; JOSAPHAT, Tetuko Sri Sumantyo

2009-01-01

Graphics processing units (GPUs), originally designed for computer video cards, have emerged as the most powerful chips in a high-performance workstation. In high-performance computing, GPUs deliver much more powerful performance than conventional CPUs by means of parallel processing. In 2007, the birth of Compute Unified Device Architecture (CUDA) and CUDA-enabled GPUs by NVIDIA Corporation brought a revolution in the general purpose GPU a...
GPU-accelerated 3D neutron diffusion code based on finite difference method

Energy Technology Data Exchange (ETDEWEB)

Xu, Q.; Yu, G.; Wang, K. [Dept. of Engineering Physics, Tsinghua Univ. (China)

2012-07-01

The finite difference method, a traditional numerical solution to the neutron diffusion equation, is considered simpler and more precise than coarse mesh nodal methods, but its wide application is hindered by the huge memory and long computation time it requires. In recent years, the concept of General-Purpose computation on GPUs has provided us with a powerful computational engine for scientific research. In this study, a GPU-accelerated multi-group 3D neutron diffusion code based on the finite difference method was developed. First, a clean-sheet neutron diffusion code (3DFD-CPU) was written in C++ on the CPU architecture, and later ported to GPUs under NVIDIA's CUDA platform (3DFD-GPU). The IAEA 3D PWR benchmark problem was calculated in the numerical test, where three different codes, including the original CPU-based sequential code, the HYPRE (High Performance Pre-conditioners)-based diffusion code and CITATION, were used as counterpoints to test the efficiency and accuracy of the GPU-based program. The results demonstrate both high efficiency and adequate accuracy of the GPU implementation for the neutron diffusion equation. A speedup factor of about 46 was obtained using NVIDIA's GeForce GTX470 GPU card against a 2.50 GHz Intel Quad Q9300 CPU processor. Compared with the HYPRE-based code running in parallel on an 8-core tower server, a speedup of about 2 could still be observed. More encouragingly, without any mathematical acceleration technology, the GPU implementation ran about 5 times faster than CITATION, which was sped up by using the SOR method and Chebyshev extrapolation technique. (authors)

GPU-accelerated 3D neutron diffusion code based on finite difference method

International Nuclear Information System (INIS)

Xu, Q.; Yu, G.; Wang, K.

2012-01-01

The finite difference method, a traditional numerical solution to the neutron diffusion equation, is considered simpler and more precise than coarse mesh nodal methods, but its wide application is hindered by the huge memory and long computation time it requires. In recent years, the concept of General-Purpose computation on GPUs has provided us with a powerful computational engine for scientific research. In this study, a GPU-accelerated multi-group 3D neutron diffusion code based on the finite difference method was developed. First, a clean-sheet neutron diffusion code (3DFD-CPU) was written in C++ on the CPU architecture, and later ported to GPUs under NVIDIA's CUDA platform (3DFD-GPU). The IAEA 3D PWR benchmark problem was calculated in the numerical test, where three different codes, including the original CPU-based sequential code, the HYPRE (High Performance Pre-conditioners)-based diffusion code and CITATION, were used as counterpoints to test the efficiency and accuracy of the GPU-based program. The results demonstrate both high efficiency and adequate accuracy of the GPU implementation for the neutron diffusion equation. A speedup factor of about 46 was obtained using NVIDIA's GeForce GTX470 GPU card against a 2.50 GHz Intel Quad Q9300 CPU processor. Compared with the HYPRE-based code running in parallel on an 8-core tower server, a speedup of about 2 could still be observed. More encouragingly, without any mathematical acceleration technology, the GPU implementation ran about 5 times faster than CITATION, which was sped up by using the SOR method and Chebyshev extrapolation technique. (authors)
GPU-accelerated adjoint algorithmic differentiation

Science.gov (United States)

Gremse, Felix; Höfter, Andreas; Razik, Lukas; Kiessling, Fabian; Naumann, Uwe

2016-03-01

Many scientific problems such as classifier training or medical image reconstruction can be expressed as minimization of differentiable real-valued cost functions and solved with iterative gradient-based methods. Adjoint algorithmic differentiation (AAD) enables automated computation of gradients of such cost functions implemented as computer programs. To backpropagate adjoint derivatives, excessive memory is potentially required to store the intermediate partial derivatives on a dedicated data structure, referred to as the "tape". Parallelization is difficult because threads need to synchronize their accesses during taping and backpropagation. This situation is aggravated for many-core architectures, such as Graphics Processing Units (GPUs), because of the large number of light-weight threads and the limited memory size in general as well as per thread. We show how these limitations can be mediated if the cost function is expressed using GPU-accelerated vector and matrix operations which are recognized as intrinsic functions by our AAD software. We compare this approach with naive and vectorized implementations for CPUs. We use four increasingly complex cost functions to evaluate the performance with respect to memory consumption and gradient computation times. Using vectorization, CPU and GPU memory consumption could be substantially reduced compared to the naive reference implementation, in some cases even by an order of complexity. The vectorization allowed usage of optimized parallel libraries during forward and reverse passes which resulted in high speedups for the vectorized CPU version compared to the naive reference implementation. The GPU version achieved an additional speedup of 7.5 ± 4.4, showing that the processing power of GPUs can be utilized for AAD using this concept. Furthermore, we show how this software can be systematically extended for more complex problems such as nonlinear absorption reconstruction for fluorescence-mediated tomography.
Using GPU to calculate electron dose for hybrid pencil beam model

International Nuclear Information System (INIS)

Guo Chengjun; Li Xia; Hou Qing; Wu Zhangwen

2011-01-01

The hybrid pencil beam model (HPBM) offers an efficient approach to calculating the three-dimensional dose distribution from a clinical electron beam. Still, clinical radiation treatment demands a faster treatment planning process. Our work presents a fast implementation of HPBM-based electron dose calculation using a graphics processing unit (GPU). The HPBM algorithm was implemented in Compute Unified Device Architecture (CUDA) running on the GPU, and in C running on the CPU. Several tests with various field, beamlet and voxel sizes were used to evaluate our implementation. On an NVIDIA GeForce GTX470 GPU card, we achieved speedup factors of 2.18-98.23 with acceptable accuracy, compared with the results from a Pentium E5500 2.80 GHz dual-core CPU. (authors)
A GPU-accelerated and Monte Carlo-based intensity modulated proton therapy optimization system

Energy Technology Data Exchange (ETDEWEB)

Ma, Jiasen, E-mail: ma.jiasen@mayo.edu; Beltran, Chris; Seum Wan Chan Tseung, Hok; Herman, Michael G. [Department of Radiation Oncology, Division of Medical Physics, Mayo Clinic, 200 First Street Southwest, Rochester, Minnesota 55905 (United States)

2014-12-15

Purpose: Conventional spot scanning intensity modulated proton therapy (IMPT) treatment planning systems (TPSs) optimize proton spot weights based on analytical dose calculations. These analytical dose calculations have been shown to have severe limitations in heterogeneous materials. Monte Carlo (MC) methods do not have these limitations; however, MC-based systems have been of limited clinical use due to the large number of beam spots in IMPT and the extremely long calculation time of traditional MC techniques. In this work, the authors present a clinically applicable IMPT TPS that utilizes a very fast MC calculation. Methods: An in-house graphics processing unit (GPU)-based MC dose calculation engine was employed to generate the dose influence map for each proton spot. With the MC generated influence map, a modified least-squares optimization method was used to achieve the desired dose volume histograms (DVHs). The intrinsic CT image resolution was adopted for voxelization in simulation and optimization to preserve spatial resolution. The optimizations were computed on a multi-GPU framework to mitigate the memory limitation issues for the large dose influence maps that resulted from maintaining the intrinsic CT resolution. The effects of tail cutoff and starting condition were studied and minimized in this work. Results: For relatively large and complex three-field head and neck cases, i.e., >100 000 spots with a target volume of ∼1000 cm³ and multiple surrounding critical structures, the optimization together with the initial MC dose influence map calculation was done in a clinically viable time frame (less than 30 min) on a GPU cluster consisting of 24 Nvidia GeForce GTX Titan cards. The in-house MC TPS plans were comparable to commercial TPS plans based on DVH comparisons. Conclusions: An MC-based treatment planning system was developed. The treatment planning can be performed in a clinically viable time frame on a hardware system costing around 45...
A GPU-accelerated and Monte Carlo-based intensity modulated proton therapy optimization system.

Science.gov (United States)

Ma, Jiasen; Beltran, Chris; Seum Wan Chan Tseung, Hok; Herman, Michael G

2014-12-01

Conventional spot scanning intensity modulated proton therapy (IMPT) treatment planning systems (TPSs) optimize proton spot weights based on analytical dose calculations. These analytical dose calculations have been shown to have severe limitations in heterogeneous materials. Monte Carlo (MC) methods do not have these limitations; however, MC-based systems have been of limited clinical use due to the large number of beam spots in IMPT and the extremely long calculation time of traditional MC techniques. In this work, the authors present a clinically applicable IMPT TPS that utilizes a very fast MC calculation. An in-house graphics processing unit (GPU)-based MC dose calculation engine was employed to generate the dose influence map for each proton spot. With the MC generated influence map, a modified least-squares optimization method was used to achieve the desired dose volume histograms (DVHs). The intrinsic CT image resolution was adopted for voxelization in simulation and optimization to preserve spatial resolution. The optimizations were computed on a multi-GPU framework to mitigate the memory limitation issues for the large dose influence maps that resulted from maintaining the intrinsic CT resolution. The effects of tail cutoff and starting condition were studied and minimized in this work. For relatively large and complex three-field head and neck cases, i.e., >100,000 spots with a target volume of ∼1000 cm³ and multiple surrounding critical structures, the optimization together with the initial MC dose influence map calculation was done in a clinically viable time frame (less than 30 min) on a GPU cluster consisting of 24 Nvidia GeForce GTX Titan cards. The in-house MC TPS plans were comparable to commercial TPS plans based on DVH comparisons. An MC-based treatment planning system was developed. The treatment planning can be performed in a clinically viable time frame on a hardware system costing around 45,000 dollars. The fast calculation and
Risk management at GPU Nuclear

International Nuclear Information System (INIS)

Long, R.L.

1991-01-01

This paper reports on GPU Nuclear. Among other goals, it established the independence of key safety functions as highlighted by the lessons learned from the accident. In particular, an independent Nuclear Assurance Division was established which included Quality Assurance, Training and Education, Emergency Preparedness, and Nuclear Safety Assessment. The latter consisted of corporate and site independent-safety-review groups. As the GPU Nuclear organization matured, a mid-1987 reorganization created an even more focused Planning and Nuclear Safety Division, bringing together Nuclear Safety Assessment with Licensing and Regulatory Affairs and Risk Management. The Risk Management Group (RMG), which began its work in fall 1987, was formed to develop a framework for proactive identification, evaluation, and cost-effective reduction and management of risks of all types. The RMG set out to learn as much as possible about risks and their management in nuclear and other high-technology industries. This began with a thorough literature search. It progressed to interviews with individuals and organizations which have demonstrated innovative ideas, experience, and reputations for safe and reliable operation.
Optic Nerve Assessment Using 7-Tesla Magnetic Resonance Imaging.

Science.gov (United States)

Singh, Arun D; Platt, Sean M; Lystad, Lisa; Lowe, Mark; Oh, Sehong; Jones, Stephen E; Alzahrani, Yahya; Plesec, Thomas

2016-04-01

The purpose of this study was to correlate high-resolution magnetic resonance imaging (MRI) and histologic findings in a case of juxtapapillary choroidal melanoma with clinical evidence of optic nerve invasion. With institutional review board approval, an enucleated globe with choroidal melanoma and optic nerve invasion was imaged using a 7-tesla MRI followed by histopathologic evaluation. Optical coherence tomography, B-scan ultrasonography, and 1.5-tesla MRI of the orbit (1-mm sections) could not detect optic disc invasion. Ex vivo, 7-tesla MRI detected optic nerve invasion, which correlated with histopathologic features. Our case demonstrates the potential to document the existence of optic nerve invasion in the presence of an intraocular tumor, a feature that has a major bearing on decision making, particularly for consideration of enucleation.
[Examination of upper abdominal region in high spatial resolution diffusion-weighted imaging using 3-Tesla MRI].

Science.gov (United States)

Terada, Masaki; Matsushita, Hiroki; Oosugi, Masanori; Inoue, Kazuyasu; Yaegashi, Taku; Anma, Takeshi

2009-03-20

The higher signal-to-noise ratio (SNR) of 3-Tesla magnetic resonance imaging (3-Tesla) may help improve spatial resolution without causing image deterioration. In this study, we compared the SNR and the apparent diffusion coefficient (ADC) value at 3-Tesla with those at 1.5-Tesla magnetic resonance imaging (1.5-Tesla), using the 1.5-Tesla diffusion-weighted imaging (DWI) parameters as the condition, and we examined high spatial resolution images with respect to imaging method [respiratory-triggering (RT) method and breath free (BF) method] and artifacts (motion and zebra) in DWI of the upper abdominal region at 3-Tesla. We optimized the scan parameters based on phantom and in vivo studies. As a result, 3-Tesla provided about 1.5 times the SNR of 1.5-Tesla, while the ADC values differed little. Moreover, the RT method was more effective than the BF method in correcting the influence of respiratory movement, and it improved image quality through effective SNR acquisition and artifact reduction. Thus, DWI of the upper abdominal region at 3-Tesla was a useful sequence for high spatial resolution imaging.
Parallel implementation of DNA sequences matching algorithms using PWM on GPU architecture.

Science.gov (United States)

Sharma, Rahul; Gupta, Nitin; Narang, Vipin; Mittal, Ankush

2011-01-01

Positional Weight Matrices (PWMs) are widely used in the representation and detection of Transcription Factor Binding Sites (TFBSs) on DNA. We implement an online PWM search algorithm on a parallel architecture. Large PWM data sets can be processed in parallel on Graphics Processing Unit (GPU) systems, which helps match sequences at a faster rate. Our method makes extensive use of the highly multithreaded architecture and shared memory of multi-core GPUs. Efficient use of shared memory is required to optimise parallel reduction in CUDA. Our optimised method achieves a speedup of 230-280x over the linear implementation on a GeForce GTX 280 GPU.
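
As a rough illustration of what a PWM search computes, the sketch below scores every window of a DNA sequence against a small weight matrix. It is a sequential Python sketch, not the authors' CUDA code: the count table `COUNTS`, the uniform background of 0.25, the pseudocount and all function names are assumptions made for this example. Each window score is independent of the others, which is the property that makes the search a natural fit for one GPU thread per window.

```python
import math

def log_odds_pwm(counts, background=0.25, pseudocount=1.0):
    """Turn per-position base counts into a log-odds weight matrix."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column.get(base, 0) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm

def scan(sequence, pwm):
    """Score every window of len(pwm) bases; windows do not depend on each other."""
    width = len(pwm)
    return [
        sum(pwm[j][sequence[i + j]] for j in range(width))
        for i in range(len(sequence) - width + 1)
    ]

def best_hit(sequence, pwm):
    """Return (position, score) of the highest-scoring window."""
    scores = scan(sequence, pwm)
    position = max(range(len(scores)), key=scores.__getitem__)
    return position, scores[position]

# Hypothetical 3-position motif built from made-up counts (roughly "TAT").
COUNTS = [{"T": 8}, {"A": 8}, {"T": 6, "C": 2}]
PWM = log_odds_pwm(COUNTS)
```

Calling `best_hit("GGTATGG", PWM)` returns the start position of the window that best matches the motif, together with its log-odds score.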
Nikola Tesla Educational Opportunity School.

Science.gov (United States)

Design Cost Data, 2001

2001-01-01

Describes the architectural design, costs, general description, and square footage data for the Nikola Tesla Educational Opportunity School in Colorado Springs, Colorado. A floor plan and photos are included along with a list of manufacturers and suppliers used for the project. (GR)
MRI of the carotid artery at 7 Tesla: Quantitative comparison with 3 Tesla

NARCIS (Netherlands)

Koning, Wouter; De Rotte, Alexandra A J; Bluemink, Johanna J.; Van Der Velden, Tijl A.; Luijten, Peter R.; Klomp, DWJ; Zwanenburg, Jaco J M

2015-01-01

Purpose: To evaluate the 7 Tesla (T) MRI of the carotid arteries, as quantitatively compared with 3T. Materials and Methods: The 7T MRI of the carotid arteries was performed in six healthy subjects and in two patients with carotid stenosis. The healthy group was scanned at 3T and at 7T, using
76 FR 47639 - Tesla Motors, Inc.; Receipt of Petition for Temporary Exemption From the Electronic Stability...

Science.gov (United States)

2011-08-05

...-0110] Tesla Motors, Inc.; Receipt of Petition for Temporary Exemption From the Electronic Stability... accordance with the procedures in 49 CFR part 555, Tesla Motors, Inc., has petitioned the agency for a... part 555, Tesla Motors, Inc. (Tesla) submitted a petition dated June 7, 2001 asking the agency for a...
Using 3 Tesla magnetic resonance imaging in the pre-operative evaluation of tongue carcinoma.

Science.gov (United States)

Moreno, K F; Cornelius, R S; Lucas, F V; Meinzen-Derr, J; Patil, Y J

2017-09-01

This study aimed to evaluate the role of 3 Tesla magnetic resonance imaging in predicting tongue tumour thickness via direct and reconstructed measures, and their correlations with corresponding histological measures, nodal metastasis and extracapsular spread. A prospective study was conducted of 25 patients with histologically proven squamous cell carcinoma of the tongue and pre-operative 3 Tesla magnetic resonance imaging from 2009 to 2012. Correlations between 3 Tesla magnetic resonance imaging and histological measures of tongue tumour thickness were assessed using the Pearson correlation coefficient: r values were 0.84 (p Tesla magnetic resonance imaging had 83 per cent sensitivity, 82 per cent specificity, 82 per cent accuracy and a 90 per cent negative predictive value for detecting cervical lymph node metastasis. In this cohort, 3 Tesla magnetic resonance imaging measures of tumour thickness correlated highly with the corresponding histological measures. Further, 3 Tesla magnetic resonance imaging was an effective method of detecting malignant adenopathy with extracapsular spread.
GPU-based cone beam computed tomography.

Science.gov (United States)

Noël, Peter B; Walczak, Alan M; Xu, Jinhui; Corso, Jason J; Hoffmann, Kenneth R; Schafer, Sebastian

2010-06-01

The use of cone beam computed tomography (CBCT) is growing in the clinical arena due to its ability to provide 3D information during interventions, its high diagnostic quality (sub-millimeter resolution), and its short scanning times (60 s). In many situations, the short scanning time of CBCT is followed by a time-consuming 3D reconstruction. The standard reconstruction algorithm for CBCT data is the filtered backprojection, which for a volume of size 256³ takes up to 25 min on a standard system. Recent developments in the area of Graphic Processing Units (GPUs) make it possible to have access to high-performance computing solutions at a low cost, allowing their use in many scientific problems. We have implemented an algorithm for 3D reconstruction of CBCT data using the Compute Unified Device Architecture (CUDA) provided by NVIDIA (NVIDIA Corporation, Santa Clara, California), which was executed on an NVIDIA GeForce GTX 280. Our implementation results in improved reconstruction times from minutes, and perhaps hours, to a matter of seconds, while also giving the clinician the ability to view 3D volumetric data at higher resolutions. We evaluated our implementation on ten clinical data sets and one phantom data set to observe if differences occur between CPU and GPU-based reconstructions. By using our approach, the computation time for 256³ is reduced from 25 min on the CPU to 3.2 s on the GPU. The GPU reconstruction time for 512³ volumes is 8.5 s. Copyright 2009 Elsevier Ireland Ltd. All rights reserved.
Study of the acceleration of nuclide burnup calculation using GPU with CUDA

International Nuclear Information System (INIS)

Okui, S.; Ohoka, Y.; Tatsumi, M.

2009-01-01

The computational cost of neutronics calculation codes becomes higher as physics models and methods grow more complicated. The degree of such sophistication in neutronics calculations tends to be limited by the available computing power. In order to open a door to the new world, use of the GPU for general purpose computing, called GPGPU, has been studied [1]. The GPU has a multi-threaded computing mechanism enabled by multiple processors, which achieves much higher performance than CPUs. NVIDIA recently released the CUDA language for general purpose computation, which is a C-like programming language. It is relatively easy to learn compared to the conventional ones used for GPGPU, such as OpenGL or Cg. Therefore, applying the GPU to numerical calculation has become much easier. In this paper, we tried to accelerate the nuclide burnup calculation, which is important for predicting the time dependence of nuclides in the core, using a GPU with CUDA. We chose the 4th-order Runge-Kutta method to solve the nuclide burnup equation. The nuclide burnup calculation and the 4th-order Runge-Kutta method were suitable as a first step in introducing CUDA into numerical calculation because they consist of simple single-precision matrix and vector operations, where the actual codes were written in the C++ language. Our experimental results showed that nuclide burnup calculations on a GPU can potentially be sped up by a factor of 100 compared to those on a CPU. (authors)
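
To make the "simple operations of matrices and vectors" concrete, here is a minimal Python sketch of the classical 4th-order Runge-Kutta scheme applied to a linear burnup system dN/dt = A·N. The two-nuclide decay chain, the decay constants `LAMBDA_1` and `LAMBDA_2`, and all function names are assumptions chosen for illustration; the abstract does not give the authors' burnup matrix, and their code was written in C++ with CUDA in single precision. Every stage is a matrix-vector product followed by a vector update, which is the part that maps onto GPU threads.

```python
import math

def matvec(a, x):
    """Dense matrix-vector product."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]

def rk4_step(a, n, dt):
    """One classical RK4 step for dN/dt = A N."""
    k1 = matvec(a, n)
    k2 = matvec(a, [ni + 0.5 * dt * ki for ni, ki in zip(n, k1)])
    k3 = matvec(a, [ni + 0.5 * dt * ki for ni, ki in zip(n, k2)])
    k4 = matvec(a, [ni + dt * ki for ni, ki in zip(n, k3)])
    return [
        ni + dt / 6.0 * (s1 + 2.0 * s2 + 2.0 * s3 + s4)
        for ni, s1, s2, s3, s4 in zip(n, k1, k2, k3, k4)
    ]

def burnup(a, n0, t_end, steps):
    """Integrate from t = 0 to t_end with a fixed number of RK4 steps."""
    dt = t_end / steps
    n = list(n0)
    for _ in range(steps):
        n = rk4_step(a, n, dt)
    return n

# Hypothetical chain: nuclide 1 decays into nuclide 2, which decays away.
LAMBDA_1, LAMBDA_2 = 0.5, 0.1
A = [[-LAMBDA_1, 0.0],
     [LAMBDA_1, -LAMBDA_2]]

def exact(t):
    """Analytic (Bateman) solution for N(0) = [1, 0], used only as a check."""
    parent = math.exp(-LAMBDA_1 * t)
    daughter = LAMBDA_1 / (LAMBDA_2 - LAMBDA_1) * (
        math.exp(-LAMBDA_1 * t) - math.exp(-LAMBDA_2 * t))
    return [parent, daughter]
```

For this small chain, `burnup(A, [1.0, 0.0], 2.0, 200)` can be compared against `exact(2.0)` to check the integrator before scaling up to a realistic nuclide set.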
Bayer image parallel decoding based on GPU

Science.gov (United States)

Hu, Rihui; Xu, Zhiyong; Wei, Yuxing; Sun, Shaohua

2012-11-01

In the photoelectrical tracking system, the Bayer image is traditionally decompressed on the CPU. However, this is too slow when the images become large, for example, 2K×2K×16bit. In order to accelerate Bayer image decoding, this paper introduces a parallel speedup method for NVIDIA's Graphics Processor Unit (GPU), which supports the CUDA architecture. The decoding procedure can be divided into three parts: the first is the serial part, the second is the task-parallelism part, and the last is the data-parallelism part, which includes inverse quantization, the inverse discrete wavelet transform (IDWT) and image post-processing. To reduce the execution time, the task-parallelism part is optimized with OpenMP techniques. The data-parallelism part gains efficiency by executing on the GPU as a CUDA parallel program. The optimization techniques include instruction optimization, shared memory access optimization, coalesced memory access optimization and texture memory optimization. In particular, the IDWT can be sped up significantly by rewriting the 2D (two-dimensional) serial IDWT into a 1D parallel IDWT. In experiments with a 1K×1K×16bit Bayer image, the data-parallelism part is more than 10 times faster than the CPU-based implementation. Finally, a CPU+GPU heterogeneous decompression system was designed. The experimental results show that it achieves a 3 to 5 times speed increase compared to the CPU serial method.
permGPU: Using graphics processing units in RNA microarray association studies

Directory of Open Access Journals (Sweden)

George Stephen L

2010-06-01

Background: Many analyses of microarray association studies involve permutation, bootstrap resampling and cross-validation, which are ideally formulated as embarrassingly parallel computing problems. Given that these analyses are computationally intensive, scalable approaches that can take advantage of multi-core processor systems need to be developed. Results: We have developed a CUDA based implementation, permGPU, that employs graphics processing units in microarray association studies. We illustrate the performance and applicability of permGPU within the context of permutation resampling for a number of test statistics. An extensive simulation study demonstrates a dramatic increase in performance when using permGPU on an NVIDIA GTX 280 card compared to an optimized C/C++ solution running on a conventional Linux server. Conclusions: permGPU is available as an open-source stand-alone application and as an extension package for the R statistical environment. It provides a dramatic increase in performance for permutation resampling analysis in the context of microarray association studies. The current version offers six test statistics for carrying out permutation resampling analyses for binary, quantitative and censored time-to-event traits.
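
The permutation resampling described above can be sketched for a single gene and a binary trait as follows. This is a plain Python illustration, not permGPU's code or interface: the difference-in-means statistic, the function names, the seed and the default of 2000 permutations are assumptions made for the example. Each permutation (and, in a microarray study, each gene) is evaluated independently of the others, which is why the problem is called embarrassingly parallel.

```python
import random

def mean_difference(values, labels):
    """Difference between the mean of group 1 and the mean of group 0."""
    group1 = [v for v, label in zip(values, labels) if label == 1]
    group0 = [v for v, label in zip(values, labels) if label == 0]
    return sum(group1) / len(group1) - sum(group0) / len(group0)

def permutation_p_value(values, labels, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean_difference(values, labels))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # relabel the samples at random
        if abs(mean_difference(values, shuffled)) >= observed:
            hits += 1
    # Count the observed labelling itself so the estimate is never zero.
    return (hits + 1) / (n_perm + 1)
```

A small p-value indicates that few random relabellings produce a group difference as large as the observed one.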
Visual Media Reasoning - Terrain-based Geolocation

Science.gov (United States)

2015-06-01

the drawings, specifications, or other data does not license the holder or any other person or corporation ; or convey any rights or permission to...3.4 Alternative Metric Investigation This section describes a graphics processor unit (GPU) based implementation in the NVIDIA CUDA programming...utilizing 2 concurrent CPU cores, each controlling a single Nvidia C2075 Tesla Fermi CUDA card. Figure 22 shows a comparison of the CPU and the GPU powered
Medical image processing on the GPU - past, present and future.

Science.gov (United States)

Eklund, Anders; Dufort, Paul; Forsberg, Daniel; LaConte, Stephen M

2013-12-01

Graphics processing units (GPUs) are used today in a wide range of applications, mainly because they can dramatically accelerate parallel computing, are affordable and energy efficient. In the field of medical imaging, GPUs are in some cases crucial for enabling practical use of computationally demanding algorithms. This review presents the past and present work on GPU accelerated medical image processing, and is meant to serve as an overview and introduction to existing GPU implementations. The review covers GPU acceleration of basic image processing operations (filtering, interpolation, histogram estimation and distance transforms), the most commonly used algorithms in medical imaging (image registration, image segmentation and image denoising) and algorithms that are specific to individual modalities (CT, PET, SPECT, MRI, fMRI, DTI, ultrasound, optical imaging and microscopy). The review ends by highlighting some future possibilities and challenges. Copyright © 2013 Elsevier B.V. All rights reserved.
Two-Layer 16 Tesla Cosθ Dipole Design for the FCC

Energy Technology Data Exchange (ETDEWEB)

Holik, Eddie Frank [Fermilab; Ambrosio, Giorgio [Fermilab; Apollinari, G. [Fermilab

2018-02-13

The Future Circular Collider or FCC is a study aimed at exploring the possibility to reach 100 TeV total collision energy which would require 16 tesla dipoles. Upon the conclusion of the High Luminosity Upgrade, the US LHC Accelerator Upgrade Project in collaboration with CERN will have extensive Nb3Sn magnet fabrication experience. This experience includes robust Nb3Sn conductor and insulation scheming, 2-layer cos2θ coil fabrication, and bladder-and-key structure and assembly. By making improvements and modifications to existing technology, the feasibility of a two-layer 16 tesla dipole is investigated. Preliminary designs indicate that fields up to 16.6 tesla are feasible with conductor grading while satisfying the HE-LHC and FCC specifications. Key challenges include accommodating high-aspect ratio conductor, narrow wedge design, Nb3Sn conductor grading, and especially quench protection of a 16 tesla device.

Synthetic Aperture Beamformation using the GPU

DEFF Research Database (Denmark)

Hansen, Jens Munk; Schaa, Dana; Jensen, Jørgen Arendt

2011-01-01

A synthetic aperture ultrasound beamformer is implemented for a GPU using the OpenCL framework. The implementation supports beamformation of either RF signals or complex baseband signals. Transmit and receive apodization can be either parametric or dynamic using a fixed F-number, a reference...
GPU-Boosted Camera-Only Indoor Localization

DEFF Research Database (Denmark)

Özkil, Ali Gürcan; Fan, Zhun; Kristensen, Jens Klæstrup

relies on local image features detection, description and matching; by parallelizing these computationally intensive tasks on the graphical processing unit (GPU), it is possible to do online localization using a Topometric Appearance Map. The method is developed as an integral part of a mobile service...
CMFD and GPU acceleration on method of characteristics for hexagonal cores

International Nuclear Information System (INIS)

Han, Yu; Jiang, Xiaofeng; Wang, Dezhong

2014-01-01

Highlights: • A merged hex-mesh CMFD method solved via tri-diagonal matrix inversion. • Alternative hardware acceleration using an inexpensive GPU. • A hex-core benchmark with solution to confirm two acceleration methods. - Abstract: Coarse Mesh Finite Difference (CMFD) has been widely adopted as an effective way to accelerate the source iteration of transport calculation. However, in a core with hexagonal assemblies there are non-hexagonal meshes around the edges of assemblies, causing a problem for CMFD if the CMFD equations are still to be solved via tri-diagonal matrix inversion by simply scanning the whole core meshes in different directions. To solve this problem, we propose an unequal mesh CMFD formulation that combines the non-hexagonal cells on the boundary of neighboring assemblies into non-regular hexagonal cells. We also investigated the alternative hardware acceleration of using graphics processing units (GPU) with a graphics card in a personal computer. The tool CUDA is employed, which is a parallel computing platform and programming model invented by the company NVIDIA for harnessing the power of GPU. To investigate and implement these two acceleration methods, a 2-D hexagonal core transport code using the method of characteristics (MOC) is developed. A hexagonal mini-core benchmark problem is established to confirm the accuracy of the MOC code and to assess the effectiveness of CMFD and GPU parallel acceleration. For this benchmark problem, the CMFD acceleration increases the speed 16 times while the GPU acceleration speeds it up 25 times. When used simultaneously, they provide a speed gain of 292 times.
CMFD and GPU acceleration on method of characteristics for hexagonal cores

Energy Technology Data Exchange (ETDEWEB)

Han, Yu, E-mail: hanyu1203@gmail.com [School of Nuclear Science and Engineering, Shanghai Jiaotong University, Shanghai 200240 (China); Jiang, Xiaofeng [Shanghai NuStar Nuclear Power Technology Co., Ltd., No. 81 South Qinzhou Road, XuJiaHui District, Shanghai 200000 (China); Wang, Dezhong [School of Nuclear Science and Engineering, Shanghai Jiaotong University, Shanghai 200240 (China)

2014-12-15

Highlights: • A merged hex-mesh CMFD method solved via tri-diagonal matrix inversion. • Alternative hardware acceleration using an inexpensive GPU. • A hex-core benchmark with solution to confirm two acceleration methods. - Abstract: Coarse Mesh Finite Difference (CMFD) has been widely adopted as an effective way to accelerate the source iteration of transport calculation. However, in a core with hexagonal assemblies there are non-hexagonal meshes around the edges of assemblies, causing a problem for CMFD if the CMFD equations are still to be solved via tri-diagonal matrix inversion by simply scanning the whole core meshes in different directions. To solve this problem, we propose an unequal mesh CMFD formulation that combines the non-hexagonal cells on the boundary of neighboring assemblies into non-regular hexagonal cells. We also investigated the alternative hardware acceleration of using graphics processing units (GPU) with a graphics card in a personal computer. The tool CUDA is employed, which is a parallel computing platform and programming model invented by the company NVIDIA for harnessing the power of GPU. To investigate and implement these two acceleration methods, a 2-D hexagonal core transport code using the method of characteristics (MOC) is developed. A hexagonal mini-core benchmark problem is established to confirm the accuracy of the MOC code and to assess the effectiveness of CMFD and GPU parallel acceleration. For this benchmark problem, the CMFD acceleration increases the speed 16 times while the GPU acceleration speeds it up 25 times. When used simultaneously, they provide a speed gain of 292 times.
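
The abstract mentions solving the CMFD equations "via tri-diagonal matrix inversion" along each scan direction. A common way to solve one such tridiagonal system is the Thomas algorithm, sketched below in Python. Whether the authors used this exact scheme is an assumption; the function name, the argument layout and the 4-cell test system are made up for illustration and do not represent a CMFD discretization.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm.

    Row i reads: lower[i]*x[i-1] + diag[i]*x[i] + upper[i]*x[i+1] = rhs[i].
    lower[0] and upper[-1] are ignored. No pivoting is done, so the matrix
    is assumed to be diagonally dominant or otherwise well behaved.
    """
    n = len(diag)
    c = [0.0] * n  # modified upper diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    # Forward sweep: eliminate the lower diagonal.
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    # Back substitution.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x
```

For example, `solve_tridiagonal([0, -1, -1, -1], [2, 2, 2, 2], [-1, -1, -1, 0], [0, 0, 0, 5])` solves a 4-cell system whose exact solution is x = [1, 2, 3, 4]. The cost is linear in the number of cells, which is what makes line-by-line sweeps over a mesh inexpensive.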
MRI safety of a programmable shunt assistant at 3 and 7 Tesla.

Science.gov (United States)

Mirzayan, M Javad; Klinge, Petra M; Samii, Madjid; Goetz, Friedrich; Krauss, Joachim K

2012-06-01

Several new shunt technologies have been developed to optimize hydrocephalus treatment within the past few years. Overdrainage, however, still remains an unresolved problem. One new technology which may reduce the frequency of this complication is the use of a programmable shunt assistant (proSA). Inactive in a horizontal position, it impedes CSF flow in a vertical position according to a prescribed pressure level ranging from 0 to 40 cm H₂O. We exposed the proSA valve in an ex vivo protocol to MR systems operating at 3 and 7 Tesla to investigate its MRI safety. Following 3 Tesla exposure, no changes in valve settings were noted. Adjustment to any pressure level was possible thereafter. The mean deflection angle was 23 ± 3°. After exposure to 7 Tesla, however, there were unintended pressure changes, and the mechanism for further adjustment of the valves even disintegrated. According to the results of this study, proSA is safe with heteropolar vertical magnet alignment at 3 Tesla. Following 7 Tesla exposure, the valves lost their functional capability.
High-speed optical coherence tomography signal processing on GPU

International Nuclear Information System (INIS)

Li Xiqi; Shi Guohua; Zhang Yudong

2011-01-01

The signal processing speed of spectral domain optical coherence tomography (SD-OCT) has become a bottleneck in many medical applications. Recently, a time-domain interpolation method was proposed. Compared with the widely used zero-padding interpolation method, it achieves both a better signal-to-noise ratio (SNR) and a shorter signal processing time for SD-OCT. Furthermore, the re-sampled data are obtained by convolving the acquired data with the coefficients in the time domain, so many interpolations can be performed concurrently, which makes this interpolation method suitable for parallel computing. Ultra-high-speed optical coherence tomography signal processing can be realized by using a graphics processing unit (GPU) with the compute unified device architecture (CUDA). This paper introduces the signal processing steps of SD-OCT on the GPU. An experiment is performed to acquire one frame of SD-OCT data (400 A-lines × 2048 pixels per A-line) and process the data in real time on the GPU. The results show that the processing can be finished in 6.208 milliseconds, which is 37 times faster than on a central processing unit (CPU).
A Study on GPU-based Iterative ML-EM Reconstruction Algorithm for Emission Computed Tomographic Imaging Systems

Energy Technology Data Exchange (ETDEWEB)

Ha, Woo Seok; Kim, Soo Mee; Park, Min Jae; Lee, Dong Soo; Lee, Jae Sung [Seoul National University, Seoul (Korea, Republic of)

2009-10-15

Maximum likelihood-expectation maximization (ML-EM) is a statistical reconstruction algorithm derived from a probabilistic model of the emission and detection processes. Although ML-EM has many advantages in accuracy and utility, its use is limited by the computational burden of iterative processing on a CPU (central processing unit). In this study, we developed a parallel computing technique on a GPU (graphics processing unit) for the ML-EM algorithm. Using a GeForce 9800 GTX+ graphics card and CUDA (compute unified device architecture), NVIDIA's parallel computing technology, the projection and backprojection in the ML-EM algorithm were parallelized. The computation times for the projection, the errors between measured and estimated data, and the backprojection in an iteration were measured. The total time included the latency in data transmission between RAM and GPU memory. The total computation times of the CPU- and GPU-based ML-EM with 32 iterations were 3.83 and 0.26 sec, respectively. In this case, the computing speed was improved about 15 times on the GPU. When the number of iterations increased to 1024, the CPU- and GPU-based computing took 18 min and 8 sec in total, respectively. The improvement was about 135 times and was caused by delay in the CPU-based computing after a certain number of iterations. On the other hand, the GPU-based computation showed very small variation in time per iteration due to the use of shared memory. The GPU-based parallel computation for ML-EM significantly improved the computing speed and stability. The developed GPU-based ML-EM algorithm could be easily modified for other imaging geometries.
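The three steps timed in this abstract (projection, comparison of measured and estimated data, and backprojection) correspond to the standard ML-EM update. The sketch below is a plain Python illustration of that textbook update with a tiny dense system matrix; the function name, the matrix layout and the toy data are assumptions made here and do not reproduce the authors' CUDA implementation.

```python
def mlem(system, measured, iterations=32):
    """Textbook ML-EM reconstruction.

    system[i][j] is the probability that an emission from pixel j
    is detected in bin i; measured[i] is the count in bin i.
    """
    n_bins, n_pixels = len(system), len(system[0])
    # Sensitivity: total detection probability of each pixel.
    sensitivity = [sum(system[i][j] for i in range(n_bins)) for j in range(n_pixels)]
    image = [1.0] * n_pixels
    for _ in range(iterations):
        # Projection: estimate the data from the current image.
        estimated = [
            sum(system[i][j] * image[j] for j in range(n_pixels))
            for i in range(n_bins)
        ]
        # Error term: ratio of measured to estimated data.
        ratio = [
            measured[i] / estimated[i] if estimated[i] > 0.0 else 0.0
            for i in range(n_bins)
        ]
        # Backprojection of the ratio, then multiplicative update.
        image = [
            image[j]
            * sum(system[i][j] * ratio[i] for i in range(n_bins))
            / sensitivity[j]
            if sensitivity[j] > 0.0
            else 0.0
            for j in range(n_pixels)
        ]
    return image


# Toy problem: two pixels seen by three detector bins (illustrative values).
toy_system = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
toy_counts = [2.0, 4.0, 3.0]
reconstruction = mlem(toy_system, toy_counts, iterations=50)
```

Both the projection and the backprojection are sums over independent bins or pixels, which is the part a GPU implementation would spread across threads.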
Benchmarking and Evaluating Unified Memory for OpenMP GPU Offloading

Energy Technology Data Exchange (ETDEWEB)

Mishra, Alok [Stony Brook Univ., Stony Brook, NY (United States); Li, Lingda [Brookhaven National Lab. (BNL), Upton, NY (United States); Kong, Martin [Brookhaven National Lab. (BNL), Upton, NY (United States); Finkel, Hal [Argonne National Lab. (ANL), Argonne, IL (United States); Chapman, Barbara [Stony Brook Univ., Stony Brook, NY (United States); Brookhaven National Lab. (BNL), Upton, NY (United States)

2017-01-01

Here, the latest OpenMP standard offers automatic device offloading capabilities which facilitate GPU programming. Despite this, many challenges remain. One of these is the unified memory feature introduced in recent GPUs. GPUs in current and future HPC systems have enhanced support for a unified memory space. In such systems, the CPU and GPU can access each other's memory transparently; that is, the data movement is managed automatically by the underlying system software and hardware. Memory oversubscription is also possible in these systems. However, there is a significant lack of knowledge about how this mechanism will perform and how programmers should use it. We have modified several benchmark codes in the Rodinia benchmark suite to study the behavior of OpenMP accelerator extensions, and have used them to explore the impact of unified memory in an OpenMP context. We have also modified the open source LLVM compiler to allow OpenMP programs to exploit unified memory. The results of our evaluation reveal that, while the performance of unified memory is comparable with that of normal GPU offloading for benchmarks with little data reuse, it suffers from significant overhead when GPU memory is oversubscribed for benchmarks with a large amount of data reuse. Based on these results, we provide several guidelines for programmers to achieve better performance with unified memory.
Integrative multicellular biological modeling: a case study of 3D epidermal development using GPU algorithms

Directory of Open Access Journals (Sweden)

Christley Scott

2010-08-01

Abstract Background: Simulation of sophisticated biological models requires considerable computational power. These models typically integrate together numerous biological phenomena such as spatially-explicit heterogeneous cells, cell-cell interactions, cell-environment interactions and intracellular gene networks. The recent advent of programming for graphical processing units (GPU) opens up the possibility of developing more integrative, detailed and predictive biological models while at the same time decreasing the computational cost to simulate those models. Results: We construct a 3D model of epidermal development and provide a set of GPU algorithms that executes significantly faster than sequential central processing unit (CPU) code. We provide a parallel implementation of the subcellular element method for individual cells residing in a lattice-free spatial environment. Each cell in our epidermal model includes an internal gene network, which integrates cellular interaction of Notch signaling together with environmental interaction of basement membrane adhesion, to specify cellular state and behaviors such as growth and division. We take a pedagogical approach to describing how modeling methods are efficiently implemented on the GPU including memory layout of data structures and functional decomposition. We discuss various programmatic issues and provide a set of design guidelines for GPU programming that are instructive to avoid common pitfalls as well as to extract performance from the GPU architecture. Conclusions: We demonstrate that GPU algorithms represent a significant technological advance for the simulation of complex biological models. We further demonstrate with our epidermal model that the integration of multiple complex modeling methods for heterogeneous multicellular biological processes is both feasible and computationally tractable using this new technology. We hope that the provided algorithms and source code will be a
Calculation of secondary capacitance of compact Tesla pulse transformer

International Nuclear Information System (INIS)

Yu Binxiong; Liu Jinliang

2013-01-01

An analytic expression for the secondary capacitance of a compact Tesla pulse transformer is derived. Results calculated with the expression show that two parts contribute to the secondary capacitance, namely the capacitance between the inner and outer magnetic cores and the attached capacitance caused by the secondary winding. The attached capacitance equals the capacitance of a coaxial line that is as long as the secondary coil and whose outer and inner diameters equal the inner diameter of the outer magnetic core and the outer diameter of the inner magnetic core, respectively. A circuit model for analyzing the compact Tesla transformer is built, and numerical calculation shows that the expression for the secondary capacitance is correct. In addition, a small compact Tesla transformer is developed and tested. The test results confirm the results calculated with the derived expression. (authors)
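The attached capacitance described above is that of a coaxial line, for which the standard formula is C = 2πε₀εᵣL / ln(D_outer / D_inner). A small Python sketch of that formula follows; the function name, the parameter names and the example dimensions are assumptions made for illustration, and the abstract does not give the dielectric constant or the actual geometry.

```python
import math

EPSILON_0 = 8.8541878128e-12  # vacuum permittivity in F/m


def coaxial_capacitance(length_m, inner_diameter_m, outer_diameter_m,
                        relative_permittivity=1.0):
    """Capacitance in farads of a coaxial line of the given length.

    For the transformer described above, inner_diameter_m would be the
    outer diameter of the inner magnetic core, outer_diameter_m the inner
    diameter of the outer magnetic core, and length_m the coil length.
    """
    if not 0.0 < inner_diameter_m < outer_diameter_m:
        raise ValueError("require 0 < inner diameter < outer diameter")
    return (2.0 * math.pi * EPSILON_0 * relative_permittivity * length_m
            / math.log(outer_diameter_m / inner_diameter_m))


# Hypothetical dimensions, chosen only to exercise the formula.
example_capacitance = coaxial_capacitance(
    length_m=0.5, inner_diameter_m=0.10, outer_diameter_m=0.20,
    relative_permittivity=2.3,
)
```

Only the ratio of the two diameters matters, so the result is the same whether diameters or radii are supplied.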
Length-Bounded Hybrid CPU/GPU Pattern Matching Algorithm for Deep Packet Inspection

Directory of Open Access Journals (Sweden)

Yi-Shan Lin

2017-01-01

Since frequent communication between applications takes place in high speed networks, deep packet inspection (DPI) plays an important role in network application awareness. The signature-based network intrusion detection system (NIDS) contains a DPI technique that examines the incoming packet payloads by employing a pattern matching algorithm that dominates the overall inspection performance. Existing studies focused on implementing efficient pattern matching algorithms by parallel programming on software platforms because of the advantages of lower cost and higher scalability. Either the central processing unit (CPU) or the graphic processing unit (GPU) was involved. Our studies focused on designing a pattern matching algorithm based on the cooperation between both CPU and GPU. In this paper, we present an enhanced design for our previous work, a length-bounded hybrid CPU/GPU pattern matching algorithm (LHPMA). In the preliminary experiment, the performance and comparison with the previous work are displayed, and the experimental results show that the LHPMA can achieve not only effective CPU/GPU cooperation but also higher throughput than the previous method.
High-throughput GPU-based LDPC decoding

Science.gov (United States)

Chang, Yang-Lang; Chang, Cheng-Chun; Huang, Min-Yu; Huang, Bormin

2010-08-01

Low-density parity-check (LDPC) code is a linear block code known to approach the Shannon limit via the iterative sum-product algorithm. LDPC codes have been adopted in most current communication systems such as DVB-S2, WiMAX, Wi-Fi and 10GBASE-T. The need for reliable and flexible communication links across a wide variety of communication standards and configurations has created demand for high-performance and flexible computing. Accordingly, finding a fast and reconfigurable development platform for designing high-throughput LDPC decoders has become important, especially for rapidly changing communication standards and configurations. In this paper, a new graphics processing unit (GPU) LDPC decoding platform with asynchronous data transfer is proposed to realize this practical implementation. Experimental results showed that the proposed GPU-based decoder achieved a 271x speedup compared to its CPU-based counterpart. It can serve as a high-throughput LDPC decoder.
Aerodynamic optimization of supersonic compressor cascade using differential evolution on GPU

Energy Technology Data Exchange (ETDEWEB)

Aissa, Mohamed Hasanine; Verstraete, Tom [Von Karman Institute for Fluid Dynamics (VKI) 1640 Sint-Genesius-Rode (Belgium); Vuik, Cornelis [Delft University of Technology 2628 CD Delft (Netherlands)

2016-06-08

Differential Evolution (DE) is a powerful stochastic optimization method. Compared to gradient-based algorithms, DE is able to avoid local minima but at the same time requires more function evaluations. In turbomachinery applications, function evaluations are performed with time-consuming CFD simulations, which results in a long, unaffordable design cycle. Modern High Performance Computing systems, especially Graphic Processing Units (GPUs), are able to alleviate this inconvenience by accelerating the design evaluation itself. In this work we present a validated CFD solver running on GPUs, able to accelerate the design evaluation and thus the entire design process. An achieved speedup of 20x to 30x enabled the DE algorithm to run on a high-end computer instead of a costly large cluster. The GPU-enhanced DE was used to optimize the aerodynamics of a supersonic compressor cascade, achieving an aerodynamic loss minimization of 20%.
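To make the structure of DE concrete, the sketch below implements the common DE/rand/1/bin variant on a cheap analytic objective. It is an illustration only: the abstract does not say which DE variant or control parameters the authors used, so the variant, the `weight` and `crossover` values, the in-place population update and the test function are all assumptions. In the paper's setting, the call to `objective` is the expensive CFD evaluation that the GPU solver accelerates.

```python
import random


def differential_evolution(objective, bounds, pop_size=20, weight=0.8,
                           crossover=0.9, generations=200, seed=0):
    """Minimise `objective` over box `bounds` with DE/rand/1/bin."""
    rng = random.Random(seed)
    dim = len(bounds)
    population = [[rng.uniform(lo, hi) for lo, hi in bounds]
                  for _ in range(pop_size)]
    costs = [objective(member) for member in population]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct members other than the target i.
            a, b, c = rng.sample([k for k in range(pop_size) if k != i], 3)
            forced = rng.randrange(dim)  # at least one mutated coordinate
            trial = []
            for d in range(dim):
                if d == forced or rng.random() < crossover:
                    value = (population[a][d]
                             + weight * (population[b][d] - population[c][d]))
                    lo, hi = bounds[d]
                    value = min(max(value, lo), hi)  # clip to the box
                else:
                    value = population[i][d]
                trial.append(value)
            trial_cost = objective(trial)  # the costly step in practice
            if trial_cost <= costs[i]:     # greedy selection
                population[i], costs[i] = trial, trial_cost
    best = min(range(pop_size), key=costs.__getitem__)
    return population[best], costs[best]


def shifted_sphere(point):
    """Cheap stand-in objective with its minimum at (1, -2)."""
    return (point[0] - 1.0) ** 2 + (point[1] + 2.0) ** 2


best_point, best_cost = differential_evolution(
    shifted_sphere, [(-5.0, 5.0), (-5.0, 5.0)], generations=300)
```

Every generation needs `pop_size` objective evaluations, which shows why the per-evaluation cost dominates the design cycle.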
TU-FG-BRB-07: GPU-Based Prompt Gamma Ray Imaging From Boron Neutron Capture Therapy

Energy Technology Data Exchange (ETDEWEB)

Kim, S; Suh, T; Yoon, D; Jung, J; Shin, H; Kim, M [The Catholic University of Korea, Seoul (Korea, Republic of)

2016-06-15

Purpose: The purpose of this research is to perform the fast reconstruction of a prompt gamma ray image using a graphics processing unit (GPU) computation from boron neutron capture therapy (BNCT) simulations. Methods: To evaluate the accuracy of the reconstructed image, a phantom including four boron uptake regions (BURs) was used in the simulation. After the Monte Carlo simulation of the BNCT, the modified ordered subset expectation maximization reconstruction algorithm using the GPU computation was used to reconstruct the images with fewer projections. The computation times for image reconstruction were compared between the GPU and the central processing unit (CPU). Also, the accuracy of the reconstructed image was evaluated by a receiver operating characteristic (ROC) curve analysis. Results: The image reconstruction time using the GPU was 196 times faster than the conventional reconstruction time using the CPU. For the four BURs, the area under curve values from the ROC curve were 0.6726 (A-region), 0.6890 (B-region), 0.7384 (C-region), and 0.8009 (D-region). Conclusion: The tomographic image using the prompt gamma ray event from the BNCT simulation was acquired using the GPU computation in order to perform a fast reconstruction during treatment. The authors verified the feasibility of the prompt gamma ray reconstruction using the GPU computation for BNCT simulations.
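The area-under-curve values quoted above summarise an ROC analysis. One standard way to compute such a value, shown here as a hedged Python sketch, is the pairwise (Mann-Whitney) form: the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half. The function name and the toy scores are assumptions for illustration; the abstract does not describe how the authors computed their values.

```python
def roc_auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    if not positive_scores or not negative_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for positive in positive_scores:
        for negative in negative_scores:
            if positive > negative:
                wins += 1.0
            elif positive == negative:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Toy scores: three of the four positive/negative pairs are ordered correctly.
toy_auc = roc_auc([0.8, 0.4], [0.6, 0.2])  # 0.75
```

A value of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation, which gives a scale for reading the four regional values reported above.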
77 FR 2269 - Foreign-Trade Zone 18-San Jose, CA, Application for Subzone, Tesla Motors, Inc. (Electric...

Science.gov (United States)

2012-01-17

..., CA, Application for Subzone, Tesla Motors, Inc. (Electric Passenger Vehicles), Palo Alto and Fremont... passenger- vehicle manufacturing facilities of Tesla Motors, Inc. (Tesla), located in Palo Alto and Fremont... January 10, 2012. The Tesla facilities (currently employing over 1,000 workers) consist of two sites: Site...
Accelerating Pseudo-Random Number Generator for MCNP on GPU

Science.gov (United States)

Gong, Chunye; Liu, Jie; Chi, Lihua; Hu, Qingfeng; Deng, Li; Gong, Zhenghu

2010-09-01

Pseudo-random number generators (PRNGs) are used intensively in many stochastic algorithms in particle simulations, artificial neural networks and other scientific computation. The PRNG in the Monte Carlo N-Particle Transport Code (MCNP) must have a long period, high quality and a flexible jump capability, and must be fast enough. In this paper, we implement such a PRNG for MCNP on NVIDIA's GTX200 graphics processing units (GPUs) using the CUDA programming model. Results show that a speedup of 3.80 to 8.10 times is achieved compared with 4- to 6-core CPUs, and more than 679.18 million double-precision random numbers can be generated per second on the GPU.
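The "flexible jump" requirement means a generator can be advanced by an arbitrary number of steps without producing the intermediate values, so that each particle history or GPU thread can start at its own offset in the sequence. For a linear congruential generator this can be done in logarithmic time by repeatedly squaring the update map. The sketch below illustrates that technique; the multiplier and increment are the widely published 64-bit constants attributed to Knuth's MMIX and are used purely as an example, not as MCNP's actual parameters.

```python
MODULUS = 2 ** 64
MULTIPLIER = 6364136223846793005   # illustrative constants, not MCNP's
INCREMENT = 1442695040888963407


def lcg_next(state):
    """Advance the generator by a single step."""
    return (MULTIPLIER * state + INCREMENT) % MODULUS


def lcg_jump(state, steps):
    """Advance the generator by `steps` steps in O(log steps) time.

    The affine map x -> a*x + c is composed with itself by repeated
    squaring, accumulating the maps selected by the bits of `steps`.
    """
    acc_mult, acc_inc = 1, 0                    # identity map
    cur_mult, cur_inc = MULTIPLIER, INCREMENT   # one-step map
    while steps > 0:
        if steps & 1:
            acc_mult = (acc_mult * cur_mult) % MODULUS
            acc_inc = (acc_inc * cur_mult + cur_inc) % MODULUS
        # Square the current map: apply it twice.
        cur_inc = ((cur_mult + 1) * cur_inc) % MODULUS
        cur_mult = (cur_mult * cur_mult) % MODULUS
        steps >>= 1
    return (acc_mult * state + acc_inc) % MODULUS


# Give each of four hypothetical threads a start 1000 steps apart.
thread_starts = [lcg_jump(12345, 1000 * thread) for thread in range(4)]
```

Because the jump is cheap, the starting states for many threads can be computed up front on the host or independently inside each thread.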
Cervical external immobilization devices: evaluation of magnetic resonance imaging issues at 3.0 Tesla.

Science.gov (United States)

Diaz, Francis L; Tweardy, Lisa; Shellock, Frank G

2010-02-15

Laboratory investigation, ex vivo. Currently, no studies have addressed the magnetic resonance imaging (MRI) issues for cervical external immobilization devices at 3-Tesla. Under certain conditions significant heating may occur, resulting in patient burns. Furthermore, artifacts can be substantial and prevent the diagnostic use of MRI. Therefore, the objective of this investigation was to evaluate MRI issues for 4 different cervical external immobilization devices at 3-Tesla. Excessive heating and substantial artifacts are 2 potential complications associated with performing MRI at 3-Tesla in patients with cervical external immobilization devices. Using ex vivo testing techniques, MRI-related heating and artifacts were evaluated for 4 different cervical devices during MRI at 3-Tesla. Four cervical external immobilization devices (Generation 80, Resolve Ring and Superstructure, Resolve Ring and Jerome Vest/Jerome Superstructure, and the V1 Halo System; Ossur Americas, Aliso Viejo, CA) underwent MRI testing at 3-Tesla. All devices were made from nonmetallic or nonmagnetic materials. Heating was determined using a gelled-saline-filled skull phantom with fluoroptic thermometry probes attached to the skull pins. MRI was performed at 3-Tesla, using a high level of RF energy. Artifacts were assessed at 3-Tesla, using standard cervical imaging techniques. The Generation 80 and V1 Halo devices exhibited substantial temperature rises (11.6 degrees C and 8.5 degrees C, respectively), with "sparking" evident for the Generation 80 during the MRI procedure. Artifacts were problematic for these devices, as well. By comparison, the 2 Resolve Ring-based cervical external immobilization devices showed little or no heating (Tesla.
GPU accelerated generation of digitally reconstructed radiographs for 2-D/3-D image registration.

Science.gov (United States)

Dorgham, Osama M; Laycock, Stephen D; Fisher, Mark H

2012-09-01

Recent advances in programming languages for graphics processing units (GPUs) provide developers with a convenient way of implementing applications which can be executed on the CPU and GPU interchangeably. GPUs are becoming relatively cheap, powerful, and widely available hardware components, which can be used to perform intensive calculations. The last decade of hardware performance developments shows that GPU-based computation is progressing significantly faster than CPU-based computation, particularly if one considers the execution of highly parallelisable algorithms. Future predictions illustrate that this trend is likely to continue. In this paper, we introduce a way of accelerating 2-D/3-D image registration by developing a hybrid system which executes on the CPU and utilizes the GPU for parallelizing the generation of digitally reconstructed radiographs (DRRs). Based on the advancements of the GPU over the CPU, it is timely to exploit the benefits of many-core GPU technology by developing algorithms for DRR generation. Although some previous work has investigated the rendering of DRRs using the GPU, this paper investigates approximations which reduce the computational overhead while still maintaining a quality consistent with that needed for 2-D/3-D registration with sufficient accuracy to be clinically acceptable in certain applications of radiation oncology. Furthermore, by comparing implementations of 2-D/3-D registration on the CPU and GPU, we investigate current performance and propose an optimal framework for PC implementations addressing the rigid registration problem. Using this framework, we are able to render DRR images from a 256×256×133 CT volume in ~24 ms using an NVidia GeForce 8800 GTX and in ~2 ms using NVidia GeForce GTX 580. In addition to applications requiring fast automatic patient setup, these levels of performance suggest image-guided radiation therapy at video frame rates is technically feasible using relatively low cost PC
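At its core, a DRR is a set of line integrals of attenuation through the CT volume, converted to intensity with the Beer-Lambert law. The sketch below shows that idea in its simplest form, with parallel rays along one volume axis. This is a deliberate simplification and an assumption of this example: clinical DRRs use perspective rays from a point source with interpolation along each ray, and the paper's specific approximations are not reproduced here. The function name and the toy volume are likewise illustrative.

```python
import math


def render_parallel_drr(volume, voxel_size_mm=1.0):
    """Render a DRR from volume[z][y][x] with rays travelling along z.

    Voxel values are linear attenuation coefficients per millimetre.
    Each output pixel depends only on its own ray, so on a GPU one
    thread per pixel is the natural mapping.
    """
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    image = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            line_integral = voxel_size_mm * sum(
                volume[z][y][x] for z in range(depth))
            image[y][x] = math.exp(-line_integral)  # Beer-Lambert law
    return image


# Tiny illustrative volume: 2 slices, 1 row, 2 columns.
toy_volume = [[[0.0, 0.1]], [[0.0, 0.2]]]
toy_drr = render_parallel_drr(toy_volume)
```

The work per pixel grows with the number of samples along the ray, which is why reducing the sampling effort is a typical target for approximation.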
The Legacy of Nikola Tesla

Indian Academy of Sciences (India)

The Legacy of Nikola Tesla - AC Power System and its Growth in India. D P Sen Gupta. Resonance – Journal of Science Education, General Article, Volume 12, Issue 4, April 2007, pp 69-79.

Clinical implementation of a GPU-based simplified Monte Carlo method for a treatment planning system of proton beam therapy

International Nuclear Information System (INIS)

Kohno, R; Hotta, K; Nishioka, S; Matsubara, K; Tansho, R; Suzuki, T

2011-01-01

We implemented the simplified Monte Carlo (SMC) method on graphics processing unit (GPU) architecture under the compute unified device architecture (CUDA) platform developed by NVIDIA. The GPU-based SMC was clinically applied for four patients with head and neck, lung, or prostate cancer. The results were compared to those obtained by a traditional CPU-based SMC with respect to the computation time and discrepancy. In the CPU- and GPU-based SMC calculations, the estimated mean statistical errors of the calculated doses in the planning target volume region were within 0.5% rms. The dose distributions calculated by the GPU- and CPU-based SMCs were similar, within statistical errors. The GPU-based SMC showed 12.30–16.00 times faster performance than the CPU-based SMC. The computation time per beam arrangement using the GPU-based SMC for the clinical cases ranged from 9 to 67 s. The results demonstrate the successful application of the GPU-based SMC to clinical proton treatment planning. (note)
The development of GPU-based parallel PRNG for Monte Carlo applications in CUDA Fortran

Energy Technology Data Exchange (ETDEWEB)

Kargaran, Hamed, E-mail: h-kargaran@sbu.ac.ir; Minuchehr, Abdolhamid; Zolfaghari, Ahmad [Department of nuclear engineering, Shahid Behesti University, Tehran, 1983969411 (Iran, Islamic Republic of)

2016-04-15

The implementation of Monte Carlo simulation in CUDA Fortran requires fast random number generation with good statistical properties on the GPU. In this study, a GPU-based parallel pseudo random number generator (GPPRNG) is proposed for use in high performance computing systems. According to the type of GPU memory usage, the GPU scheme is divided into two work modes, GLOBAL-MODE and SHARED-MODE. To generate parallel random numbers based on the independent sequence method, a combination of the middle-square method and a chaotic map along with the Xorshift PRNG has been employed. Implementation of our developed PPRNG on a single GPU showed speedups of 150x and 470x (with respect to the speed of the PRNG on a single CPU core) for GLOBAL-MODE and SHARED-MODE, respectively. To evaluate the accuracy of our developed GPPRNG, its performance was compared to that of some other commercially available PPRNGs, such as MATLAB, FORTRAN and the Miller-Park algorithm, by employing specific standard tests. The results of this comparison showed that the GPPRNG developed in this study can be used as a fast and accurate tool for computational science applications.
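Of the ingredients named in this abstract, the Xorshift generator is the easiest to show in isolation. The sketch below implements Marsaglia's 32-bit Xorshift with the shift triple (13, 17, 5) and runs one stream per hypothetical GPU thread, each from its own seed. It is an illustration written in Python rather than CUDA Fortran; the abstract does not state which Xorshift variant the authors used, and their middle-square and chaotic-map components are not reproduced here. The seed values and function names are assumptions of this example.

```python
MASK32 = 0xFFFFFFFF


def xorshift32(state):
    """One step of 32-bit Xorshift (shifts 13, 17, 5); state must be nonzero."""
    state ^= (state << 13) & MASK32
    state ^= state >> 17
    state ^= (state << 5) & MASK32
    return state


def uniform_stream(seed, count):
    """Return `count` floats in [0, 1) from a single Xorshift stream."""
    state = seed & MASK32
    if state == 0:
        raise ValueError("seed must be nonzero")
    values = []
    for _ in range(count):
        state = xorshift32(state)
        values.append(state / 2 ** 32)
    return values


# One stream per hypothetical thread, each started from its own seed.
thread_seeds = [1, 2463534242, 88675123, 521288629]
thread_streams = [uniform_stream(seed, 4) for seed in thread_seeds]

first_output = xorshift32(1)  # 270369
```

Distinct seeds give distinct streams, but that alone does not guarantee statistical independence between them, which is why the quality of the seeding scheme matters for parallel use.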
GPU-accelerated Kernel Regression Reconstruction for Freehand 3D Ultrasound Imaging.

Science.gov (United States)

Wen, Tiexiang; Li, Ling; Zhu, Qingsong; Qin, Wenjian; Gu, Jia; Yang, Feng; Xie, Yaoqin

2017-07-01

Volume reconstruction method plays an important role in improving reconstructed volumetric image quality for freehand three-dimensional (3D) ultrasound imaging. By utilizing the capability of programmable graphics processing unit (GPU), we can achieve a real-time incremental volume reconstruction at a speed of 25-50 frames per second (fps). After incremental reconstruction and visualization, hole-filling is performed on GPU to fill remaining empty voxels. However, traditional pixel nearest neighbor-based hole-filling fails to reconstruct volume with high image quality. On the contrary, the kernel regression provides an accurate volume reconstruction method for 3D ultrasound imaging but with the cost of heavy computational complexity. In this paper, a GPU-based fast kernel regression method is proposed for high-quality volume after the incremental reconstruction of freehand ultrasound. The experimental results show that improved image quality for speckle reduction and details preservation can be obtained with the parameter setting of kernel window size of [Formula: see text] and kernel bandwidth of 1.0. The computational performance of the proposed GPU-based method can be over 200 times faster than that on central processing unit (CPU), and the volume with size of 50 million voxels in our experiment can be reconstructed within 10 seconds.
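Hole-filling by kernel regression can be illustrated with the zeroth-order (Nadaraya-Watson) estimator: each empty voxel receives a Gaussian-weighted average of the known voxels inside a window around it. The Python sketch below works on a 2-D grid for brevity and is only a sketch under stated assumptions: the order of the regression used in the paper is not given in the abstract, the window size there is lost in the source ("[Formula: see text]"), so the `window=3` default is an assumption, while the bandwidth default of 1.0 follows the abstract.

```python
import math


def fill_holes(grid, window=3, bandwidth=1.0):
    """Fill None entries of a 2-D grid by Gaussian kernel regression.

    Known values are left unchanged. Every empty cell is processed
    independently, so the double loop maps to one GPU thread per cell.
    """
    half = window // 2
    rows, cols = len(grid), len(grid[0])
    filled = [row[:] for row in grid]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] is not None:
                continue
            weighted_sum = 0.0
            weight_total = 0.0
            for dr in range(-half, half + 1):
                for dc in range(-half, half + 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols and grid[rr][cc] is not None:
                        weight = math.exp(
                            -(dr * dr + dc * dc) / (2.0 * bandwidth ** 2))
                        weighted_sum += weight * grid[rr][cc]
                        weight_total += weight
            if weight_total > 0.0:
                filled[r][c] = weighted_sum / weight_total
    return filled


# Illustrative slice with one empty voxel in the middle.
toy_slice = [[1.0, 1.0, 1.0], [1.0, None, 1.0], [1.0, 1.0, 1.0]]
filled_slice = fill_holes(toy_slice)
```

Compared with copying the nearest neighbour, the weighted average smooths speckle while the bandwidth controls how far the influence of each known voxel reaches.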
Performance estimation of Tesla turbine applied in small scale Organic Rankine Cycle (ORC) system

International Nuclear Information System (INIS)

Song, Jian; Gu, Chun-wei; Li, Xue-song

2017-01-01

Highlights: • One-dimensional model of the Tesla turbine is improved and applied in ORC system. • Working fluid properties and system operating conditions impact efficiency. • The influence of turbine efficiency on ORC system performance is evaluated. • Potential of using Tesla turbine in ORC systems is estimated. - Abstract: The Organic Rankine Cycle (ORC) system has been proven to be an effective method for low grade energy utilization. In small scale applications, the Tesla turbine offers an attractive option for the organic expander if an efficient design can be achieved. The Tesla turbine is simple in structure and easy to manufacture. This paper improves the one-dimensional model for the Tesla turbine, which adopts a non-dimensional formulation that identifies the dimensionless parameters that dictate the performance features of the turbine. The model is used to predict the efficiency of a Tesla turbine that is applied in a small scale ORC system. The influence of the working fluid properties and the operating conditions on the turbine performance is evaluated. Thermodynamic analysis of the ORC system with different organic working fluids and under various operating conditions is conducted. The simulation results reveal that the ORC system can generate a considerable net power output. Therefore, the Tesla turbine can be regarded as a potential choice to be applied in small scale ORC systems.
SuperNeurons: Dynamic GPU Memory Management for Training Deep Neural Networks

OpenAIRE

Wang, Linnan; Ye, Jinmian; Zhao, Yiyang; Wu, Wei; Li, Ang; Song, Shuaiwen Leon; Xu, Zenglin; Kraska, Tim

2018-01-01

Going deeper and wider in neural architectures improves the accuracy, while the limited GPU DRAM places an undesired restriction on the network design domain. Deep Learning (DL) practitioners either need to change to less desired network architectures or nontrivially dissect a network across multiple GPUs. These distract DL practitioners from concentrating on their original machine learning tasks. We present SuperNeurons: a dynamic GPU memory scheduling runtime to enable the network training far be...
APEnet+: high bandwidth 3D torus direct network for petaflops scale commodity clusters

International Nuclear Information System (INIS)

Ammendola, R; Salamon, A; Salina, G; Biagioni, A; Prezza, O; Cicero, F Lo; Lonardo, A; Paolucci, P S; Rossetti, D; Tosoratto, L; Vicini, P; Simula, F

2011-01-01

We describe herein the APElink+ board, a PCIe interconnect adapter featuring the latest advances in wire speed and interface technology plus hardware support for an RDMA programming model and experimental acceleration of GPU networking; this design allows us to build a low latency, high bandwidth PC cluster, the APEnet+ network, the new generation of our cost-effective, tens-of-thousands-scalable cluster network architecture. Some test results and characterization of data transmission of a complete testbench, based on a commercial development card mounting an Altera® FPGA, are provided.
APEnet+: high bandwidth 3D torus direct network for petaflops scale commodity clusters

Energy Technology Data Exchange (ETDEWEB)

Ammendola, R; Salamon, A; Salina, G [INFN Tor Vergata, Roma (Italy); Biagioni, A; Prezza, O; Cicero, F Lo; Lonardo, A; Paolucci, P S; Rossetti, D; Tosoratto, L; Vicini, P [INFN Roma, Roma (Italy); Simula, F [Sapienza Universita di Roma, Roma (Italy)

2011-12-23

We describe herein the APElink+ board, a PCIe interconnect adapter featuring the latest advances in wire speed and interface technology plus hardware support for a RDMA programming model and experimental acceleration of GPU networking; this design allows us to build a low latency, high bandwidth PC cluster, the APEnet+ network, the new generation of our cost-effective, tens-of-thousands-scalable cluster network architecture. Some test results and characterization of data transmission of a complete testbench, based on a commercial development card mounting an Altera® FPGA, are provided.
The Legacy of Nikola Tesla

Indian Academy of Sciences (India)

The Legacy of Nikola Tesla - The AC System that he Helped to Usher in. D P Sen Gupta. General Article. Resonance – Journal of Science Education, Volume 12, Issue 3, March 2007, pp 54-69.
The CUBLAS and CULA based GPU acceleration of adaptive finite element framework for bioluminescence tomography.

Science.gov (United States)

Zhang, Bo; Yang, Xiang; Yang, Fei; Yang, Xin; Qin, Chenghu; Han, Dong; Ma, Xibo; Liu, Kai; Tian, Jie

2010-09-13

In molecular imaging (MI), especially optical molecular imaging, bioluminescence tomography (BLT) has emerged as an effective modality for small animal imaging. Finite element methods (FEMs), especially the adaptive finite element (AFE) framework, play an important role in BLT. The processing speed of the FEMs and the AFE framework still needs to be improved, although multi-threaded CPU and multi-CPU technologies have already been applied. In this paper, we introduce, for the first time, the graphics processing unit (GPU) as a new kind of acceleration technology for the AFE framework for BLT. Besides improving processing speed, GPU technology can strike a balance between cost and performance. CUBLAS and CULA are two important and powerful libraries for programming on NVIDIA GPUs. With the help of CUBLAS and CULA, it is easy to code on an NVIDIA GPU, and there is no need to worry about the details of the hardware environment of a specific GPU. The numerical experiments are designed to show the necessity, effect and application of the proposed CUBLAS and CULA based GPU acceleration. From the results of the experiments, we conclude that the proposed CUBLAS and CULA based GPU acceleration method can greatly improve the processing speed of the AFE framework while striking a balance between cost and performance.
PuReMD-GPU: A reactive molecular dynamics simulation package for GPUs

International Nuclear Information System (INIS)

Kylasa, S.B.; Aktulga, H.M.; Grama, A.Y.

2014-01-01

We present an efficient and highly accurate GP-GPU implementation of our community code, PuReMD, for reactive molecular dynamics simulations using the ReaxFF force field. PuReMD and its incorporation into LAMMPS (Reax/C) is used by a large number of research groups worldwide for simulating diverse systems ranging from biomembranes to explosives (RDX) at atomistic level of detail. The sub-femtosecond time-steps associated with ReaxFF strongly motivate significant improvements to per-timestep simulation time through effective use of GPUs. This paper presents, in detail, the design and implementation of PuReMD-GPU, which enables ReaxFF simulations on GPUs, as well as various performance optimization techniques we developed to obtain high performance on state-of-the-art hardware. Comprehensive experiments on model systems (bulk water and amorphous silica) are presented to quantify the performance improvements achieved by PuReMD-GPU and to verify its accuracy. In particular, our experiments show up to 16× improvement in runtime compared to our highly optimized CPU-only single-core ReaxFF implementation. PuReMD-GPU is a unique production code, and is currently available on request from the authors
PuReMD-GPU: A reactive molecular dynamics simulation package for GPUs

Energy Technology Data Exchange (ETDEWEB)

Kylasa, S.B., E-mail: skylasa@purdue.edu [Department of Elec. and Comp. Eng., Purdue University, West Lafayette, IN 47907 (United States); Aktulga, H.M., E-mail: hmaktulga@lbl.gov [Lawrence Berkeley National Laboratory, 1 Cyclotron Rd, MS 50F-1650, Berkeley, CA 94720 (United States); Grama, A.Y., E-mail: ayg@cs.purdue.edu [Department of Computer Science, Purdue University, West Lafayette, IN 47907 (United States)

2014-09-01

We present an efficient and highly accurate GP-GPU implementation of our community code, PuReMD, for reactive molecular dynamics simulations using the ReaxFF force field. PuReMD and its incorporation into LAMMPS (Reax/C) is used by a large number of research groups worldwide for simulating diverse systems ranging from biomembranes to explosives (RDX) at atomistic level of detail. The sub-femtosecond time-steps associated with ReaxFF strongly motivate significant improvements to per-timestep simulation time through effective use of GPUs. This paper presents, in detail, the design and implementation of PuReMD-GPU, which enables ReaxFF simulations on GPUs, as well as various performance optimization techniques we developed to obtain high performance on state-of-the-art hardware. Comprehensive experiments on model systems (bulk water and amorphous silica) are presented to quantify the performance improvements achieved by PuReMD-GPU and to verify its accuracy. In particular, our experiments show up to 16× improvement in runtime compared to our highly optimized CPU-only single-core ReaxFF implementation. PuReMD-GPU is a unique production code, and is currently available on request from the authors.
CUDA GPU based full-Stokes finite difference modelling of glaciers

DEFF Research Database (Denmark)

Brædstrup, Christian; Egholm, D.L.

advances in graphics card (GPU) technology for high performance computing have proven extremely efficient in accelerating many large scale scientific computations. The general purpose GPU (GPGPU) technology is cheap, has a low power consumption and fits into a normal desktop computer. It could therefore...... to minimize the short wavelength errors efficiently. This reduces the iteration count by several orders of magnitude. The run-time is further reduced by using the GPGPU technology where each card has up to 448 cores. Researchers utilizing the GPGPU technique in other areas have reported between 2 - 11 times...
A Laminar Flow-Based Microfluidic Tesla Pump via Lithography Enabled 3D Printing

Directory of Open Access Journals (Sweden)

Mohammed-Baker Habhab

2016-11-01

The Tesla turbine and its applications in power generation and fluid flow were demonstrated by Nikola Tesla in 1913. However, its real-world implementations were limited by the difficulty of maintaining laminar flow between rotor disks, transient efficiencies during rotor acceleration, and the lack of other applications that fully utilize the continuous flow outputs. All of the aforementioned limits of Tesla turbines can be addressed by scaling to the microfluidic flow regime. Demonstrated here is a microscale Tesla pump designed and fabricated using a Digital Light Processing (DLP) based 3D printer with 43 µm lateral and 30 µm thickness resolutions. The miniaturized pump is characterized by low Reynolds number of 1000 and a flow rate of up to 12.6 mL/min at 1200 rpm, unloaded. It is capable of driving a mixer network to generate microfluidic gradient. The continuous, laminar flow from Tesla turbines is well-suited to the needs of flow-sensitive microfluidics, where the integrated pump will enable numerous compact lab-on-a-chip applications.
A Laminar Flow-Based Microfluidic Tesla Pump via Lithography Enabled 3D Printing.

Science.gov (United States)

Habhab, Mohammed-Baker; Ismail, Tania; Lo, Joe Fujiou

2016-11-23

The Tesla turbine and its applications in power generation and fluid flow were demonstrated by Nikola Tesla in 1913. However, its real-world implementations were limited by the difficulty of maintaining laminar flow between rotor disks, transient efficiencies during rotor acceleration, and the lack of other applications that fully utilize the continuous flow outputs. All of the aforementioned limits of Tesla turbines can be addressed by scaling to the microfluidic flow regime. Demonstrated here is a microscale Tesla pump designed and fabricated using a Digital Light Processing (DLP) based 3D printer with 43 µm lateral and 30 µm thickness resolutions. The miniaturized pump is characterized by low Reynolds number of 1000 and a flow rate of up to 12.6 mL/min at 1200 rpm, unloaded. It is capable of driving a mixer network to generate microfluidic gradient. The continuous, laminar flow from Tesla turbines is well-suited to the needs of flow-sensitive microfluidics, where the integrated pump will enable numerous compact lab-on-a-chip applications.
An Investigation of the Performance of the Colored Gauss-Seidel Solver on CPU and GPU

International Nuclear Information System (INIS)

Yoon, Jong Seon; Choi, Hyoung Gwon; Jeon, Byoung Jin

2017-01-01

The performance of the colored Gauss–Seidel solver on CPU and GPU was investigated for two- and three-dimensional heat conduction problems by using different mesh sizes. The heat conduction equation was discretized by the finite difference method and the finite element method. The CPU performed well for small problems, but its performance deteriorated for large problems, when the total memory required for computing exceeded the cache size. In contrast, the GPU performed better as the mesh size increased because of the latency hiding technique. Further, GPU computation with the colored Gauss–Seidel solver was approximately 7 times faster than that on a single CPU. Furthermore, the colored Gauss–Seidel solver was found to be approximately twice as fast as the Jacobi solver when parallel computing was conducted on the GPU.
An Investigation of the Performance of the Colored Gauss-Seidel Solver on CPU and GPU

Energy Technology Data Exchange (ETDEWEB)

Yoon, Jong Seon; Choi, Hyoung Gwon [Seoul Nat’l Univ. of Science and Technology, Seoul (Korea, Republic of); Jeon, Byoung Jin [Yonsei Univ., Seoul (Korea, Republic of)

2017-02-15

The performance of the colored Gauss–Seidel solver on CPU and GPU was investigated for two- and three-dimensional heat conduction problems by using different mesh sizes. The heat conduction equation was discretized by the finite difference method and the finite element method. The CPU performed well for small problems, but its performance deteriorated for large problems, when the total memory required for computing exceeded the cache size. In contrast, the GPU performed better as the mesh size increased because of the latency hiding technique. Further, GPU computation with the colored Gauss–Seidel solver was approximately 7 times faster than that on a single CPU. Furthermore, the colored Gauss–Seidel solver was found to be approximately twice as fast as the Jacobi solver when parallel computing was conducted on the GPU.
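To make the "colored" ordering in the record above concrete, here is a minimal sketch of a two-color (red-black) Gauss–Seidel sweep for steady 2D heat conduction on a finite difference grid. It is a serial, CPU-only illustration and not the authors' code; the function names, the 9 x 9 plate, and the boundary temperatures are hypothetical choices made for this example.

```python
def red_black_gauss_seidel(grid, sweeps):
    """Relax the interior of a 2D grid toward the steady-state heat equation.

    Points are split into two colors by the parity of (i + j). Every neighbor
    of a "red" point is "black" and vice versa, so all points of one color can
    be updated independently of each other.
    """
    rows, cols = len(grid), len(grid[0])
    for _ in range(sweeps):
        for color in (0, 1):  # 0 = "red" points, 1 = "black" points
            for i in range(1, rows - 1):
                for j in range(1, cols - 1):
                    if (i + j) % 2 == color:
                        grid[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                             + grid[i][j - 1] + grid[i][j + 1])
    return grid


def make_plate(n, hot=100.0):
    """Hypothetical test problem: n x n plate, top edge at `hot`, other edges at 0."""
    grid = [[0.0] * n for _ in range(n)]
    grid[0] = [hot] * n
    return grid


plate = red_black_gauss_seidel(make_plate(9), sweeps=500)
center = plate[4][4]
print(f"center temperature: {center:.6f}")
```

Because same-colored points never read each other's values, each half-sweep could in principle be mapped to one parallel kernel launch, which is the property that makes this ordering attractive on a GPU.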
Compact multimode fiber beam-shaping system based on GPU accelerated digital holography.

Science.gov (United States)

Plöschner, Martin; Čižmár, Tomáš

2015-01-15

Real-time, on-demand, beam shaping at the end of the multimode fiber has recently been made possible by exploiting the computational power of rapidly evolving graphics processing unit (GPU) technology [Opt. Express 22, 2933 (2014)]. However, the current state-of-the-art system requires the presence of an acousto-optic deflector (AOD) to produce images at the end of the fiber without interference effects between neighboring output points. Here, we present a system free from the AOD complexity where we achieve the removal of the undesired interference effects computationally using GPU implemented Gerchberg-Saxton and Yang-Gu algorithms. The GPU implementation is two orders of magnitude faster than the CPU implementation which allows video-rate image control at the distal end of the fiber virtually free of interference effects.
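As a rough illustration of the Gerchberg-Saxton iteration named in the record above, the following one-dimensional toy alternates between an input plane and an output plane, keeping the computed phase and re-imposing the known amplitude in each. It is not the authors' GPU implementation, and the Yang-Gu variant is not shown; the naive DFT, the uniform source amplitude, the two-spot target, and all function names are assumptions made for this sketch.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform with unitary scaling."""
    n = len(x)
    sign = 1 if inverse else -1
    scale = 1 / math.sqrt(n)
    return [scale * sum(x[k] * cmath.exp(sign * 2j * math.pi * k * m / n)
                        for k in range(n))
            for m in range(n)]


def gerchberg_saxton(source_amp, target_amp, iterations):
    """Return the input-plane phase and the output-plane error per iteration."""
    phase = [0.0] * len(source_amp)
    errors = []
    for _ in range(iterations):
        field = [a * cmath.exp(1j * p) for a, p in zip(source_amp, phase)]
        far = dft(field)
        # Distance between the achieved and the desired output amplitudes.
        errors.append(math.sqrt(sum((abs(f) - t) ** 2
                                    for f, t in zip(far, target_amp))))
        # Output plane: keep the computed phase, impose the desired amplitude.
        far = [t * cmath.exp(1j * cmath.phase(f))
               for f, t in zip(far, target_amp)]
        near = dft(far, inverse=True)
        # Input plane: keep the computed phase; the known source amplitude
        # is re-imposed at the start of the next iteration.
        phase = [cmath.phase(v) for v in near]
    return phase, errors


n = 16
source = [1.0] * n                    # hypothetical uniform illumination
target = [0.0] * n
target[3] = target[12] = math.sqrt(8)  # two output spots, same total energy
phase, errors = gerchberg_saxton(source, target, iterations=30)
print(f"error: first {errors[0]:.3f}, last {errors[-1]:.3f}")
```

The error recorded here does not increase from one iteration to the next, which is the usual error-reduction property of this scheme; a real hologram calculation would use a 2D FFT rather than this naive transform.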
Fast Simulation of Large-Scale Floods Based on GPU Parallel Computing

Directory of Open Access Journals (Sweden)

Qiang Liu

2018-05-01

Computing speed is a significant issue of large-scale flood simulations for real-time response to disaster prevention and mitigation. Even today, most of the large-scale flood simulations are generally run on supercomputers due to the massive amounts of data and computations necessary. In this work, a two-dimensional shallow water model based on an unstructured Godunov-type finite volume scheme was proposed for flood simulation. To realize a fast simulation of large-scale floods on a personal computer, a Graphics Processing Unit (GPU)-based, high-performance computing method using the OpenACC application was adopted to parallelize the shallow water model. An unstructured data management method was presented to control the data transportation between the GPU and CPU (Central Processing Unit) with minimum overhead, and then both computation and data were offloaded from the CPU to the GPU, which exploited the computational capability of the GPU as much as possible. The parallel model was validated using various benchmarks and real-world case studies. The results demonstrate that speed-ups of up to one order of magnitude can be achieved in comparison with the serial model. The proposed parallel model provides a fast and reliable tool with which to quickly assess flood hazards in large-scale areas and, thus, has a bright application prospect for dynamic inundation risk identification and disaster assessment.
Using the CPU and GPU for real-time video enhancement on a mobile computer

CSIR Research Space (South Africa)

Bachoo, AK

2010-09-01

In this paper, the current advances in mobile CPU and GPU hardware are used to implement video enhancement algorithms in a new way on a mobile computer. Both the CPU and GPU are used effectively to achieve real-time performance for complex image enhancement...

Upgrading the power supplies of TEXTOR for three Tesla operation

International Nuclear Information System (INIS)

Giesen, B.; Veiders, E.; Petree, F.; Fink, R.; Wagnitz, R.

1986-01-01

The toroidal magnetic system of TEXTOR can tolerate a magnetic field load of up to 2.6 Tesla routinely at full plasma current, and of up to 3 Tesla under certain boundary conditions and for a restricted number of discharges. The original power supply which can generate a toroidal magnetic field of 2 Tesla has been upgraded to operate at a field strength of 3 Tesla, by adding a new, controlled rectifier, with its own independent control, connected in parallel with the first. Studies were undertaken to determine its behaviour where control is lost, such as when a circuit breaker trips or in "freewheel" operation. This paper analyzes this asymmetrical arrangement and discusses the danger of damaging the smaller unit by commutating a large current into it. Moreover, in order to improve the availability of TEXTOR, the new controlled rectifier is redundant to two other units that control the vertical field and the ohmic heating coil currents. For this purpose the two bridges of this 12-pulse system are to be changed from a parallel to a series connexion, the free-wheeling diodes are disconnected and redeployed to block the large voltage pulses that are induced at plasma initiation, and 3-phase "freewheeling" thyristors are added that serve to reduce reactive power consumption.
TESLA accelerator installation

International Nuclear Information System (INIS)

Neskovic, N.; Ostojic, R.; Susini, A.; Milinkovic, Lj.; Ciric, D.; Dobrosavljevic, A.; Brajuskovic, B.; Cirkovic, S.; Bojovic, B.; Josipovic, M.

1992-01-01

The TESLA accelerator Installation is described. Its main parts are the VINCY Cyclotron, the multiply charged heavy-ion mVINIS Ion Source, and the negative light-ion pVINIS Ion Source. The Installation should be the principal installation of a regional center for basic and applied research in nuclear physics, atomic physics, surface physics and solid state physics, for production of radioisotopes, for research and therapy in nuclear medicine. The first extraction of the ion beam from the Cyclotron is planned for 1995. (R.P.) 3 refs.; 1 fig
GFARGO: FARGO for GPU

Science.gov (United States)

Masset, Frédéric

2015-09-01

GFARGO is a GPU version of FARGO. It is written in C and C for CUDA and runs only on NVIDIA’s graphics cards. Though it corresponds to the standard, isothermal version of FARGO, not all functionalities of the CPU version have been translated to CUDA. The code is available in single and double precision versions, the latter compatible with FERMI architectures. GFARGO can run on a graphics card connected to the display, allowing the user to see in real time how the fields evolve.
Operating experience with superconducting cavities at the TESLA test facility

International Nuclear Information System (INIS)

Moeller, Wolf-Dietrich

2003-01-01

A description of the TESLA Test Facility, which has been set up at DESY by the TeV Energy Superconducting Accelerator (TESLA) collaboration, will be given as it is now after five years of installation and operation. The experience with the first three modules, each containing 8 superconducting 9-cell cavities, installed and operated in the TTF-linac will be described. The measurements in the vertical and horizontal cryostats as well as in the modules will be compared. Recent results of the operation at the TESLA design current, macropulses of 800 μsec with bunches of 3.2 nC at a rate of 2.25 MHz are given. New measurement results of the higher order modes (HOM) will be presented. The operation and optimisation of the TTF Free Electron Laser (TTF-FEL) will also be covered in this paper. (author)
In vivo high-resolution 7 Tesla MRI shows early and diffuse cortical alterations in CADASIL.

Science.gov (United States)

De Guio, François; Reyes, Sonia; Vignaud, Alexandre; Duering, Marco; Ropele, Stefan; Duchesnay, Edouard; Chabriat, Hugues; Jouvent, Eric

2014-01-01

Recent data suggest that early symptoms may be related to cortex alterations in CADASIL (Cerebral Autosomal-Dominant Arteriopathy with Subcortical Infarcts and Leukoencephalopathy), a monogenic model of cerebral small vessel disease (SVD). The aim of this study was to investigate cortical alterations using both high-resolution T2* acquisitions obtained with 7 Tesla MRI and structural T1 images with 3 Tesla MRI in CADASIL patients with no or only mild symptomatology (modified Rankin's scale ≤1 and Mini Mental State Examination (MMSE) ≥24). Complete reconstructions of the cortex using 7 Tesla T2* acquisitions with 0.7 mm isotropic resolution were obtained in 11 patients (52.1±13.2 years, 36% male) and 24 controls (54.8±11.0 years, 42% male). Seven Tesla T2* within the cortex and cortical thickness and morphology obtained from 3 Tesla images were compared between CADASIL and control subjects using general linear models. MMSE, brain volume, cortical thickness and global sulcal morphology did not differ between groups. By contrast, T2* measured by 7 Tesla MRI was significantly increased in frontal, parietal, occipital and cingulate cortices in patients after correction for multiple testing. These changes were not related to white matter lesions, lacunes or microhemorrhages in patients having no brain atrophy compared to controls. Seven Tesla MRI, by contrast to state of the art post-processing of 3 Tesla acquisitions, shows diffuse T2* alterations within the cortical mantle in CADASIL whose origin remains to be determined.
In vivo high-resolution 7 Tesla MRI shows early and diffuse cortical alterations in CADASIL.

Directory of Open Access Journals (Sweden)

François De Guio

Recent data suggest that early symptoms may be related to cortex alterations in CADASIL (Cerebral Autosomal-Dominant Arteriopathy with Subcortical Infarcts and Leukoencephalopathy), a monogenic model of cerebral small vessel disease (SVD). The aim of this study was to investigate cortical alterations using both high-resolution T2* acquisitions obtained with 7 Tesla MRI and structural T1 images with 3 Tesla MRI in CADASIL patients with no or only mild symptomatology (modified Rankin's scale ≤1 and Mini Mental State Examination (MMSE) ≥24). Complete reconstructions of the cortex using 7 Tesla T2* acquisitions with 0.7 mm isotropic resolution were obtained in 11 patients (52.1±13.2 years, 36% male) and 24 controls (54.8±11.0 years, 42% male). Seven Tesla T2* within the cortex and cortical thickness and morphology obtained from 3 Tesla images were compared between CADASIL and control subjects using general linear models. MMSE, brain volume, cortical thickness and global sulcal morphology did not differ between groups. By contrast, T2* measured by 7 Tesla MRI was significantly increased in frontal, parietal, occipital and cingulate cortices in patients after correction for multiple testing. These changes were not related to white matter lesions, lacunes or microhemorrhages in patients having no brain atrophy compared to controls. Seven Tesla MRI, by contrast to state of the art post-processing of 3 Tesla acquisitions, shows diffuse T2* alterations within the cortical mantle in CADASIL whose origin remains to be determined.
Democratic population decisions result in robust policy-gradient learning: a parametric study with GPU simulations.

Directory of Open Access Journals (Sweden)

Paul Richmond

2011-05-01

High performance computing on the Graphics Processing Unit (GPU) is an emerging field driven by the promise of high computational power at a low cost. However, GPU programming is a non-trivial task and moreover architectural limitations raise the question of whether investing effort in this direction may be worthwhile. In this work, we use GPU programming to simulate a two-layer network of Integrate-and-Fire neurons with varying degrees of recurrent connectivity and investigate its ability to learn a simplified navigation task using a policy-gradient learning rule stemming from Reinforcement Learning. The purpose of this paper is twofold. First, we want to support the use of GPUs in the field of Computational Neuroscience. Second, using GPU computing power, we investigate the conditions under which the said architecture and learning rule demonstrate best performance. Our work indicates that networks featuring strong Mexican-Hat-shaped recurrent connections in the top layer, where decision making is governed by the formation of a stable activity bump in the neural population (a "non-democratic" mechanism), achieve mediocre learning results at best. In absence of recurrent connections, where all neurons "vote" independently ("democratic") for a decision via population vector readout, the task is generally learned better and more robustly. Our study would have been extremely difficult on a desktop computer without the use of GPU programming. We present the routines developed for this purpose and show that a speed improvement of 5x up to 42x is provided versus optimised Python code. The higher speed is achieved when we exploit the parallelism of the GPU in the search of learning parameters. This suggests that efficient GPU programming can significantly reduce the time needed for simulating networks of spiking neurons, particularly when multiple parameter configurations are investigated.
TESLA Test Facility. Status

International Nuclear Information System (INIS)

Aune, B.

1996-01-01

The TESLA Test Facility (TTF), under construction at DESY by an international collaboration, is an R&D test bed for the superconducting option for future linear e+/e- colliders. It consists of an infrastructure to process and test the cavities and of a 500 MeV linac. The infrastructure has been installed and is fully operational. It includes a complex of clean rooms, an ultra-clean water plant, a chemical etching installation and an ultra-high vacuum furnace. The linac will consist of four cryo-modules, each containing eight 1 meter long nine-cell cavities operated at 1.3 GHz. The base accelerating field is 15 MV/m. A first injector will deliver a low charge per bunch beam, with the full average current (8 mA in pulses of 800 μs). A more powerful injector based on RF gun technology will ultimately deliver a beam with high charge and low emittance to allow measurements necessary to qualify the TESLA option and to demonstrate the possibility of operating a free electron laser based on the Self-Amplified-Spontaneous-Emission principle. Overview and status of the facility will be given. Plans for the future use of the linac are presented. (R.P.)
Graphics processing units accelerated semiclassical initial value representation molecular dynamics

Energy Technology Data Exchange (ETDEWEB)

Tamascelli, Dario; Dambrosio, Francesco Saverio [Dipartimento di Fisica, Università degli Studi di Milano, via Celoria 16, 20133 Milano (Italy); Conte, Riccardo [Department of Chemistry and Cherry L. Emerson Center for Scientific Computation, Emory University, Atlanta, Georgia 30322 (United States); Ceotto, Michele, E-mail: michele.ceotto@unimi.it [Dipartimento di Chimica, Università degli Studi di Milano, via Golgi 19, 20133 Milano (Italy)

2014-05-07

This paper presents a Graphics Processing Units (GPUs) implementation of the Semiclassical Initial Value Representation (SC-IVR) propagator for vibrational molecular spectroscopy calculations. The time-averaging formulation of the SC-IVR for power spectrum calculations is employed. Details about the GPU implementation of the semiclassical code are provided. Four molecules with an increasing number of atoms are considered and the GPU-calculated vibrational frequencies perfectly match the benchmark values. The computational time scaling of two GPUs (NVIDIA Tesla C2075 and Kepler K20), respectively, versus two CPUs (Intel Core i5 and Intel Xeon E5-2687W) and the critical issues related to the GPU implementation are discussed. The resulting reduction in computational time and power consumption is significant and semiclassical GPU calculations are shown to be environment friendly.
Electron scattering with polarized targets at TESLA

International Nuclear Information System (INIS)

Anselmino, M.; Aschenauer, E.C.; Belostotski, S.

2000-11-01

Measurements of polarized electron-nucleon scattering can be realized at the TESLA linear collider facility with projected luminosities that are about two orders of magnitude higher than those expected of other experiments at comparable energies. Longitudinally polarized electrons, accelerated as a small fraction of the total current in the e+ arm of TESLA, can be directed onto a solid state target that may be either longitudinally or transversely polarized. A large variety of polarized parton distribution and fragmentation functions can be determined with unprecedented accuracy, many of them for the first time. A main goal of the experiment is the precise measurement of the x- and Q²-dependence of the experimentally totally unknown quark transversity distributions that will complete the information on the nucleon's quark spin structure as relevant for high energy processes. Comparing their Q²-evolution to that of the corresponding helicity distributions constitutes an important precision test of the predictive power of QCD in the spin sector. Measuring transversity distributions and tensor charges allows access to the hitherto unmeasured chirally odd operators in QCD which are of great importance to understand the role of chiral symmetry. The possibilities of using unpolarized targets and of experiments with a real photon beam turn TESLA-N into a versatile next-generation facility at the intersection of particle and nuclear physics. (orig.)
Enhancing professionalism at GPU nuclear

International Nuclear Information System (INIS)

Coe, R.P.; Landy, F.J.

1991-01-01

Late in 1988, GPU Nuclear embarked on a major program aimed at enhancing Professionalism at its Oyster Creek and Three Mile Island Nuclear Generating Stations. The program was also to include its Corporate Headquarters in Parsippany, New Jersey. The overall program was to take several directions which included on-site degree programs, a sabbatical leave-type program for personnel to finish college degrees, advanced technical training for licensed staff, career progression for SROs and expanded teamwork and leadership training for control room crews. The largest portion of this initiative was the development and delivery of professionalism training to the nearly two thousand people at both sites. Three primary philosophies guided the development of the program. Employees as Experts: First, GPU Nuclear employees were considered to be the most valuable source of information for designing a Professionalism program because it is these individuals who are sensitive to the issues encountered in the workplace. Realism: The second philosophy guiding this effort was that the program must be grounded in real life challenges that employees face and must address. Active Learning: The third guiding philosophy was that, in order to have any real impact on the way employees think about professionalism, the program must utilize active rather than passive learning techniques
A fast and accurate image reconstruction using GPU for OpenPET prototype

International Nuclear Information System (INIS)

Kinouchi, Shoko; Suga, Mikio; Yamaya, Taiga; Yoshida, Eiji

2010-01-01

The OpenPET (positron emission tomography), which has a physically opened space between two detector rings, is our new geometry to enable PET imaging during radiation therapy if the real-time imaging system is realized. In this paper, therefore, we developed a list-mode image reconstruction method using general purpose graphic processing units (GPUs). We used the list-mode dynamic row-action maximum likelihood algorithm (DRAMA). For GPU implementation, the efficiency of acceleration depends on the implementation method, which is required to avoid conditional statements. We developed a system model in which each element of the system matrix is calculated as the value of the detector response function (DRF) of the length between the center of a voxel and a line of response (LOR). The system model was suited to GPU implementations that enable us to calculate each element of the system matrix with a reduced number of conditional statements. We applied the developed method to a small OpenPET prototype, which was developed for a proof-of-concept. We measured the micro-Derenzo phantom placed at the gap. The results showed that the same quality of reconstructed images was achieved using the GPU as using the central processing unit (CPU), and calculation speed on the GPU was 35.5 times faster than that on the CPU. (author)
GPU Accelerated Surgical Simulators for Complex Morphology

DEFF Research Database (Denmark)

Mosegaard, Jesper; Sørensen, Thomas Sangild

2005-01-01

a spring-mass system in order to simulate a complex organ such as the heart. Computations are accelerated by taking advantage of modern graphics processing units (GPUs). Two GPU implementations are presented. They vary in their generality of spring connections and in the speedup factor they achieve...
Gadolinium-based magnetic resonance contrast agents at 7 Tesla: in vitro T1 relaxivities in human blood plasma.

Science.gov (United States)

Noebauer-Huhmann, Iris M; Szomolanyi, Pavol; Juras, Vladimír; Kraff, Oliver; Ladd, Mark E; Trattnig, Siegfried

2010-09-01

PURPOSE/INTRODUCTION: The aim of this study was to determine the T1 relaxivities (r1) of 8 gadolinium (Gd)-based MR contrast agents in human blood plasma at 7 Tesla, compared with 3 Tesla. Eight commercially available Gd-based MR contrast agents were diluted in human blood plasma to concentrations of 0, 0.25, 0.5, 1, and 2 mmol/L. In vitro measurements were performed at 37 degrees C, on a 7 Tesla and on a 3 Tesla whole-body magnetic resonance imaging scanner. For the determination of T1 relaxation times, Inversion Recovery Sequences with inversion times from 0 to 3500 ms were used. The relaxivities were calculated. The r1 relaxivities of all agents, diluted in human blood plasma at body temperature, were lower at 7 Tesla than at 3 Tesla. The values at 3 Tesla were comparable to those published earlier. Notably, in some agents, a minor negative correlation of r1 with a concentration of up to 2 mmol/L could be observed. This was most pronounced in the agents with the highest protein-binding capacity. At 7 Tesla, the in vitro r1 relaxivities of Gd-based contrast agents in human blood plasma are lower than those at 3 Tesla. This work may serve as a basis for the application of Gd-based MR contrast agents at 7 Tesla. Further studies are required to optimize the contrast agent dose in vivo.
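The record above states that the relaxivities "were calculated" without giving the procedure. A common way to do this, assumed here rather than taken from the paper, is to fit the relaxation rate 1/T1 as a straight line in concentration, 1/T1 = 1/T1(0) + r1·C, and read r1 off the slope. The sketch below uses the concentration steps from the abstract, but the T1 values are synthetic, generated from made-up parameters purely to exercise the fit; they are not measured data.

```python
def fit_relaxivity(concentrations_mmol_per_l, t1_seconds):
    """Least-squares fit of R1 = 1/T1 against concentration.

    Returns (r1, r1_0): the slope in L/(mmol*s) and the intercept in 1/s.
    """
    rates = [1.0 / t1 for t1 in t1_seconds]
    n = len(rates)
    mean_c = sum(concentrations_mmol_per_l) / n
    mean_r = sum(rates) / n
    sxy = sum((c - mean_c) * (r - mean_r)
              for c, r in zip(concentrations_mmol_per_l, rates))
    sxx = sum((c - mean_c) ** 2 for c in concentrations_mmol_per_l)
    slope = sxy / sxx
    intercept = mean_r - slope * mean_c
    return slope, intercept


# Concentration steps as listed in the abstract (mmol/L).
concentrations = [0.0, 0.25, 0.5, 1.0, 2.0]

# Synthetic, made-up parameters used only to generate test input.
assumed_r1 = 4.0     # L/(mmol*s), hypothetical
assumed_r1_0 = 0.5   # 1/s, hypothetical plasma relaxation rate without agent
t1_values = [1.0 / (assumed_r1_0 + assumed_r1 * c) for c in concentrations]

r1, r1_0 = fit_relaxivity(concentrations, t1_values)
print(f"fitted r1 = {r1:.3f} L/(mmol*s), intercept = {r1_0:.3f} 1/s")
```

A strictly linear fit would not capture the slight concentration dependence of r1 that the abstract reports for some agents; detecting that would require inspecting the residuals or fitting each concentration separately.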
GPU-accelerated ray-tracing for real-time treatment planning

International Nuclear Information System (INIS)

Heinrich, H; Ziegenhein, P; Kamerling, C P; Oelfke, U; Froening, H

2014-01-01

Dose calculation methods in radiotherapy treatment planning require the radiological depth information of the voxels that represent the patient volume to correct for tissue inhomogeneities. This information is acquired by time consuming ray-tracing-based calculations. For treatment planning scenarios with changing geometries and real-time constraints this is a severe bottleneck. We implemented an algorithm for the graphics processing unit (GPU) which implements a ray-matrix approach to reduce the number of rays to trace. Furthermore, we investigated the impact of different strategies of accessing memory in kernel implementations as well as strategies for rapid data transfers between main memory and memory of the graphics device. Our study included the overlapping of computations and memory transfers to reduce the overall runtime using Hyper-Q. We tested our approach on a prostate case (9 beams, coplanar). The measured execution times for a complete ray-tracing range from 28 msec for the computations on the GPU to 99 msec when considering data transfers to and from the graphics device. Our GPU-based algorithm performed the ray-tracing in real-time. The strategies efficiently reduce the time consumption of memory accesses and data transfer overhead. The achieved runtimes demonstrate the viability of this approach and allow improved real-time performance for dose calculation methods in clinical routine.
GPU Lossless Hyperspectral Data Compression System for Space Applications

Science.gov (United States)

Keymeulen, Didier; Aranki, Nazeeh; Hopson, Ben; Kiely, Aaron; Klimesh, Matthew; Benkrid, Khaled

2012-01-01

On-board lossless hyperspectral data compression reduces data volume in order to meet NASA and DoD limited downlink capabilities. At JPL, a novel, adaptive and predictive technique for lossless compression of hyperspectral data, named the Fast Lossless (FL) algorithm, was recently developed. This technique uses an adaptive filtering method and achieves state-of-the-art performance in both compression effectiveness and low complexity. Because of its outstanding performance and suitability for real-time onboard hardware implementation, the FL compressor is being formalized as the emerging CCSDS Standard for Lossless Multispectral & Hyperspectral image compression. The FL compressor is well-suited for parallel hardware implementation. A GPU hardware implementation was developed for FL targeting the current state-of-the-art GPUs from NVIDIA. The GPU implementation on an NVIDIA GeForce GTX 580 achieves a throughput performance of 583.08 Mbits/sec (44.85 MSamples/sec) and an acceleration of at least 6 times over a software implementation running on a 3.47 GHz single core Intel Xeon processor. This paper describes the design and implementation of the FL algorithm on the GPU. The massively parallel implementation will provide in the future a fast and practical real-time solution for airborne and space applications.
NAVIER-STOKES ON GPU

OpenAIRE

ALEX LAIER BORDIGNON

2006-01-01

In this work, we show how to simulate a fluid in two dimensions in a domain with arbitrary boundaries. Our work is based on the stable fluids scheme developed by Jos Stam. The implementation runs on the GPU (Graphics Processing Unit), which allows interaction with the fluid at interactive speeds. We use the Cg (C for Graphics) language, developed by NVIDIA. Our main contributions are the handling of multiple boundaries, the...
Accelerating the XGBoost algorithm using GPU computing

Directory of Open Access Journals (Sweden)

Rory Mitchell

2017-07-01

We present a CUDA-based implementation of a decision tree construction algorithm within the gradient boosting library XGBoost. The tree construction algorithm is executed entirely on the graphics processing unit (GPU) and shows high performance with a variety of datasets and settings, including sparse input matrices. Individual boosting iterations are parallelised, combining two approaches. An interleaved approach is used for shallow trees, switching to a more conventional radix sort-based approach for larger depths. We show speedups of between 3× and 6× using a Titan X compared to a 4 core i7 CPU, and 1.2× using a Titan X compared to 2× Xeon CPUs (24 cores). We show that it is possible to process the Higgs dataset (10 million instances, 28 features) entirely within GPU memory. The algorithm is made available as a plug-in within the XGBoost library and fully supports all XGBoost features including classification, regression and ranking tasks.
GPU-accelerated simulations of isolated black holes

Science.gov (United States)

Lewis, Adam G. M.; Pfeiffer, Harald P.

2018-05-01

We present a port of the numerical relativity code SpEC which is capable of running on NVIDIA GPUs. Since this code must be maintained in parallel with SpEC itself, a primary design consideration is to perform as few explicit code changes as possible. We therefore rely on a hierarchy of automated porting strategies. At the highest level we use TLoops, a C++ library of our design, to automatically emit CUDA code equivalent to tensorial expressions written into C++ source using a syntax similar to analytic calculation. Next, we trace out and cache explicit matrix representations of the numerous linear transformations in the SpEC code, which allows these to be performed on the GPU using pre-existing matrix-multiplication libraries. We port the few remaining important modules by hand. In this paper we detail the specifics of our port, and present benchmarks of it simulating isolated black hole spacetimes on several generations of NVIDIA GPU.
SU-E-T-423: Fast Photon Convolution Calculation with a 3D-Ideal Kernel On the GPU

Energy Technology Data Exchange (ETDEWEB)

Moriya, S; Sato, M [Komazawa University, Setagaya, Tokyo (Japan); Tachibana, H [National Cancer Center Hospital East, Kashiwa, Chiba (Japan)

2015-06-15

Purpose: Calculation time is the trade-off for improving the accuracy of convolution dose calculation with fine calculation spacing of the KERMA kernel. We investigated accelerating the convolution calculation using an ideal kernel on the Graphics Processing Unit (GPU). Methods: The calculation was performed on AMD graphics hardware (dual FirePro D700), and our algorithm was implemented using Aparapi, which converts Java bytecode to OpenCL. The dose calculation was separated into TERMA and KERMA steps. The dose deposited at the coordinate (x, y, z) was determined in the process. In the dose calculation running on the central processing unit (CPU), an Intel Xeon E5, the calculation loops were performed for all calculation points. For the GPU computation, all of the calculation processes for the points were sent to the GPU and computed with multiple threads. In this study, the dose calculation was performed in a water-equivalent homogeneous phantom with 150^3 voxels (2 mm calculation grid), and the calculation speed on the GPU relative to that on the CPU and the accuracy of the PDD were compared. Results: The calculation times for the GPU and the CPU were 3.3 sec and 4.4 hours, respectively. The calculation on the GPU was 4800 times faster than that on the CPU. The PDD curve for the GPU matched that for the CPU perfectly. Conclusion: The convolution calculation with the ideal kernel on the GPU was clinically acceptable in terms of time and may be more accurate in an inhomogeneous region. Intensity modulated arc therapy needs dose calculations for different gantry angles at many control points. Thus, it would be more practical for the kernel to use a coarse spacing technique if the calculation is faster while keeping accuracy similar to that of a current treatment planning system.
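The speedup quoted in this record can be checked directly from the two run times it reports. The short Python sketch below is only an illustration of that arithmetic, not the authors' code; the function name `speedup` is an assumption introduced here.

```python
def speedup(cpu_seconds: float, gpu_seconds: float) -> float:
    """Ratio of CPU run time to GPU run time."""
    return cpu_seconds / gpu_seconds


# Run times quoted in the abstract: 4.4 hours on the CPU, 3.3 seconds on the GPU.
cpu_seconds = 4.4 * 3600.0
gpu_seconds = 3.3

ratio = speedup(cpu_seconds, gpu_seconds)
print(round(ratio))
```

The ratio comes out at about 4800, in agreement with the figure stated in the Results.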

Prostate cancer detection by prebiopsy 3.0-tesla magnetic resonance imaging

International Nuclear Information System (INIS)

Nishida, Sachiyo; Kinoshita, Hidefumi; Mishima, Takao; Kurokawa, Hiroaki; Sakaida, Noriko; Matsuda, Tadashi

2011-01-01

The diagnostic value of 3.0-Tesla magnetic resonance imaging (MRI) for prostate cancer remains to be determined. The aim of the present study was to assess the features of prostate cancer detectable by prebiopsy 3.0-Tesla MRI. From January 2007 through to December 2008, 116 patients who were examined by prebiopsy 3.0-Tesla MRI underwent radical prostatectomy for localized prostate cancer. Prostate specimens were examined to see whether the largest cancer area was the same as the area indicated on the MRI. Univariate and multivariate logistic regression analyses were conducted to identify variables predictive of agreement between MRI and histopathological findings. Sixty-six (56.9%) patients were suspected of having prostate cancer on the basis of MRI findings. In 49 of these patients (74.2%), it was considered that there was agreement between the abnormal area on the MRI and the index tumor. Univariate analysis revealed that there were significant differences in abnormal digital rectal examination, capsular penetration, the diameter of the index tumor of the radical prostatectomy specimen, and the Gleason scores of the biopsy and radical prostatectomy specimens. Multivariate analysis revealed that the Gleason score of the radical prostatectomy specimen was associated with the accurate detection of the prostate cancer by MRI (P=0.0177). In conclusion, 3.0-Tesla MRI tends to accurately diagnose prostate cancer with high tumor burden and aggressiveness. Multimodal examination (T2-weighted imaging, dynamic contrast-enhanced imaging, and diffusion-weighted imaging) is recommended for the diagnosis of prostate cancer using 3.0-Tesla MRI. (author)
GPU accelerated fully space and time resolved numerical simulations of self-focusing laser beams in SBS-active media

Energy Technology Data Exchange (ETDEWEB)

Mauger, Sarah; Colin de Verdière, Guillaume [CEA-DAM, DIF, 91297 Arpajon (France); Bergé, Luc, E-mail: luc.berge@cea.fr [CEA-DAM, DIF, 91297 Arpajon (France); Skupin, Stefan [Max Planck Institute for the Physics of Complex Systems, 01187 Dresden (Germany); Friedrich Schiller University, Institute of Condensed Matter Theory and Optics, 07743 Jena (Germany)

2013-02-15

A computer cluster equipped with Graphics Processing Units (GPUs) is used for simulating nonlinear optical wave packets undergoing Kerr self-focusing and stimulated Brillouin scattering in fused silica. We first recall the model equations in full (3+1) dimensions. These consist of two coupled nonlinear Schrödinger equations for counterpropagating optical beams closed with a source equation for light-induced acoustic waves seeded by thermal noise. Compared with simulations on a conventional cluster of Central Processing Units (CPUs), GPU-based computations allow us to use a significantly (16 times) larger number of mesh points within similar computation times. Reciprocally, simulations employing the same number of mesh points are between 3 and 20 times faster on GPUs than on the same number of classical CPUs. Performance speedups close to 45 are reported for isolated functions evaluating, e.g., the optical nonlinearities. Since the field intensities may reach the ionization threshold of silica, the action of a defocusing electron plasma is also addressed.
GPU accelerated fully space and time resolved numerical simulations of self-focusing laser beams in SBS-active media

International Nuclear Information System (INIS)

Mauger, Sarah; Colin de Verdière, Guillaume; Bergé, Luc; Skupin, Stefan

2013-01-01

A computer cluster equipped with Graphics Processing Units (GPUs) is used for simulating nonlinear optical wave packets undergoing Kerr self-focusing and stimulated Brillouin scattering in fused silica. We first recall the model equations in full (3+1) dimensions. These consist of two coupled nonlinear Schrödinger equations for counterpropagating optical beams closed with a source equation for light-induced acoustic waves seeded by thermal noise. Compared with simulations on a conventional cluster of Central Processing Units (CPUs), GPU-based computations allow us to use a significantly (16 times) larger number of mesh points within similar computation times. Reciprocally, simulations employing the same number of mesh points are between 3 and 20 times faster on GPUs than on the same number of classical CPUs. Performance speedups close to 45 are reported for isolated functions evaluating, e.g., the optical nonlinearities. Since the field intensities may reach the ionization threshold of silica, the action of a defocusing electron plasma is also addressed.
A high-speed DAQ framework for future high-level trigger and event building clusters

International Nuclear Information System (INIS)

Caselle, M.; Perez, L.E. Ardila; Balzer, M.; Dritschler, T.; Kopmann, A.; Mohr, H.; Rota, L.; Vogelgesang, M.; Weber, M.

2017-01-01

Modern data acquisition and trigger systems require a throughput of several GB/s and latencies of the order of microseconds. To satisfy such requirements, a heterogeneous readout system based on FPGA readout cards and GPU-based computing nodes coupled by InfiniBand has been developed. The incoming data from the back-end electronics is delivered directly into the internal memory of GPUs through a dedicated peer-to-peer PCIe communication. High performance DMA engines have been developed for direct communication between FPGAs and GPUs using 'DirectGMA (AMD)' and 'GPUDirect (NVIDIA)' technologies. The proposed infrastructure is a candidate for future generations of event building clusters, high-level trigger filter farms and low-level trigger systems. In this paper the heterogeneous FPGA-GPU architecture is presented and its performance discussed.
76 FR 60118 - Tesla Motors, Inc. Grant of Petition for Renewal of a Temporary Exemption From the Advanced Air...

Science.gov (United States)

2011-09-28

...-0070] Tesla Motors, Inc. Grant of Petition for Renewal of a Temporary Exemption From the Advanced Air... Protection. SUMMARY: This notice grants the petition of Tesla Motors, Inc. (Tesla) for the renewal of a... the one for Tesla. Over time, the number of petitions for exemption from the advanced air bag...
Theoretical and Experimental Research Performed on the Tesla Turbine - Part I

Directory of Open Access Journals (Sweden)

Dorian Nedelcu

2015-09-01

The paper presents the theoretical and experimental research performed on a Tesla turbine driven by compressed air and designed to equip a teaching laboratory [1], [2]. It introduces the operating principle of the Tesla turbine, invented by engineer Nikola Tesla, which uses discs instead of blades, mounted on a shaft at a small distance from one another. The turbine geometry and the results of stress and flow calculations performed on the turbine rotor and assembly, using the Simulation modules and the SolidWorks Flow Simulation program, are presented. After the design stage, the turbine becomes the subject of experimental research to determine the curve of speed as a function of pressure. The experimental research also focuses on the dynamic behaviour of the turbine [3].
Nikola Tesla: Why was he so much resisted and forgotten? [Retrospectroscope].

Science.gov (United States)

Valentinuzzi, Max E; Ortiz, Martin Hill; Cervantes, Daniel; Leder, Ron S

2016-01-01

Recently, during the Christmas season, a friend of mine visited me and, sneaking a look at my bookshelves, found two rather old Nikola Tesla biographies, which I had used to prepare a "Retrospectroscope" column for the then-named IEEE Engineering in Medicine and Biology Magazine when our dear friend Alvin Wald was its editor-in-chief [2]. Eighteen years have elapsed since then; soon, the idea came up of revamping the article. Cynthia Weber, the magazine's current associate editor, considered it acceptable, and here is the new note divided in two parts: that is, a slightly revised version of the original article followed by new material, including some quite interesting information regarding Tesla's homes and laboratories. On top of this, Tesla is not devoid of a science fiction touch, as mentioned at the end.
Interior Point Methods on GPU with application to Model Predictive Control

DEFF Research Database (Denmark)

Gade-Nielsen, Nicolai Fog

The goal of this thesis is to investigate the application of interior point methods to solve dynamical optimization problems, using a graphical processing unit (GPU) with a focus on problems arising in Model Predictive Control (MPC). Multi-core processors have been available for over ten years now...... software package called GPUOPT, available under the non-restrictive MIT license. GPUOPT includes a primal-dual interior-point method, which supports both the CPU and the GPU. It is implemented as multiple components, where the matrix operations and solver for the Newton directions are separated...
NIKOLA TESLA AND MEDICINE: 160TH ANNIVERSARY OF THE BIRTH OF THE GENIUS WHO GAVE LIGHT TO THE WORLD - PART II.

Science.gov (United States)

Vucevic, Danijela; Dordevic, Drago; Radosavljevic, Tatjana

2016-11-01

Nikola Tesla (1856-1943) was a genius inventor and scientist, whose contribution to medicine is remarkable. Part I of this article reviewed special contributions of the world renowned scientist to the establishment of radiology as a new discipline in medicine. This paper deals with the use of Tesla currents in medicine. Tesla Currents in Medicine. Tesla's greatest impact on medicine is his invention of a transformer (Tesla coil) for producing high frequency and high voltage currents (Tesla currents). Tesla currents are used in diathermy, as they, while passing through the body, transform electrical energy into therapeutic heat. In 1891, Tesla passed currents through his own body and was the first to experience their beneficial effects. He kept correspondence on electrotherapy with J. Dugan and S. H. Monell. In 1896, he used high frequency currents and designed an ozone generator for producing ozone, with powerful antiseptic and antibacterial properties. Tesla is famous for his extensive experiments with mechanical vibrations and resonance, examining their effects on the organism and pioneering their use for medical purposes. Tesla also designed an oscillator to relieve fatigue of the leg muscles. It is less known that Tesla's inventions (Tesla coil and wireless remote control) are widely used in modern medical equipment. Apart from this, wireless technology is nowadays widely applied in numerous diagnostic and therapeutic procedures. Nikola Tesla was the last Renaissance figure of the modern era. Tesla bridged three centuries and two millennia by his inventions, and permanently indebted humankind by his epochal discoveries.
GPU-accelerated FDTD modeling of radio-frequency field-tissue interactions in high-field MRI.

Science.gov (United States)

Chi, Jieru; Liu, Feng; Weber, Ewald; Li, Yu; Crozier, Stuart

2011-06-01

The analysis of high-field RF field-tissue interactions requires high-performance finite-difference time-domain (FDTD) computing. Conventional CPU-based FDTD calculations offer limited computing performance in a PC environment. This study presents a graphics processing unit (GPU)-based parallel-computing framework, producing substantially boosted computing efficiency (with a two-order speedup factor) at a PC-level cost. Specific details of implementing the FDTD method on a GPU architecture have been presented and the new computational strategy has been successfully applied to the design of a novel 8-element transceive RF coil system at 9.4 T. Facilitated by the powerful GPU-FDTD computing, the new RF coil array offers optimized fields (averaging 25% improvement in sensitivity, and 20% reduction in loop coupling compared with conventional array structures of the same size) for small animal imaging with a robust RF configuration. The GPU-enabled acceleration paves the way for FDTD to be applied for both detailed forward modeling and inverse design of MRI coils, which were previously impractical.
Low frequency AC losses in multi filamentary superconductors up to 15 Tesla

International Nuclear Information System (INIS)

Orlando, T.; Braun, C.; Foner, S.; Schwartz, B.; Zieba, A.

1983-01-01

Low frequency (1 Hz) ac losses were measured in a variety of A15 superconducting wires having different fiber geometries. Field modulations of less than or equal to 1 tesla were superimposed on a fixed background field up to 15 tesla. Losses were measured for Nb3Sn in continuous fiber, modified jelly-roll, In Situ, and powder metallurgy processed materials, and for Nb3Al powder metallurgy processed materials. The results are compared with dc magnetization measurements. The losses are purely hysteretic at these low frequencies, scale with J_c (above about 3 tesla), and are reduced substantially by twisting for all the materials. The lowest losses are observed for the Nb3Al wires.
GPU-Accelerated Stony-Brook University 5-class Microphysics Scheme in WRF

Science.gov (United States)

Mielikainen, J.; Huang, B.; Huang, A.

2011-12-01

The Weather Research and Forecasting (WRF) model is a next-generation mesoscale numerical weather prediction system. Microphysics plays an important role in weather and climate prediction. Several bulk water microphysics schemes are available within the WRF, with different numbers of simulated hydrometeor classes and methods for estimating their size, fall speeds, distributions and densities. The Stony-Brook University scheme (SBU-YLIN) is a 5-class scheme with riming intensity predicted to account for mixed-phase processes. In the past few years, co-processing on Graphics Processing Units (GPUs) has been a disruptive technology in High Performance Computing (HPC). GPUs use the ever increasing transistor count for adding more processor cores. Therefore, GPUs are well suited for massively data parallel processing with high floating point arithmetic intensity. Thus, it is imperative to update legacy scientific applications to take advantage of this unprecedented increase in computing power. CUDA is an extension to the C programming language that allows GPUs to be programmed directly. It is designed so that its constructs allow for natural expression of data-level parallelism. A CUDA program is organized into two parts: a serial program running on the CPU and a CUDA kernel running on the GPU. The CUDA code consists of three computational phases: transmission of data into the global memory of the GPU, execution of the CUDA kernel, and transmission of results from the GPU into the memory of the CPU. CUDA takes a bottom-up point of view of parallelism in which a thread is the atomic unit of parallelism. Individual threads are part of groups called warps, within which every thread executes exactly the same sequence of instructions. To test SBU-YLIN, we used a CONtinental United States (CONUS) benchmark data set for a 12 km resolution domain for October 24, 2001. A WRF domain is a geographic region of interest discretized into a 2-dimensional grid parallel to the ground. Each grid point has
Cryogenic system for the 45 Tesla hybrid magnet

International Nuclear Information System (INIS)

Van Sciver, S.W.; Miller, J.R.; Welton, S.; Schneider-Muntau, H.J.; McIntosh, G.E.

1994-01-01

The 45 Tesla hybrid magnet system will consist of a 14 Tesla superconducting outsert magnet and a 31 Tesla water cooled insert. The magnet is planned for operation in early 1995 at the National High Magnetic Field Laboratory. Its purpose is to provide the highest DC magnetic fields for the materials research community. The present paper discusses the overall design of the cryogenic system for the superconducting magnet. Unique features of this system include static 1.8 K pressurized He II as a coolant for the magnet and a refrigerated structural support system for load transfer during fault conditions. The system will consist of two connected cryostats. The magnet is contained within one cryostat which has a clear warm bore of 616 mm and is designed to be free of system interfaces and therefore minimize interference with the magnet user. A second supply cryostat provides the connections to the refrigeration system and magnet power supply. The magnet and supply cryostats are connected to each other through a horizontal services duct section. Issues to be discussed in the present paper include design and thermal analysis of the magnet system during cooldown and in steady state operation and overall cryogenic system design
SU-F-J-166: Volumetric Spatial Distortions Comparison for 1.5 Tesla Versus 3 Tesla MRI for Gamma Knife Radiosurgery Scans Using Frame Marker Fusion and Co-Registration Modes

International Nuclear Information System (INIS)

Neyman, G

2016-01-01

Purpose: To compare typical volumetric spatial distortions for 1.5 Tesla versus 3 Tesla MRI Gamma Knife radiosurgery scans in the frame marker fusion and co-registration frame-less modes. Methods: A Quasar phantom by Modus Medical Devices Inc. with GRID image distortion software was used for measurements of volumetric distortions. 3D volumetric T1 weighted scans of the phantom were produced on 1.5 T Avanto and 3 T Skyra MRI Siemens scanners. The analysis was done two ways: for scans with localizer markers from the Leksell frame and relative to the phantom only (simulated co-registration technique). The phantom grid contained a total of 2002 vertices or control points that were used in the assessment of volumetric geometric distortion for all scans. Results: Volumetric mean absolute spatial deviations relative to the frame localizer markers for the 1.5 and 3 Tesla machines were 1.39 ± 0.15 and 1.63 ± 0.28 mm, with max errors of 1.86 and 2.65 mm, respectively. Mean 2D errors from the Gamma Plan were 0.3 and 1.0 mm. For the simulated co-registration technique, the volumetric mean absolute spatial deviations relative to the phantom for the 1.5 and 3 Tesla machines were 0.36 ± 0.08 and 0.62 ± 0.13 mm, with max errors of 0.57 and 1.22 mm, respectively. Conclusion: Volumetric spatial distortions are lower for 1.5 Tesla versus 3 Tesla MRI machines localized with markers on frames and significantly lower for co-registration techniques with no frame localization. The results show the advantage of using the co-registration technique for minimizing MRI volumetric spatial distortions, which can be especially important for the steep dose gradient fields typically used in Gamma Knife radiosurgery. Consultant for Elekta AB.
SU-F-J-166: Volumetric Spatial Distortions Comparison for 1.5 Tesla Versus 3 Tesla MRI for Gamma Knife Radiosurgery Scans Using Frame Marker Fusion and Co-Registration Modes

Energy Technology Data Exchange (ETDEWEB)

Neyman, G [The Cleveland Clinic Foundation, Cleveland, OH (United States)

2016-06-15

Purpose: To compare typical volumetric spatial distortions for 1.5 Tesla versus 3 Tesla MRI Gamma Knife radiosurgery scans in the frame marker fusion and co-registration frame-less modes. Methods: A Quasar phantom by Modus Medical Devices Inc. with GRID image distortion software was used for measurements of volumetric distortions. 3D volumetric T1 weighted scans of the phantom were produced on 1.5 T Avanto and 3 T Skyra MRI Siemens scanners. The analysis was done two ways: for scans with localizer markers from the Leksell frame and relative to the phantom only (simulated co-registration technique). The phantom grid contained a total of 2002 vertices or control points that were used in the assessment of volumetric geometric distortion for all scans. Results: Volumetric mean absolute spatial deviations relative to the frame localizer markers for the 1.5 and 3 Tesla machines were 1.39 ± 0.15 and 1.63 ± 0.28 mm, with max errors of 1.86 and 2.65 mm, respectively. Mean 2D errors from the Gamma Plan were 0.3 and 1.0 mm. For the simulated co-registration technique, the volumetric mean absolute spatial deviations relative to the phantom for the 1.5 and 3 Tesla machines were 0.36 ± 0.08 and 0.62 ± 0.13 mm, with max errors of 0.57 and 1.22 mm, respectively. Conclusion: Volumetric spatial distortions are lower for 1.5 Tesla versus 3 Tesla MRI machines localized with markers on frames and significantly lower for co-registration techniques with no frame localization. The results show the advantage of using the co-registration technique for minimizing MRI volumetric spatial distortions, which can be especially important for the steep dose gradient fields typically used in Gamma Knife radiosurgery. Consultant for Elekta AB.
GPU accelerated FDTD solver and its application in MRI.

Science.gov (United States)

Chi, J; Liu, F; Jin, J; Mason, D G; Crozier, S

2010-01-01

The finite difference time domain (FDTD) method is a popular technique for computational electromagnetics (CEM). The large computational power often required, however, has been a limiting factor for its applications. In this paper, we will present a graphics processing unit (GPU)-based parallel FDTD solver and its successful application to the investigation of a novel B1 shimming scheme for high-field magnetic resonance imaging (MRI). The optimized shimming scheme exhibits considerably improved transmit B1 profiles. The GPU implementation dramatically shortened the runtime of FDTD simulation of electromagnetic field compared with its CPU counterpart. The acceleration in runtime has made such investigation possible, and will pave the way for other studies of large-scale computational electromagnetic problems in modern MRI which were previously impractical.
GPU based acceleration of first principles calculation

International Nuclear Information System (INIS)

Tomono, H; Tsumuraya, K; Aoki, M; Iitaka, T

2010-01-01

We present Graphics Processing Unit (GPU)-accelerated first-principles electronic structure calculations. The FFT, which is the most time-consuming part, is accelerated about 10 times. As a result, the total computation time of a first-principles calculation is reduced to 15 percent of that on the CPU.
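One way to read the two figures in this record together is through Amdahl's law. The Python sketch below assumes, as a simplification that the abstract does not state, that only the FFT was accelerated and that everything else ran at CPU speed; the function names are hypothetical and introduced only for illustration.

```python
def accelerated_fraction(total_ratio: float, part_speedup: float) -> float:
    """Fraction f of the original run time spent in the accelerated part.

    Solves Amdahl's law, total_ratio = (1 - f) + f / part_speedup, for f.
    """
    return (1.0 - total_ratio) / (1.0 - 1.0 / part_speedup)


def amdahl_ratio(fraction: float, part_speedup: float) -> float:
    """New run time divided by old run time when `fraction` is sped up."""
    return (1.0 - fraction) + fraction / part_speedup


# Figures quoted in the abstract: FFT about 10x faster, total time down to 15%.
f = accelerated_fraction(0.15, 10.0)
print(f"{f:.3f}")
```

Under that assumption, the FFT would have accounted for roughly 94% of the original CPU run time.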
FULL GPU Implementation of Lattice-Boltzmann Methods with Immersed Boundary Conditions for Fast Fluid Simulations

Directory of Open Access Journals (Sweden)

G Boroni

2017-03-01

The Lattice Boltzmann Method (LBM) has shown great potential in fluid simulations, but performance issues and difficulties in managing complex boundary conditions have hindered wider application. The advent of Graphics Processing Unit (GPU) computing offered a possible solution to the performance issue, and methods like the Immersed Boundary (IB) algorithm proved to be a flexible solution for boundaries. Unfortunately, the implicit IB algorithm makes the LBM implementation on the GPU a non-trivial task. This work presents a fully parallel GPU implementation of LBM in combination with IB. The fluid-boundary interaction is implemented via GPU kernels, using execution configurations and data structures specifically designed to accelerate each code execution. Simulations were validated against experimental and analytical data, showing good agreement and improved computation time. Substantial reductions in calculation time were achieved: the time required is about two orders of magnitude lower than that needed to execute the same model on a CPU.
Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics

Energy Technology Data Exchange (ETDEWEB)

Ronald Babich, Michael Clark, Balint Joo

2010-11-01

Graphics Processing Units (GPUs) are having a transformational effect on numerical lattice quantum chromodynamics (LQCD) calculations of importance in nuclear and particle physics. The QUDA library provides a package of mixed precision sparse matrix linear solvers for LQCD applications, supporting single GPUs based on NVIDIA's Compute Unified Device Architecture (CUDA). This library, interfaced to the QDP++/Chroma framework for LQCD calculations, is currently in production use on the "9g" cluster at the Jefferson Laboratory, enabling unprecedented price/performance for a range of problems in LQCD. Nevertheless, memory constraints on current GPU devices limit the problem sizes that can be tackled. In this contribution we describe the parallelization of the QUDA library onto multiple GPUs using MPI, including strategies for the overlapping of communication and computation. We report on both weak and strong scaling for up to 32 GPUs interconnected by InfiniBand, on which we sustain in excess of 4 Tflops.
Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics

International Nuclear Information System (INIS)

Babich, Ronald; Clark, Michael; Joo, Balint

2010-01-01

Graphics Processing Units (GPUs) are having a transformational effect on numerical lattice quantum chromodynamics (LQCD) calculations of importance in nuclear and particle physics. The QUDA library provides a package of mixed precision sparse matrix linear solvers for LQCD applications, supporting single GPUs based on NVIDIA's Compute Unified Device Architecture (CUDA). This library, interfaced to the QDP++/Chroma framework for LQCD calculations, is currently in production use on the '9g' cluster at the Jefferson Laboratory, enabling unprecedented price/performance for a range of problems in LQCD. Nevertheless, memory constraints on current GPU devices limit the problem sizes that can be tackled. In this contribution we describe the parallelization of the QUDA library onto multiple GPUs using MPI, including strategies for the overlapping of communication and computation. We report on both weak and strong scaling for up to 32 GPUs interconnected by InfiniBand, on which we sustain in excess of 4 Tflops.

High Performance Processing and Analysis of Geospatial Data Using CUDA on GPU

Directory of Open Access Journals (Sweden)

STOJANOVIC, N.

2014-11-01

In this paper, the high-performance processing of massive geospatial data on a many-core GPU (Graphics Processing Unit) is presented. We use the CUDA (Compute Unified Device Architecture) programming framework to implement parallel processing of common Geographic Information Systems (GIS) algorithms, such as viewshed analysis and map-matching. Experimental evaluation indicates an improvement in performance with respect to CPU-based solutions and shows the feasibility of using GPUs and CUDA for parallel implementation of GIS algorithms over large-scale geospatial datasets.
Study on GPU Computing for SCOPE2 with CUDA

International Nuclear Information System (INIS)

Kodama, Yasuhiro; Tatsumi, Masahiro; Ohoka, Yasunori

2011-01-01

For improving the safety and cost effectiveness of nuclear power plants, a core calculation code, SCOPE2, has been developed, which adopts detailed calculation models such as the multi-group nodal SP3 transport calculation method in three-dimensional pin-by-pin geometry to achieve high predictability. However, it is difficult to apply the code to loading pattern optimizations since it requires much longer computation time than codes based on the nodal diffusion method, which is widely used in core design calculations. In this study, we examined the possibility of accelerating SCOPE2 with GPU computing, which has been recognized as one of the most promising directions in high performance computing. In the previous study with an experimental programming framework, much effort was required to convert the algorithms to ones suited to GPU computation, and this conversion proved tremendously difficult because of the complexity of the algorithms and restrictions in implementation. In this study, to overcome this complexity, we utilized the CUDA programming environment provided by NVIDIA, a versatile and flexible extension to the C/C++ languages. Test implementation of GPU kernels for neutron diffusion/simplified P3 equation solvers confirmed that high performance can be obtained without degrading maintainability. (author)
Beamstrahlung Photon Load on the TESLA Extraction Septum Blade (LCC-0104)

Energy Technology Data Exchange (ETDEWEB)

Seryi, A

2003-10-02

This note describes work performed in the framework of the International Linear Collider Technical Review Committee [1] to estimate the power load on the TESLA extraction septum blade due to beamstrahlung photons. It is shown that under realistic conditions the photon load can be several orders of magnitude higher than what was estimated in the TESLA TDR [2] for ideal Gaussian beams, potentially representing a serious limitation of the current design.
Beamstrahlung Photon Load on the TESLA Extraction Septum Blade (LCC-0104)

International Nuclear Information System (INIS)

Seryi, A

2003-01-01

This note describes work performed in the framework of the International Linear Collider Technical Review Committee [1] to estimate the power load on the TESLA extraction septum blade due to beamstrahlung photons. It is shown that under realistic conditions the photon load can be several orders of magnitude higher than what was estimated in the TESLA TDR [2] for ideal Gaussian beams, potentially representing a serious limitation of the current design.
GPU Based Software Correlators - Perspectives for VLBI2010

Science.gov (United States)

Hobiger, Thomas; Kimura, Moritaka; Takefuji, Kazuhiro; Oyama, Tomoaki; Koyama, Yasuhiro; Kondo, Tetsuro; Gotoh, Tadahiro; Amagai, Jun

2010-01-01

Caused by historical separation and driven by the requirements of the PC gaming industry, Graphics Processing Units (GPUs) have evolved into massively parallel processing systems which have entered the area of non-graphics applications. Although a single processing core on the GPU is much slower and provides less functionality than its counterpart on the CPU, the huge number of these small processing entities outperforms classical processors when the application can be parallelized. Thus, in recent years various radio astronomical projects have started to make use of this technology, either to realize the correlator on this platform or to establish the post-processing pipeline with GPUs. Therefore, the feasibility of GPUs as a choice for a VLBI correlator is being investigated, including the pros and cons of this technology. Additionally, a GPU based software correlator will be reviewed with respect to energy consumption/GFlop/sec and cost/GFlop/sec.
Hybrid GPU-CPU adaptive precision ray-triangle intersection tests for robust high-performance GPU dosimetry computations

International Nuclear Information System (INIS)

Perrotte, Lancelot; Bodin, Bruno; Chodorge, Laurent

2011-01-01

Before an intervention on a nuclear site, it is essential to study different scenarios to identify the least dangerous one for the operator. It is therefore mandatory to have an efficient dosimetry simulation code that gives accurate results. One classical method in radiation protection is the straight-line attenuation method with build-up factors. In the case of 3D industrial scenes composed of meshes, the computation cost lies in computing all of the intersections between the rays and the triangles of the scene. Efficient GPU algorithms have already been proposed that enable dosimetry calculation for a huge scene (800000 rays, 800000 triangles) in a fraction of a second. But these algorithms are not robust: because of the rounding caused by floating-point arithmetic, the numerical results of the ray-triangle intersection tests can differ from the expected mathematical results. In the worst case, this can lead to a computed dose rate dramatically lower than the real dose rate to which the operator is exposed. In this paper, we present a hybrid GPU-CPU algorithm to manage adaptive precision floating-point arithmetic. This algorithm allows robust ray-triangle intersection tests, with very small loss of performance (less than 5% overhead), and without any need for scene-dependent tuning. (author)
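The adaptive precision idea described above can be illustrated with a filtered orientation predicate, the building block of many ray-triangle intersection tests. The sketch below is a CPU-only Python illustration and not the authors' hybrid GPU-CPU algorithm: it evaluates the sign of a 3x3 determinant in floating point, accepts the result when it clearly exceeds an error bound, and otherwise recomputes it exactly with rational arithmetic. The function names and the deliberately loose relative bound `rel_bound` are assumptions made for this example.

```python
from fractions import Fraction


def _det3(r0, r1, r2):
    """Determinant of the 3x3 matrix with rows r0, r1, r2."""
    return (r0[0] * (r1[1] * r2[2] - r1[2] * r2[1])
            - r0[1] * (r1[0] * r2[2] - r1[2] * r2[0])
            + r0[2] * (r1[0] * r2[1] - r1[1] * r2[0]))


def _magnitude(r0, r1, r2):
    """Sum of absolute values of the determinant terms (scale for the filter)."""
    return (abs(r0[0]) * (abs(r1[1] * r2[2]) + abs(r1[2] * r2[1]))
            + abs(r0[1]) * (abs(r1[0] * r2[2]) + abs(r1[2] * r2[0]))
            + abs(r0[2]) * (abs(r1[0] * r2[1]) + abs(r1[1] * r2[0])))


def orient3d_sign(a, b, c, d, rel_bound=1e-14):
    """Sign of det[a-d, b-d, c-d]: which side of plane (a, b, c) point d lies on.

    Returns (sign, path), where path is "fast" when the floating-point
    result was trusted and "exact" when rational arithmetic was needed.
    rel_bound is an assumed, conservative filter constant.
    """
    rows = [[p[i] - d[i] for i in range(3)] for p in (a, b, c)]
    det = _det3(*rows)
    if abs(det) > rel_bound * _magnitude(*rows):
        return (det > 0) - (det < 0), "fast"
    # Ambiguous case: floats convert to Fractions without rounding,
    # so this determinant is evaluated exactly.
    exact_rows = [[Fraction(p[i]) - Fraction(d[i]) for i in range(3)]
                  for p in (a, b, c)]
    det = _det3(*exact_rows)
    return (det > 0) - (det < 0), "exact"


triangle = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
above = orient3d_sign(*triangle, (0.0, 0.0, 1.0))      # clear case
coplanar = orient3d_sign(*triangle, (0.5, 0.5, 0.0))   # degenerate case
```

Only the ambiguous cases pay for exact arithmetic, which is the general reason a filtered scheme can stay close to pure floating-point speed.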
Design and performance of a Tesla transformer type relativistic electron beam generator

International Nuclear Information System (INIS)

Jain, K.K.; Chennareddy, D.; John, P.I.; Saxena, Y.C.

1986-01-01

A relativistic electron beam generator driven by an air core Tesla transformer is described. The Tesla transformer circuit analysis is outlined and computational results are presented for the case when the coaxial water line has finite resistance. The transformer has a coupling coefficient of 0.56 and a step-up ratio of 25. The Tesla transformer can provide 800 kV at the peak of the second half cycle of the secondary output voltage and has been tested up to 600 kV. A 100-200 keV, 15-20 kA electron beam having 150 ns pulse width has been obtained. The beam generator described is being used for the beam injection into a toroidal device BETA. (author). 20 refs. 9 figures
Parallelized computation for computer simulation of electrocardiograms using personal computers with multi-core CPU and general-purpose GPU.

Science.gov (United States)

Shen, Wenfeng; Wei, Daming; Xu, Weimin; Zhu, Xin; Yuan, Shizhong

2010-10-01

Biological computations like electrocardiological modelling and simulation usually require high-performance computing environments. This paper introduces an implementation of parallel computation for computer simulation of electrocardiograms (ECGs) in a personal computer environment with an Intel Core (TM) 2 Quad Q6600 CPU and a GeForce 8800GT GPU, with software support from OpenMP and CUDA. It was tested in three parallelization device setups: (a) a four-core CPU without a general-purpose GPU, (b) a general-purpose GPU plus 1 core of the CPU, and (c) a four-core CPU plus a general-purpose GPU. To take effective advantage of a multi-core CPU and a general-purpose GPU, an algorithm based on load-prediction dynamic scheduling was developed and applied to setting (c). In the simulation with 1600 time steps, the speedup of the parallel computation compared to the serial computation was 3.9 in setting (a), 16.8 in setting (b), and 20.0 in setting (c). This study demonstrates that a current PC with a multi-core CPU and a general-purpose GPU provides a good environment for parallel computations in biological modelling and simulation studies.
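The abstract does not spell out the load-prediction dynamic scheduling algorithm, so the following is only a minimal sketch of the general idea under stated assumptions: the work of each time step is split between CPU and GPU in proportion to throughput estimates, and the estimates are refreshed from the measured time of the previous step. All names (`split_work`, `update_rate`, `simulate`), the smoothing factor, and the device rates are hypothetical.

```python
def split_work(n_items, cpu_rate, gpu_rate):
    """Split n_items in proportion to the predicted device throughputs."""
    cpu_share = cpu_rate / (cpu_rate + gpu_rate)
    n_cpu = round(n_items * cpu_share)
    return n_cpu, n_items - n_cpu


def update_rate(old_rate, items_done, elapsed, smoothing=0.5):
    """Blend the newly measured throughput into the running prediction."""
    if items_done == 0 or elapsed <= 0:
        return old_rate
    measured = items_done / elapsed
    return smoothing * measured + (1.0 - smoothing) * old_rate


def simulate(n_items, steps, true_cpu_rate, true_gpu_rate):
    """Return the wall time of each step for a toy two-device system."""
    cpu_rate = gpu_rate = 1.0          # no prior knowledge: equal split
    step_times = []
    for _ in range(steps):
        n_cpu, n_gpu = split_work(n_items, cpu_rate, gpu_rate)
        t_cpu = n_cpu / true_cpu_rate  # stand-in for a measured CPU time
        t_gpu = n_gpu / true_gpu_rate  # stand-in for a measured GPU time
        step_times.append(max(t_cpu, t_gpu))  # devices run concurrently
        cpu_rate = update_rate(cpu_rate, n_cpu, t_cpu)
        gpu_rate = update_rate(gpu_rate, n_gpu, t_gpu)
    return step_times


# Hypothetical rates: the GPU processes items four times faster than the CPU.
times = simulate(n_items=1000, steps=10, true_cpu_rate=100.0, true_gpu_rate=400.0)
```

In this toy run the first step uses an even split and is limited by the slower device; later steps approach the balanced split in which both devices finish at the same time.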
Real-Time GPU Implementation of Transverse Oscillation Vector Velocity Flow Imaging

DEFF Research Database (Denmark)

Bradway, David; Pihl, Michael Johannes; Krebs, Andreas

2014-01-01

Rapid estimation of blood velocity and visualization of complex flow patterns are important for clinical use of diagnostic ultrasound. This paper presents real-time processing for two-dimensional (2-D) vector flow imaging which utilizes an off-the-shelf graphics processing unit (GPU). In this work...... vector flow acquisition takes 2.3 milliseconds on an Advanced Micro Devices Radeon HD 7850 GPU card. The detected velocities are accurate to within the precision limit of the output format of the display routine. Because this tool was developed as a module external to the scanner’s built...
GPU accelerated likelihoods for stereo-based articulated tracking

DEFF Research Database (Denmark)

Friborg, Rune Møllegaard; Hauberg, Søren; Erleben, Kenny

2010-01-01

than a traditional CPU implementation. We explain the non-intuitive steps required to attain an optimized GPU implementation, where the dominant part is to hide the memory latency effectively. Benchmarks show that computations which previously required several minutes, are now performed in few seconds....
TeSLA presentation for staff of AMN

NARCIS (Netherlands)

Janssen, José

2017-01-01

Presentation on online assessment for staff of AMN (www.amn.nl). Topics: assessment research at the Welten Institute and, more specifically, the TeSLA project, in which instruments for authentication and authorship verification are combined to make reliable remote assessment possible.
Hippocampal MRI volumetry at 3 Tesla: reliability and practical guidance.

Science.gov (United States)

Jeukens, Cécile R L P N; Vlooswijk, Mariëlle C G; Majoie, H J Marian; de Krom, Marc C T F M; Aldenkamp, Albert P; Hofman, Paul A M; Jansen, Jacobus F A; Backes, Walter H

2009-09-01

Although volumetry of the hippocampus is considered to be an established technique, protocols reported in literature are not described in great detail. This article provides a complete and detailed protocol for hippocampal volumetry applicable to T1-weighted magnetic resonance (MR) images acquired at 3 Tesla, which has become the standard for structural brain research. The protocol encompasses T1-weighted image acquisition at 3 Tesla, anatomic guidelines for manual hippocampus delineation, requirements of delineation software, reliability measures, and criteria to assess and ensure sufficient reliability. Moreover, the validity of the correction for total intracranial volume size was critically assessed. The protocol was applied by 2 readers to the MR images of 36 patients with cryptogenic localization-related epilepsy, 4 patients with unilateral hippocampal sclerosis, and 20 healthy control subjects. The uncorrected hippocampal volumes were 2923 ± 500 mm³ (mean ± SD) (left) and 3120 ± 416 mm³ (right) for the patient group and 3185 ± 411 mm³ (left) and 3302 ± 411 mm³ (right) for the healthy control group. The volume of the 4 pathologic hippocampi of the patients with unilateral hippocampal sclerosis was 2980 ± 422 mm³. The inter-reader reliability values were determined: intraclass-correlation-coefficient (ICC) = 0.87 (left) and 0.86 (right), percentage volume difference (VD) = 7.0 ± 4.7% (left) and 6.0 ± 3.8% (right), and overlap ratio (OR) = 0.82 ± 0.04 (left) and 0.82 ± 0.03 (right). The positive Pearson correlation between hippocampal volume and total intracranial volume was found to be low: r = 0.48 (P = 0.03, left) and r = 0.62 (P = 0.004, right) and did not significantly reduce the volumetric variances, showing the limited benefit of the brain size correction. A protocol was described to determine hippocampal volumes based on 3 Tesla MR images with high inter-reader reliability. Although the reliability of hippocampal volumetry at 3 Tesla
Nucleonic instruments from VUPJT Tesla

International Nuclear Information System (INIS)

Smola, J.

1986-01-01

The instruments currently produced by Tesla Premysleni are listed and briefly characterized. They include a low level alpha-beta counter, an automatic low level alpha-beta counter, detection units for environmental sample counting, instruments for measuring specific activity of liquids and radon concentration in water, a radioactive aerosol meter, dose ratemeters, portable alpha-beta indicators for surface contamination monitoring, neutron monitors, and single-, two- and three-channel spectrometric units. (M.D.)
An optimized D2Q37 Lattice Boltzmann code on GP-GPUs

NARCIS (Netherlands)

Biferale, L.; Mantovani, F.; Pivanti, M.; Pozzati, F.; Sbragaglia, M.; Scagliarini, Andrea; Schifano, S.F.; Toschi, F.

2013-01-01

We describe the implementation of a thermal compressible Lattice Boltzmann algorithm on an NVIDIA Tesla C2050 system based on the Fermi GP-GPU. We consider two different versions, including and not including reactive effects. We describe the overall organization of the algorithm and give details on
First Evaluation of the CPU, GPGPU and MIC Architectures for Real Time Particle Tracking based on Hough Transform at the LHC

CERN Document Server

Halyo, V.; Lujan, P.; Karpusenko, V.; Vladimirov, A.

2014-04-07

Recent innovations focused around parallel processing, either through systems containing multiple processors or processors containing multiple cores, hold great promise for enhancing the performance of the trigger at the LHC and extending its physics program. The flexibility of the CMS/ATLAS trigger system allows for easy integration of computational accelerators, such as NVIDIA's Tesla Graphics Processing Unit (GPU) or Intel's Xeon Phi, in the High Level Trigger. These accelerators have the potential to provide faster or more energy efficient event selection, thus opening up possibilities for new complex triggers that were not previously feasible. At the same time, it is crucial to explore the performance limits achievable on the latest generation multicore CPUs with the use of the best software optimization methods. In this article, a new tracking algorithm based on the Hough transform will be evaluated for the first time on a multi-core Intel Xeon E5-2697v2 CPU, an NVIDIA Tesla K20c GPU, and an Intel \\x...
GPU-accelerated brain connectivity reconstruction and visualization in large-scale electron micrographs

KAUST Repository

Jeong, Wonki

2011-01-01

This chapter introduces a GPU-accelerated interactive, semiautomatic axon segmentation and visualization system. Two challenging problems have been addressed: the interactive 3D axon segmentation and the interactive 3D image filtering and rendering of implicit surfaces. The reconstruction of neural connections to understand the function of the brain is an emerging and active research area in neuroscience. With the advent of high-resolution scanning technologies, such as 3D light microscopy and electron microscopy (EM), reconstruction of complex 3D neural circuits from large volumes of neural tissues has become feasible. Among them, only EM data can provide sufficient resolution to identify synapses and to resolve extremely narrow neural processes. These high-resolution, large-scale datasets pose challenging problems, for example, how to process and manipulate large datasets to extract scientifically meaningful information using a compact representation in a reasonable processing time. The running time of the multiphase level set segmentation method has been measured on the CPU and GPU. The CPU version is implemented using the ITK image class and the ITK distance transform filter. The numerical part of the CPU implementation is similar to the GPU implementation for fair comparison. The main focus of this chapter is introducing the GPU algorithms and their implementation details, which are the core components of the interactive segmentation and visualization system.
Research on GPU-accelerated algorithm in 3D finite difference neutron diffusion calculation method

International Nuclear Information System (INIS)

Xu Qi; Yu Ganglin; Wang Kan; Sun Jialong

2014-01-01

In this paper, the adaptability of the neutron diffusion numerical algorithm on GPUs was studied, and a GPU-accelerated multi-group 3D neutron diffusion code based on finite difference method was developed. The IAEA 3D PWR benchmark problem was calculated in the numerical test. The results demonstrate both high efficiency and adequate accuracy of the GPU implementation for neutron diffusion equation. (authors)
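To make the finite difference approach concrete, here is a much reduced sketch: a one-group, one-dimensional slab solved with Jacobi iteration, rather than the multi-group 3D code described above. Jacobi iteration is chosen here only because every node update is independent of the others, which is the property that maps well onto GPU threads; the abstract does not state which solver the authors used. All names and parameter values are assumptions.

```python
import math


def solve_slab_diffusion(n_cells, length, diff_coef, sigma_a, source,
                         tol=1e-12, max_iter=100_000):
    """Solve -D phi'' + sigma_a phi = S on [0, length] with phi = 0 at both ends.

    Uses second-order central differences and Jacobi iteration.
    Returns the flux at the n_cells + 1 grid nodes.
    """
    h = length / n_cells
    coupling = diff_coef / h ** 2
    diagonal = 2.0 * coupling + sigma_a
    phi = [0.0] * (n_cells + 1)
    for _ in range(max_iter):
        new_phi = [0.0] * (n_cells + 1)      # boundary nodes stay at zero
        for i in range(1, n_cells):
            # Each interior node depends only on the previous iterate,
            # so all nodes could be updated in parallel.
            new_phi[i] = (source + coupling * (phi[i - 1] + phi[i + 1])) / diagonal
        change = max(abs(a - b) for a, b in zip(new_phi, phi))
        phi = new_phi
        if change < tol:
            break
    return phi


# Hypothetical one-group data: D = 1, sigma_a = 1, uniform source S = 1.
flux = solve_slab_diffusion(n_cells=50, length=10.0, diff_coef=1.0,
                            sigma_a=1.0, source=1.0)

# Analytic mid-plane flux for this problem: (S / sigma_a) * (1 - 1 / cosh(L / 2)).
analytic_mid = 1.0 - 1.0 / math.cosh(5.0)
```

The mid-plane value `flux[25]` can be compared with `analytic_mid` as a quick accuracy check of the discretization.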
A GPU-based finite-size pencil beam algorithm with 3D-density correction for radiotherapy dose calculation

International Nuclear Information System (INIS)

Gu Xuejun; Jia Xun; Jiang, Steve B; Jelen, Urszula; Li Jinsheng

2011-01-01

Aiming to develop an accurate and efficient dose calculation engine for online adaptive radiotherapy, we have implemented a finite-size pencil beam (FSPB) algorithm with a 3D-density correction method on a graphics processing unit (GPU). This new GPU-based dose engine is built on our previously published ultrafast FSPB computational framework (Gu et al 2009 Phys. Med. Biol. 54 6287-97). Dosimetric evaluations against Monte Carlo dose calculations are conducted on ten IMRT treatment plans (five head-and-neck cases and five lung cases). For all cases, there is improvement with the 3D-density correction over the conventional FSPB algorithm, and for most cases the improvement is significant. Regarding efficiency, because of the appropriate arrangement of memory access and the use of GPU intrinsic functions, the dose calculation for an IMRT plan can be accomplished well within 1 s (except for one case) with this new GPU-based FSPB algorithm. Compared to the previous GPU-based FSPB algorithm without 3D-density correction, this new algorithm, though slightly sacrificing computational efficiency (∼5-15% lower), has significantly improved the dose calculation accuracy, making it more suitable for online IMRT replanning.
Magnetic resonance examinations at two Tesla

International Nuclear Information System (INIS)

Grabbe, E.; Maas, R.; Heller, M.; Denkhaus, H.; Buecheler, E.

1986-01-01

After having used a 2 Tesla prototype whole body scanner for about one and a half years, it is now possible to comment on the clinical value of high field strengths. The methods and techniques employed are described. The problems arising from high field strengths are discussed and their effect on clinical diagnosis is indicated. (orig.)
Note: Tesla based pulse generator for electrical breakdown study of liquid dielectrics

Science.gov (United States)

Veda Prakash, G.; Kumar, R.; Patel, J.; Saurabh, K.; Shyam, A.

2013-12-01

In the process of studying charge holding capability and delay time for breakdown in liquids at nanosecond (ns) time scales, a Tesla based pulse generator has been developed. The pulse generator is a combination of a Tesla transformer, a pulse forming line, a fast closing switch, and a test chamber. Using a Tesla transformer instead of conventional Marx generators makes the pulse generator very compact and cost effective, and reduces maintenance. The system has been designed and developed to deliver a maximum output voltage of 300 kV and a rise time of the order of tens of nanoseconds. The paper deals with the system design parameters, breakdown test procedure, and various experimental results. To validate the pulse generator performance, experimental results have been compared with PSPICE simulation results and are in good agreement.

A GPU OpenCL based cross-platform Monte Carlo dose calculation engine (goMC)

Science.gov (United States)

Tian, Zhen; Shi, Feng; Folkerts, Michael; Qin, Nan; Jiang, Steve B.; Jia, Xun

2015-09-01

Monte Carlo (MC) simulation has been recognized as the most accurate dose calculation method for radiotherapy. However, the extremely long computation time impedes its clinical application. Recently, a lot of effort has been made to realize fast MC dose calculation on graphic processing units (GPUs). However, most of the GPU-based MC dose engines have been developed under NVidia’s CUDA environment. This limits the code portability to other platforms, hindering the introduction of GPU-based MC simulations to clinical practice. The objective of this paper is to develop a GPU OpenCL based cross-platform MC dose engine named goMC with coupled photon-electron simulation for external photon and electron radiotherapy in the MeV energy range. Compared to our previously developed GPU-based MC code named gDPM (Jia et al 2012 Phys. Med. Biol. 57 7783-97), goMC has two major differences. First, it was developed under the OpenCL environment for high code portability and hence could be run not only on different GPU cards but also on CPU platforms. Second, we adopted the electron transport model used in EGSnrc MC package and PENELOPE’s random hinge method in our new dose engine, instead of the dose planning method employed in gDPM. Dose distributions were calculated for a 15 MeV electron beam and a 6 MV photon beam in a homogenous water phantom, a water-bone-lung-water slab phantom and a half-slab phantom. Satisfactory agreement between the two MC dose engines goMC and gDPM was observed in all cases. The average dose differences in the regions that received a dose higher than 10% of the maximum dose were 0.48-0.53% for the electron beam cases and 0.15-0.17% for the photon beam cases. In terms of efficiency, goMC was ~4-16% slower than gDPM when running on the same NVidia TITAN card for all the cases we tested, due to both the different electron transport models and the different development environments. The code portability of our new dose engine goMC was validated by
A GPU OpenCL based cross-platform Monte Carlo dose calculation engine (goMC).

Science.gov (United States)

Tian, Zhen; Shi, Feng; Folkerts, Michael; Qin, Nan; Jiang, Steve B; Jia, Xun

2015-10-07

Monte Carlo (MC) simulation has been recognized as the most accurate dose calculation method for radiotherapy. However, the extremely long computation time impedes its clinical application. Recently, a lot of effort has been made to realize fast MC dose calculation on graphic processing units (GPUs). However, most of the GPU-based MC dose engines have been developed under NVidia's CUDA environment. This limits the code portability to other platforms, hindering the introduction of GPU-based MC simulations to clinical practice. The objective of this paper is to develop a GPU OpenCL based cross-platform MC dose engine named goMC with coupled photon-electron simulation for external photon and electron radiotherapy in the MeV energy range. Compared to our previously developed GPU-based MC code named gDPM (Jia et al 2012 Phys. Med. Biol. 57 7783-97), goMC has two major differences. First, it was developed under the OpenCL environment for high code portability and hence could be run not only on different GPU cards but also on CPU platforms. Second, we adopted the electron transport model used in EGSnrc MC package and PENELOPE's random hinge method in our new dose engine, instead of the dose planning method employed in gDPM. Dose distributions were calculated for a 15 MeV electron beam and a 6 MV photon beam in a homogenous water phantom, a water-bone-lung-water slab phantom and a half-slab phantom. Satisfactory agreement between the two MC dose engines goMC and gDPM was observed in all cases. The average dose differences in the regions that received a dose higher than 10% of the maximum dose were 0.48-0.53% for the electron beam cases and 0.15-0.17% for the photon beam cases. In terms of efficiency, goMC was ~4-16% slower than gDPM when running on the same NVidia TITAN card for all the cases we tested, due to both the different electron transport models and the different development environments. The code portability of our new dose engine goMC was validated by
A GPU OpenCL based cross-platform Monte Carlo dose calculation engine (goMC)

International Nuclear Information System (INIS)

Tian, Zhen; Shi, Feng; Folkerts, Michael; Qin, Nan; Jiang, Steve B; Jia, Xun

2015-01-01

Monte Carlo (MC) simulation has been recognized as the most accurate dose calculation method for radiotherapy. However, the extremely long computation time impedes its clinical application. Recently, a lot of effort has been made to realize fast MC dose calculation on graphic processing units (GPUs). However, most of the GPU-based MC dose engines have been developed under NVidia’s CUDA environment. This limits the code portability to other platforms, hindering the introduction of GPU-based MC simulations to clinical practice. The objective of this paper is to develop a GPU OpenCL based cross-platform MC dose engine named goMC with coupled photon–electron simulation for external photon and electron radiotherapy in the MeV energy range. Compared to our previously developed GPU-based MC code named gDPM (Jia et al 2012 Phys. Med. Biol. 57 7783–97), goMC has two major differences. First, it was developed under the OpenCL environment for high code portability and hence could be run not only on different GPU cards but also on CPU platforms. Second, we adopted the electron transport model used in EGSnrc MC package and PENELOPE’s random hinge method in our new dose engine, instead of the dose planning method employed in gDPM. Dose distributions were calculated for a 15 MeV electron beam and a 6 MV photon beam in a homogenous water phantom, a water-bone-lung-water slab phantom and a half-slab phantom. Satisfactory agreement between the two MC dose engines goMC and gDPM was observed in all cases. The average dose differences in the regions that received a dose higher than 10% of the maximum dose were 0.48–0.53% for the electron beam cases and 0.15–0.17% for the photon beam cases. In terms of efficiency, goMC was ∼4–16% slower than gDPM when running on the same NVidia TITAN card for all the cases we tested, due to both the different electron transport models and the different development environments. The code portability of our new dose engine goMC was
Accelerating Multiple Compound Comparison Using LINGO-Based Load-Balancing Strategies on Multi-GPUs.

Science.gov (United States)

Lin, Chun-Yuan; Wang, Chung-Hung; Hung, Che-Lun; Lin, Yu-Shiang

2015-01-01

Compound comparison is an important task in computational chemistry. From the comparison results, potential inhibitors can be found and then used in pharmacy experiments. The time complexity of a pairwise compound comparison is $O(n^2)$, where n is the maximal length of compounds. In general, the length of compounds is tens to hundreds, and the computation time is small. However, more and more compounds have now been synthesized and extracted, even more than tens of millions. Therefore, comparison against a large number of compounds (seen as a multiple compound comparison problem, abbreviated to MCC) is still time-consuming. The intrinsic time complexity of the MCC problem is $O(k^2 n^2)$ with k compounds of maximal length n. In this paper, we propose a GPU-based algorithm for the MCC problem, called CUDA-MCC, on single and multiple GPUs. Four LINGO-based load-balancing strategies are considered in CUDA-MCC in order to accelerate the computation speed among thread blocks on GPUs. CUDA-MCC was implemented in C+OpenMP+CUDA. In the experiments, CUDA-MCC ran 45 times and 391 times faster than its CPU version on a single NVIDIA Tesla K20m GPU card and on dual NVIDIA Tesla K20m GPU cards, respectively.
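The following sketch illustrates the shape of the multiple compound comparison problem: every compound string is broken into overlapping substrings of length q (the idea behind LINGO), and all k(k-1)/2 pairs are scored. It is a plain CPU illustration and not CUDA-MCC. The scoring used here is a multiset Tanimoto ratio over q-grams, which is an assumption of this example and may differ from the exact LINGO formula used by the authors; the function names and sample strings are also hypothetical.

```python
from collections import Counter
from itertools import combinations


def lingos(smiles, q=4):
    """Count the overlapping length-q substrings of a compound string."""
    return Counter(smiles[i:i + q] for i in range(len(smiles) - q + 1))


def similarity(a, b, q=4):
    """Multiset Tanimoto ratio between the q-gram profiles of a and b."""
    profile_a, profile_b = lingos(a, q), lingos(b, q)
    shared = sum((profile_a & profile_b).values())   # element-wise minimum
    total = sum((profile_a | profile_b).values())    # element-wise maximum
    return shared / total if total else 0.0


def compare_all(compounds, q=4):
    """Score every unordered pair: k * (k - 1) / 2 comparisons for k compounds."""
    return {(i, j): similarity(compounds[i], compounds[j], q)
            for i, j in combinations(range(len(compounds)), 2)}


# Hypothetical sample strings in SMILES-like notation.
sample = ["CCCCO", "CCCCN", "c1ccccc1"]
scores = compare_all(sample)
```

Because the number of pairs grows quadratically with k, and compounds differ in length, the per-pair cost is uneven, which is the situation the load-balancing strategies in the abstract are meant to address.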
The success of the 11-Tesla project and its potential beyond particle physics

CERN Multimedia

CERN Bulletin

2013-01-01

On 7 March, the 1-metre-long single-aperture dipole model magnet under testing at Fermilab reached a current of 12.54 kA corresponding to a bore field of 11.5 Tesla, thus surpassing the goal set for the 11 T dipole project. [Figure: computer-generated model of the FNAL 1 metre 11 T dipole model magnet and a pair of CERN coils. Image courtesy of Don Mitchell, FNAL.] The 11-Tesla dipole project originated from a proposal made by High Luminosity LHC project coordinator, Lucio Rossi, in September 2010. To cope with the increasing amount of debris hitting the magnets when increasing the number of collisions produced by the LHC, he suggested replacing a few 8-Tesla dipole magnets in the LHC tunnel with shorter, stronger 11-Tesla magnets in order to create enough space to install additional collimators. The only way to achieve this goal is to use advanced niobium-tin technology. Rossi’s proposal aligned well with the goals of Fermilab’s High-Field Magnet R&D programme, which aims t...
Cortical microinfarcts detected in vivo on 3 Tesla MRI: clinical and radiological correlates.

Science.gov (United States)

van Dalen, Jan Willem; Scuric, Eva E M; van Veluw, Susanne J; Caan, Matthan W A; Nederveen, Aart J; Biessels, Geert Jan; van Gool, Willem A; Richard, Edo

2015-01-01

Cortical microinfarcts (CMIs) are a common postmortem finding associated with vascular risk factors, cognitive decline, and dementia. Recently, CMIs identified in vivo on 7 Tesla MRI also proved retraceable on 3 Tesla MRI. We evaluated CMIs on 3 Tesla MRI in a population-based cohort of 194 nondemented older people (72-80 years) with systolic hypertension. Using a case-control design, participants with and without CMIs were compared on age, sex, cardiovascular risk factors, and white matter hyperintensity volume. We identified 23 CMIs in 12 participants (6%). CMIs were associated with older age, higher diastolic blood pressure, and a history of recent stroke. There was a trend for a higher white matter hyperintensity volume in participants with CMIs. We found an association of CMIs with clinical parameters, including age and cardiovascular risk factors. Although the prevalence of CMIs is relatively low, our results suggest that the study of CMIs in larger clinical studies is possible using 3 Tesla MRI. This opens the possibility of large-scale prospective investigation of the clinical relevance of CMIs in older people. © 2014 American Heart Association, Inc.
Prospects for 6 to 10 tesla magnets for a TEVATRON upgrade

International Nuclear Information System (INIS)

Mantsch, Paul M.

1988-01-01

The first SSC physics is at least 10 years away. An upgrade of the Fermilab Tevatron will ensure the continuity of a vigorous high-energy physics program until the SSC turns on. Three basic proposals are under consideration: p̄p at 3 × 10^31: increase luminosity by improvements to the p source; pp at 1 TeV and 2 × 10^32: move the main ring to a new tunnel and build a second Tevatron ring; and p̄p at > 1.5 TeV and 7 × 10^30: replace the Tevatron with a higher energy ring. The last two options require about a hundred 6.6-tesla dipoles in addition to a ring of Tevatron strength (4.4 T) magnets. These higher-field magnets are necessary in both rings to lengthen the straight sections in order to realize the collision optics. The third option requires a ring of magnets of 6.6 T or slightly higher to replace the present Tevatron plus a number of special 8-9 tesla magnets. The viability of the high-energy option then depends on the practicality of sizable numbers of reliable 8-9 tesla dipoles as well as 800 6.6-tesla dipoles. The following develops a specification for an 8.8 T dipole, examines the design considerations and reviews the current state of high-field magnet development. 22 figs., 3 tabs.
Status and outlook for high power processing of 1.3 GHz TESLA multicell cavities

International Nuclear Information System (INIS)

Kirchgessner, J.; Barnes, P.; Graber, J.; Metzger, D.; Mofat, D.; Muller, H.; Padamsee, H.; Sears, J.; Tigner, M.; Matheisen, A.

1993-01-01

In order to increase the usable accelerating gradient in superconducting TESLA cavities, the field emission threshold barrier must be raised. As has been previously demonstrated on S-band cavities, a way to accomplish this is with the use of high peak power RF processing. A transmitter with a peak power of 2 MW and a pulse length of 300 μs has been assembled and used to process TESLA cavities. Several five-cell TESLA cavities at 1.3 GHz have been manufactured for this purpose. The transmitter and the cavities will be described and the results of the tests will be presented.
Conceptual design of a 20 Tesla pulsed solenoid for a laser solenoid fusion reactor

International Nuclear Information System (INIS)

Nolan, J.J.; Averill, R.J.

1977-01-01

Design considerations are described for a strip wound solenoid which is pulsed to 20 tesla while immersed in a 20 tesla bias field so as to achieve within the bore of the pulsed solenoid a net field sequence starting at 20 tesla and going first down to zero, then up to 40 tesla, and finally back to 20 tesla in a period of about 5 × 10^-3 seconds. The important parameters of the solenoid, e.g., aperture, build, turns, stored and dissipated energy, field intensity and powering circuit, are given. A numerical example for a specific design is presented. Mechanical stresses in the solenoid and the subsequent choice of materials for coil construction are discussed. Although several possible design difficulties are not discussed in this preliminary report of a conceptual magnet design, such as uniformity of field, long-term stability of insulation under neutron bombardment and choice of structural materials of appropriate tensile strength and elasticity to withstand magnetic forces developed, these questions are addressed in detail in the complete design report and in part in reference one. Furthermore, the authors feel that the problems encountered in this conceptual design are surmountable and are not a hindrance to the construction of such a magnet system.
NIKOLA TESLA AND MEDICINE: 160TH ANNIVERSARY OF THE BIRTH OF THE GENIUS WHO GAVE LIGHT TO THE WORLD - PART I.

Science.gov (United States)

Vucevic, Danijela; Dordevic, Drago; Radosavljevic, Tatjana

2016-09-01

The interest in Nikola Tesla, a scientist, physicist, engineer and inventor, is constantly growing. In the millennia-long history of human civilization, it is almost impossible to find another person whose life and work has been under so much scrutiny from such a wide range of researchers, medical professionals included. Although Tesla was not primarily dedicated to biomedical research, his work significantly contributed to the development of radiology and high frequency electrotherapy. This paper deals with the impact of Tesla's work on the development of a new medical branch - radiology. Nikola Tesla and the Discovery of X-ray radiation. Tesla pioneered the use of X-rays for medical purposes, practically laying the foundations of radiology. Namely, since 1887, Tesla periodically experimented with X-rays, at that time still unknown and unnamed, which he called "shadowgraphs". Moreover, at the end of 1894, he conducted extensive research focusing on X-rays, but unfortunately it was interrupted when a fire burned down his laboratory in 1895. In 1896 and 1897, Tesla published ten papers on the biologic effects of X-ray radiation. All his studies on X-rays were experimental. During 1896 and 1897, Tesla continued improving X-ray devices. Apart from this, Tesla was the first to point out the harmful effects of exposure to X-ray radiation on the human body. Nikola Tesla was a visionary genius of the future. Tesla's pioneering steps, made more than a century ago in the domain of radiology, are still being used today.
Diagnosis of rotator cuff tears using 3-Tesla MRI versus 3-Tesla MRA: a systematic review and meta-analysis.

Science.gov (United States)

McGarvey, Ciaran; Harb, Ziad; Smith, Christian; Houghton, Russell; Corbett, Steven; Ajuied, Adil

2016-02-01

To compare the diagnostic accuracy of magnetic resonance imaging (MRI), 2-dimensional magnetic resonance arthrogram (MRA) and 3-dimensional isotropic MRA in the diagnosis of rotator cuff tears when performed exclusively at 3-T. A systematic review was undertaken of the Cochrane, MEDLINE and PubMed databases in accordance with the PRISMA guidelines. Studies comparing 3-T MRI or 3-T MRA (index tests) to arthroscopic surgical findings (reference test) were included. Methodological appraisal was performed using QUADAS 2. Pooled sensitivity and specificity were calculated and summary receiver-operating curves generated. Kappa coefficients quantified inter-observer reliability. Fourteen studies comprising 1332 patients were identified for inclusion. Twelve studies were retrospective and there were concerns regarding index test bias and applicability in nine and six studies respectively. Reference test bias was a concern in all studies. Both 3-T MRI and 3-T MRA showed similar excellent diagnostic accuracy for full-thickness supraspinatus tears. Concerning partial-thickness supraspinatus tears, 3-T 2D MRA was significantly more sensitive (86.6 vs. 80.5 %, p = 0.014) but significantly less specific (95.2 vs. 100 %, p Tesla 3D isotropic MRA showed similar accuracy to 3-T conventional 2D MRA. Three-Tesla MRI appeared equivalent to 3-T MRA in the diagnosis of full- and partial-thickness tears, although there was a trend towards greater accuracy in the diagnosis of subscapularis tears with 3-T MRA. Three-Tesla 3D isotropic MRA appears equivalent to 3-T 2D MRA for all types of tears.
4.5 Tesla magnetic field reduces range of high-energy positrons -- Potential implications for positron emission tomography

International Nuclear Information System (INIS)

Wirrwar, A.; Vosberg, H.; Herzog, H.; Halling, H.; Weber, S.; Mueller-Gaertner, H.W.; Forschungszentrum Juelich GmbH

1997-01-01

The authors have theoretically and experimentally investigated the extent to which homogeneous magnetic fields up to 7 Tesla reduce the spatial distance positrons travel before annihilation (positron range). Computer simulations of a noncoincident detector design using a Monte Carlo algorithm calculated the positron range as a function of positron energy and magnetic field strength. The simulation predicted improvements in resolution, defined as full-width at half-maximum (FWHM) of the line-spread function (LSF) for a magnetic field strength up to 7 Tesla: negligible for F-18, from 3.35 mm to 2.73 mm for Ga-68 and from 3.66 mm to 2.68 mm for Rb-82. Also a substantial noise suppression was observed, described by the full-width at tenth-maximum (FWTM) for higher positron energies. The experimental approach confirmed an improvement in resolution for Ga-68 from 3.54 mm at 0 Tesla to 2.99 mm FWHM at 4.5 Tesla and practically no improvement for F-18 (2.97 mm at 0 Tesla and 2.95 mm at 4.5 Tesla). It is concluded that the simulation model is appropriate and that a homogeneous static magnetic field of 4.5 Tesla reduces the range of high-energy positrons to an extent that may improve spatial resolution in positron emission tomography
Efficient parallel implementation of active appearance model fitting algorithm on GPU.

Science.gov (United States)

Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou

2014-01-01

The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods and has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine-grained parallelism, in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experimental results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures.
GPU acceleration of Dock6's Amber scoring computation.

Science.gov (United States)

Yang, Hailong; Zhou, Qiongqiong; Li, Bo; Wang, Yongjian; Luan, Zhongzhi; Qian, Depei; Li, Hanlu

2010-01-01

Addressing the problem of virtual screening is a long-term goal in the drug discovery field, which, if properly solved, can significantly shorten the R&D cycle of new drugs. The scoring functionality that evaluates the fitness of the docking result is one of the major challenges in virtual screening. In general, scoring functionality in docking requires a large amount of floating-point calculations, which usually take several weeks or even months to finish. This time-consuming procedure is unacceptable, especially when highly fatal and infectious viruses such as SARS and H1N1 arise, which forces the scoring task to be done in a limited time. This paper presents how to leverage the computational power of the GPU to accelerate the Amber (J. Comput. Chem. 25: 1157-1174, 2004) scoring of Dock6 (http://dock.compbio.ucsf.edu/DOCK_6/) with the NVIDIA CUDA (Compute Unified Device Architecture) platform (NVIDIA Corporation Technical Staff, Compute Unified Device Architecture - Programming Guide, NVIDIA Corporation, 2008). We also discuss many factors that greatly influence the performance after porting the Amber scoring to the GPU, including thread management, data transfer, and divergence hiding. Our experiments show that the GPU-accelerated Amber scoring achieves a 6.5× speedup with respect to the original version running on an AMD dual-core CPU for the same problem size. This acceleration makes the Amber scoring more competitive and efficient for large-scale virtual screening problems.
Nikola Tesla and medicine: 160th anniversary of the birth of the genius who gave light to the world - Part II

OpenAIRE

Vučević, Danijela; Đorđević, Drago; Radosavljević, Tatjana

2016-01-01

Introduction. Nikola Tesla (1856 - 1943) was a genius inventor and scientist, whose contribution to medicine is remarkable. Part I of this article reviewed special contributions of the world renowned scientist to the establishment of radiology as a new discipline in medicine. This paper deals with the use of Tesla currents in medicine. Tesla Currents in Medicine. Tesla's greatest impact on medicine is his invention of a transformer (Tesla coil) for producing high frequency and high voltage cu...
Novel techniques for 7 tesla breast MRI

NARCIS (Netherlands)

van der Velden, T.A.

2017-01-01

This thesis introduced several new techniques to the field of 7 tesla breast MRI, enabling high field multi-parametric MR imaging and, potentially, patient specific treatment planning. Chapter 2 described the development of a RF coil setup for bilateral breast MR imaging and 31P spectroscopy. This
The Research and Test of Fast Radio Burst Real-time Search Algorithm Based on GPU Acceleration

Science.gov (United States)

Wang, J.; Chen, M. Z.; Pei, X.; Wang, Z. Q.

2017-03-01

To meet the research needs of the Nanshan 25 m radio telescope of Xinjiang Astronomical Observatory (XAO) and to study key technology for the planned QiTai radio Telescope (QTT), the receiver group of XAO studied a GPU (Graphics Processing Unit) based real-time FRB search algorithm, developed from the original CPU (Central Processing Unit) based FRB search algorithm, and built a real-time FRB search system. A comparison of the GPU system and the CPU system shows that, while preserving search accuracy, the GPU-accelerated algorithm is 35-45 times faster than the CPU algorithm.
MODERN ELECTRIC CARS OF TESLA MOTORS COMPANY

Directory of Open Access Journals (Sweden)

O. F. Vynakov

2016-08-01

This overview article shows the advantages of a modern electric car compared with internal combustion cars, using the electric vehicles of Tesla Motors Company as an example. It describes the history of the firm and provides the technical and tactical characteristics of three modifications of electric vehicles produced by Tesla Motors. Modern electric cars are no less powerful than cars with combustion engines in both speed and acceleration. They are reliable, economical and safe in operation. Every year the maximum range of an electric car increases and its battery charging time decreases. To address environmental safety, the governments of most countries are trying to encourage people to switch to electric cars through subsidy programs, lending and tax exemptions. Therefore, the advent of the electric vehicle in all major cities of the world is inevitable.
Diagnostic usefulness of 3 tesla MRI of the brain for cushing disease in a child.

Science.gov (United States)

Ono, Erina; Ozawa, Ayako; Matoba, Kaori; Motoki, Takanori; Tajima, Asako; Miyata, Ichiro; Ito, Junko; Inoshita, Naoko; Yamada, Syozo; Ida, Hiroyuki

2011-10-01

It is sometimes difficult to confirm the location of a microadenoma in Cushing disease. Recently, we experienced an 11-yr-old female case of Cushing disease with hyperprolactinemia. She was referred to our hospital because of decreased height velocity with body weight gain. On admission, she had typical symptoms of Cushing syndrome. Although no pituitary microadenomas were detected on 1.5 Tesla MRI of the brain, endocrinological examinations including IPS and CS sampling were consistent with Cushing disease with hyperprolactinemia. Oral administration of metyrapone instead of neurosurgery was started after discharge, but subsequent 3 Tesla MRI of the brain clearly demonstrated a 3-mm less-enhanced lesion in the left side of the pituitary gland. Finally, transsphenoidal surgery was performed, and a 3.5-mm left-sided microadenoma was resected. Compared with 1.5 Tesla MRI, 3 Tesla MRI offers the advantage of a higher signal to noise ratio (SNR), which provides higher resolution and proper image quality. Therefore, 3 Tesla MRI is a very useful tool to localize microadenomas in Cushing disease in children as well as in adults. It will be the first choice of radiological examinations in suspected cases of Cushing disease.
76 FR 33402 - Tesla Motors, Inc.; Receipt of Petition for Renewal of Temporary Exemption from the Advanced Air...

Science.gov (United States)

2011-06-08

...-0070] Tesla Motors, Inc.; Receipt of Petition for Renewal of Temporary Exemption from the Advanced Air... Protection. SUMMARY: In accordance with the procedures in 49 CFR Part 555, Tesla Motors, Inc., has petitioned... Petition In accordance with 49 U.S.C. 30113 and the procedures in 49 CFR Part 555, Tesla Motors, Inc...

Multi-GPU Development of a Neural Networks Based Reconstructor for Adaptive Optics

Directory of Open Access Journals (Sweden)

Carlos González-Gutiérrez

2018-01-01

Aberrations introduced by atmospheric turbulence in large telescopes are compensated using adaptive optics systems, where the use of deformable mirrors and multiple sensors relies on complex control systems. Recently, the development of larger telescopes such as the E-ELT or TMT has created a computational challenge due to the increasing complexity of the new adaptive optics systems. The Complex Atmospheric Reconstructor based on Machine Learning (CARMEN) is an algorithm based on artificial neural networks, designed to compensate for atmospheric turbulence. In recent years, GPUs have proved to be a great solution for speeding up the learning process of neural networks, and different frameworks have been created to ease their development. The implementation of CARMEN in different multi-GPU frameworks is presented in this paper, along with its development in CUDA, a language originally developed for GPUs. The CUDA implementation offers the best response for all the presented cases, although the advantage of using more than one GPU appears only in large networks.
A GPU Implementation of Local Search Operators for Symmetric Travelling Salesman Problem

Directory of Open Access Journals (Sweden)

Juraj Fosin

2013-06-01

The Travelling Salesman Problem (TSP) is one of the most studied combinatorial optimization problems and is significant in many practical transportation applications. The TSP is an NP-hard problem and requires large computation power to be solved by exact algorithms. In the past few years, the fast development of general-purpose Graphics Processing Units (GPUs) has brought huge improvements in decreasing applications' execution time. In this paper, we implement 2-opt and 3-opt local search operators for solving the TSP on the GPU using CUDA. The novelty presented in this paper is a new parallel iterated local search approach with 2-opt and 3-opt operators for the symmetric TSP, optimized for execution on GPUs. With our implementation, large TSP problems (up to 85,900 cities) can be solved using the GPU. We will show that our GPU implementation can be up to 20x faster without losing quality for all TSPlib problems as well as for our CRO TSP problem.
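For readers unfamiliar with the 2-opt operator named in this abstract, the following is a minimal sequential sketch in Python. It is not the authors' CUDA implementation and does not reproduce their parallel iterated local search; the function names `tour_length` and `two_opt` and the small test instance are illustrative assumptions. A 2-opt move removes two tour edges and reconnects the tour by reversing the segment between them, and is accepted only when it shortens the tour.

```python
import math

def tour_length(points, tour):
    """Total length of the closed tour visiting points in the given order."""
    n = len(tour)
    return sum(math.dist(points[tour[i]], points[tour[(i + 1) % n]])
               for i in range(n))

def two_opt(points, tour):
    """Apply improving 2-opt moves until no move shortens the tour."""
    tour = list(tour)
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                if i == 0 and j == n - 1:
                    continue  # the two edges share a city; nothing to swap
                a, b = points[tour[i]], points[tour[i + 1]]
                c, d = points[tour[j]], points[tour[(j + 1) % n]]
                # Change in length if edges (a,b),(c,d) become (a,c),(b,d).
                delta = (math.dist(a, c) + math.dist(b, d)
                         - math.dist(a, b) - math.dist(c, d))
                if delta < -1e-12:
                    # Reversing the segment between the edges reconnects the tour.
                    tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])
                    improved = True
    return tour
```

On a GPU, the evaluation of `delta` for the many (i, j) pairs is the part that lends itself to one-thread-per-pair parallelism; how the paper actually maps this work to threads is not stated in the abstract.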
Operator training and requalification at GPU Nuclear

International Nuclear Information System (INIS)

Long, R.L.; Barrett, R.J.; Newton, S.L.

1982-01-01

The operator training and requalification programs at GPU Nuclear's Oyster Creek (650 MWe BWR) and Three Mile Island-1 (776 MWe PWR) nuclear plants have undergone significant revisions since the Three Mile Island-2 accident. This paper describes the Training and Education organization, the expanded training facilities, including basic principle trainers and replica simulators, and the present operator training and requalification programs
Accelerating Dense Linear Algebra on the GPU

DEFF Research Database (Denmark)

Sørensen, Hans Henrik Brandenborg

and matrix-vector operations on GPUs. Such operations form the backbone of level 1 and level 2 routines in the Basic Linear Algebra Subroutines (BLAS) library and are therefore of great importance in many scientific applications. The target hardware is the most recent NVIDIA Tesla 20-series (Fermi...
Simulation Model solves exact the Enigma named Generating high Voltages and high Frequencies by Tesla Coil

OpenAIRE

Simo Janjanin

2016-01-01

A simulation model of the Tesla coil has been successfully completed, and its procedure and functioning have been verified. The literature and documentation for the model were taken from rich sources, especially copies of Tesla's patents. The electrical scheme of the oscillating system consists of the 220 V / 50 Hz voltage supply, an Fe transformer, a capacitor and associated chosen electrical components, the air gap in the primary Tesla coil (air transformer) and the spark gap at the exit of the coil. The inves...
MEASUREMENT OF THE TRANSVERSE BEAM DYNAMICS IN A TESLA-TYPE SUPERCONDUCTING CAVITY

Energy Technology Data Exchange (ETDEWEB)

Halavanau, A. [NICADD, DeKalb; Eddy, N. [Fermilab; Edstrom, D. [Fermilab; Lunin, A. [Fermilab; Piot, P. [NICADD, DeKalb; Ruan, J. [Fermilab; Solyak, N. [Fermilab

2016-09-26

Superconducting linacs are capable of producing intense, ultra-stable, high-quality electron beams that have widespread applications in science and industry. Many projects are based on the 1.3-GHz TESLA-type superconducting cavity. In this paper we provide an update on a recent experiment aimed at measuring the transfer matrix of a TESLA cavity at the Fermilab Accelerator Science and Technology (FAST) facility. The results are discussed and compared with analytical and numerical simulations.
Fast GPU-based spot extraction for energy-dispersive X-ray Laue diffraction

International Nuclear Information System (INIS)

Alghabi, F.; Schipper, U.; Kolb, A.; Send, S.; Abboud, A.; Pashniak, N.; Pietsch, U.

2014-01-01

This paper describes a novel method for fast online analysis of X-ray Laue spots taken by means of an energy-dispersive X-ray 2D detector. Current pnCCD detectors typically operate at some 100 Hz (up to a maximum of 400 Hz) and have a resolution of 384 × 384 pixels; future devices are heading for even higher pixel counts and frame rates. The proposed online data analysis is based on a computer utilizing multiple Graphics Processing Units (GPUs), which allow for fast and parallel data processing. Our multi-GPU based algorithm is compliant with the rules of stream-based data processing, for which GPUs are optimized. The paper's main contribution is therefore an alternative algorithm for the determination of spot positions and energies over the full sequence of pnCCD data frames. Furthermore, an improved background suppression algorithm is presented. The resulting system is able to process data at the maximum acquisition rate of 400 Hz. We present a detailed analysis of the spot positions and energies deduced from a prior (single-core) CPU-based and the novel GPU-based data processing, showing that the results computed in parallel by the GPU implementation are at least of the same quality as the prior CPU-based results. Furthermore, the GPU-based algorithm speeds up the data processing by a factor of 7 (in comparison to the single-core CPU-based algorithm), which effectively makes the detector system more suitable for online data processing.
Sop-GPU: accelerating biomolecular simulations in the centisecond timescale using graphics processors.

Science.gov (United States)

Zhmurov, A; Dima, R I; Kholodov, Y; Barsegov, V

2010-11-01

Theoretical exploration of fundamental biological processes involving the forced unraveling of multimeric proteins, the sliding motion in protein fibers and the mechanical deformation of biomolecular assemblies under physiological force loads is challenging even for distributed computing systems. Using a Cα-based coarse-grained self organized polymer (SOP) model, we implemented the Langevin simulations of proteins on graphics processing units (SOP-GPU program). We assessed the computational performance of an end-to-end application of the program, where all the steps of the algorithm are running on a GPU, by profiling the simulation time and memory usage for a number of test systems. The ∼90-fold computational speedup on a GPU, compared with an optimized central processing unit program, enabled us to follow the dynamics in the centisecond timescale, and to obtain the force-extension profiles using experimental pulling speeds (v_f = 1-10 μm/s) employed in atomic force microscopy and in optical tweezers-based dynamic force spectroscopy. We found that the mechanical molecular response critically depends on the conditions of force application and that the kinetics and pathways for unfolding change drastically even upon a modest 10-fold increase in v_f. This implies that, to resolve accurately the free energy landscape and to relate the results of single-molecule experiments in vitro and in silico, molecular simulations should be carried out under the experimentally relevant force loads. This can be accomplished in reasonable wall-clock time for biomolecules of size as large as 10^5 residues using the SOP-GPU package. © 2010 Wiley-Liss, Inc.
Optimization of materials for the parts that compose a Tesla turbine; Otimizacao de materiais para as partes que compoe uma turbina tipo Tesla

Energy Technology Data Exchange (ETDEWEB)

Rocha, Geovana Vilas Boas da, E-mail: geovana_dmp@yahoo.com.br [Universidade Federal de Sao Paulo (UNIFESP), Sao Jose dos Campos, SP (Brazil); Guimaraes, Lamartine N.F.; Placco, Guilherme M., E-mail: guimarae@ieav.cta.br, E-mail: placco@ieav.cta.br [Instituto de Estudos Avancados (IEAv), Sao Jose dos Campos, SP (Brazil). Divisao de Energia Nuclear

2013-07-01

The TERRA project (Tecnologia de Reatores Rapidos Avancados, Advanced Fast Reactor Technology) of the Aeronautica (Brazil) aims to develop the technologies necessary for the design of nuclear microreactors. These, in turn, are intended to meet the thermal and electrical needs of space vehicles. One of the activities of this project is to build a closed Rankine-type thermal cycle in order to test a Tesla-type turbine developed by the group. In this thermodynamic cycle, water is transformed into steam, which drives a turbine that in turn powers an alternator to generate electricity. The work presents a survey of the materials available on the national market for machining a Tesla-type turbine. The survey considered the characteristics and operating conditions of the specific thermal cycle of interest to the group. Results: cost-benefit tables for each part of the turbine, characteristics of each material, the machining process, and a comparison between a turbine model made of 304L stainless steel and a turbine made with the selected materials. The results of this study raised the level of sophistication of the research in the TERRA project, since the study of ideal materials for the parts of a Tesla-type turbine in a thermal cycle is unprecedented.
Tesla's coherent plasma discharge -and- a plan for megavolts at Megahertz

International Nuclear Information System (INIS)

Nichson, J.D.

1987-01-01

In his lecture on Experiments With Alternate Currents of High Potential and High Frequency before the Institute of Electrical Engineers in London (1892), Tesla reports a discharge through a partially evacuated air tube of 1 meter length and 1 inch diameter. It is characterized by the following properties: (1) The filamentary discharge may be locally displaced by a nearby dielectric body or a magnet. (2) When the filament is released, it demonstrates behaviour similar to that of a string which suspends a weight, including the formation of standing waves with distinct nodes. (3) Its decay time is on the order of 8 minutes. (4) The vibrating filament may be split with a magnet to produce two vibrating filaments. (5) This effect could only be formed with a dynamo-driven coil at low air pressures in the tube. The disruptive discharge coil (colloquially a Tesla Coil) failed to produce the effect with its superior voltage and frequency range. It is here proposed that this phenomenon is related to positive leader formation. A model for this, consistent for AC and DC discharges, is advanced. Also, a novel method for regulation of a nitrogen-filled spark gap will be proposed. It is hoped that this new device will produce smooth, uniform discharges from the Tesla Coil. This, if theory is correct on many points, will reproduce Tesla's coherent plasma at higher pressures in free-standing form, and will allow other novel effects.
Blocked inverted indices for exact clustering of large chemical spaces.

Science.gov (United States)

Thiel, Philipp; Sach-Peltason, Lisa; Ottmann, Christian; Kohlbacher, Oliver

2014-09-22

The calculation of pairwise compound similarities based on fingerprints is one of the fundamental tasks in chemoinformatics. Methods for efficient calculation of compound similarities are of the utmost importance for various applications like similarity searching or library clustering. With the increasing size of public compound databases, exact clustering of these databases is desirable, but often computationally prohibitively expensive. We present an optimized inverted index algorithm for the calculation of all pairwise similarities on 2D fingerprints of a given data set. In contrast to other algorithms, it neither requires GPU computing nor yields a stochastic approximation of the clustering. The algorithm has been designed to work well with multicore architectures and shows excellent parallel speedup. As an application example of this algorithm, we implemented a deterministic clustering application, which has been designed to decompose virtual libraries comprising tens of millions of compounds in a short time on current hardware. Our results show that our implementation achieves more than 400 million Tanimoto similarity calculations per second on a common desktop CPU. Deterministic clustering of the available chemical space thus can be done on modern multicore machines within a few days.
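To make the inverted-index idea in this abstract concrete, here is a small Python sketch under stated assumptions: each fingerprint is represented as a set of "on" bit positions, and the function name `pairwise_tanimoto` is hypothetical. It shows only the basic principle (counting shared bits through posting lists, then applying the Tanimoto formula) and does not reproduce the blocking or multicore optimizations the paper describes.

```python
from collections import defaultdict
from itertools import combinations

def pairwise_tanimoto(fingerprints):
    """Return {(i, j): similarity} for all pairs i < j sharing at least one bit.

    `fingerprints` is a list of sets of on-bit positions (an assumed encoding).
    """
    # Inverted index: bit position -> compounds that have this bit set.
    index = defaultdict(list)
    for cid, bits in enumerate(fingerprints):
        for bit in bits:
            index[bit].append(cid)

    # Count shared bits per pair by walking each posting list once.
    shared = defaultdict(int)
    for posting in index.values():
        for i, j in combinations(posting, 2):
            shared[(i, j)] += 1

    # Tanimoto = |A ∩ B| / (|A| + |B| - |A ∩ B|)
    return {
        (i, j): c / (len(fingerprints[i]) + len(fingerprints[j]) - c)
        for (i, j), c in shared.items()
    }
```

Pairs that share no bits never appear in any posting list, so they are skipped without being compared; their similarity is implicitly zero. This is exact rather than approximate, which matches the abstract's emphasis on deterministic results.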
Algorithms of GPU-enabled reactive force field (ReaxFF) molecular dynamics.

Science.gov (United States)

Zheng, Mo; Li, Xiaoxia; Guo, Li

2013-04-01

Reactive force field (ReaxFF), a recent and novel bond order potential, allows for reactive molecular dynamics (ReaxFF MD) simulations for modeling larger and more complex molecular systems involving chemical reactions when compared with computation-intensive quantum mechanical methods. However, ReaxFF MD can be approximately 10-50 times slower than classical MD due to its explicit modeling of bond forming and breaking, the dynamic charge equilibration at each time-step, and a time-step one order of magnitude smaller than in classical MD, all of which pose significant computational challenges for reaching spatio-temporal scales of nanometers and nanoseconds. The very recent advances of the graphics processing unit (GPU) provide not only highly favorable performance for GPU-enabled MD programs compared with CPU implementations but also an opportunity to cope with the heavy computing power and memory demands that ReaxFF MD places on computer hardware. In this paper, we present the algorithms of GMD-Reax, the first GPU-enabled ReaxFF MD program, with significantly improved performance surpassing CPU implementations on desktop workstations. The performance of GMD-Reax has been benchmarked on a PC equipped with an NVIDIA C2050 GPU for coal pyrolysis simulation systems with atoms ranging from 1378 to 27,283. GMD-Reax achieved speedups of up to 12 times over Duin et al.'s FORTRAN codes in LAMMPS on 8 CPU cores and up to 6 times over the LAMMPS C codes based on PuReMD, in terms of the simulation time per time-step averaged over 100 steps. GMD-Reax could be used as a new and efficient computational tool for exploring very complex molecular reactions via ReaxFF MD simulation on desktop workstations. Copyright © 2013 Elsevier Inc. All rights reserved.
Performance of Сellular Automata-based Stream Ciphers in GPU Implementation

Directory of Open Access Journals (Sweden)

P. G. Klyucharev

2016-01-01

Earlier, the author developed methods for building high-performance symmetric ciphers based on generalized cellular automata; the resulting encryption algorithms show extremely high performance in hardware implementations. However, their implementations on conventional microprocessors do not perform well. This is quite common and simply delimits the scope of applications for these ciphers. Nevertheless, graphics processors make it possible to achieve acceptable performance in a software implementation. This article extends a series of articles that study various aspects of constructing and implementing cryptographic algorithms based on generalized cellular automata. It aims to study the feasibility of implementing these algorithms on GPUs. The implemented encryption algorithm, which acts as a keystream generator, comprises 2k generalized cellular automata. The graphs of the cellular automata are Ramanujan graphs. The cells of the k output gamma (keystream) sequences are interleaved, which allows the GPU's capabilities to be used more fully. OpenCL was used for the implementation, as the most universal and platform-independent API. The software, written in C++, was designed so that the user can set various parameters, including the encryption key, the graph structure, the local communication function, various constants, etc. A variety of graphics processors were used for testing (NVIDIA GTX 650, NVIDIA GTX 770, AMD R9 280X). Depending on the operating conditions and the GPU used, performance ranges from 0.47 to 6.61 Gb/s, which is comparable to the performance of counterpart ciphers. Thus, the article demonstrates that using a GPU makes an efficient software implementation of stream ciphers based on generalized cellular automata possible. This work was supported by the RFBR, project No. 16-07-00542.
Beam dynamic issues in TESLA damping ring

International Nuclear Information System (INIS)

Shiltsev, V.

1996-05-01

In this paper we study general requirements on the impedances of the linear collider TESLA damping ring design. A quantitative analysis is performed for the 17-km-long "dog-bone" ring. Beam dynamics in the alternative options of 6.3- and 2.3-km-long damping rings are briefly discussed. 5 refs., 2 tabs
Diagnosis of rotator cuff tears using 3-Tesla MRI versus 3-Tesla MRA: a systematic review and meta-analysis

Energy Technology Data Exchange (ETDEWEB)

McGarvey, Ciaran; Harb, Ziad; Smith, Christian; Ajuied, Adil [Guy's and St Thomas' Hospital, King's Health Partners, Department of Trauma and Orthopaedics, London (United Kingdom); Houghton, Russell [Guy's and St Thomas' Hospital, King's Health Partners, Department of Radiology, London (United Kingdom); Corbett, Steven [Guy's and St Thomas' Hospital, King's Health Partners, Department of Trauma and Orthopaedics, London (United Kingdom); Fortius Clinic, London (United Kingdom)

2016-02-15

To compare the diagnostic accuracy of magnetic resonance imaging (MRI), 2-dimensional magnetic resonance arthrogram (MRA) and 3-dimensional isotropic MRA in the diagnosis of rotator cuff tears when performed exclusively at 3-T. A systematic review was undertaken of the Cochrane, MEDLINE and PubMed databases in accordance with the PRISMA guidelines. Studies comparing 3-T MRI or 3-T MRA (index tests) to arthroscopic surgical findings (reference test) were included. Methodological appraisal was performed using QUADAS 2. Pooled sensitivity and specificity were calculated and summary receiver-operating curves generated. Kappa coefficients quantified inter-observer reliability. Fourteen studies comprising 1332 patients were identified for inclusion. Twelve studies were retrospective and there were concerns regarding index test bias and applicability in nine and six studies respectively. Reference test bias was a concern in all studies. Both 3-T MRI and 3-T MRA showed similar excellent diagnostic accuracy for full-thickness supraspinatus tears. Concerning partial-thickness supraspinatus tears, 3-T 2D MRA was significantly more sensitive (86.6 vs. 80.5 %, p = 0.014) but significantly less specific (95.2 vs. 100 %, p < 0.001). There was a trend towards greater accuracy in the diagnosis of subscapularis tears with 3-T MRA. Three-Tesla 3D isotropic MRA showed similar accuracy to 3-T conventional 2D MRA. Three-Tesla MRI appeared equivalent to 3-T MRA in the diagnosis of full- and partial-thickness tears, although there was a trend towards greater accuracy in the diagnosis of subscapularis tears with 3-T MRA. Three-Tesla 3D isotropic MRA appears equivalent to 3-T 2D MRA for all types of tears. (orig.)
Diagnosis of rotator cuff tears using 3-Tesla MRI versus 3-Tesla MRA: a systematic review and meta-analysis

International Nuclear Information System (INIS)

McGarvey, Ciaran; Harb, Ziad; Smith, Christian; Ajuied, Adil; Houghton, Russell; Corbett, Steven

2016-01-01

To compare the diagnostic accuracy of magnetic resonance imaging (MRI), 2-dimensional magnetic resonance arthrogram (MRA) and 3-dimensional isotropic MRA in the diagnosis of rotator cuff tears when performed exclusively at 3-T. A systematic review was undertaken of the Cochrane, MEDLINE and PubMed databases in accordance with the PRISMA guidelines. Studies comparing 3-T MRI or 3-T MRA (index tests) to arthroscopic surgical findings (reference test) were included. Methodological appraisal was performed using QUADAS 2. Pooled sensitivity and specificity were calculated and summary receiver-operating curves generated. Kappa coefficients quantified inter-observer reliability. Fourteen studies comprising 1332 patients were identified for inclusion. Twelve studies were retrospective and there were concerns regarding index test bias and applicability in nine and six studies respectively. Reference test bias was a concern in all studies. Both 3-T MRI and 3-T MRA showed similar excellent diagnostic accuracy for full-thickness supraspinatus tears. Concerning partial-thickness supraspinatus tears, 3-T 2D MRA was significantly more sensitive (86.6 vs. 80.5 %, p = 0.014) but significantly less specific (95.2 vs. 100 %, p < 0.001). There was a trend towards greater accuracy in the diagnosis of subscapularis tears with 3-T MRA. Three-Tesla 3D isotropic MRA showed similar accuracy to 3-T conventional 2D MRA. Three-Tesla MRI appeared equivalent to 3-T MRA in the diagnosis of full- and partial-thickness tears, although there was a trend towards greater accuracy in the diagnosis of subscapularis tears with 3-T MRA. Three-Tesla 3D isotropic MRA appears equivalent to 3-T 2D MRA for all types of tears. (orig.)
Atomic orbital-based SOS-MP2 with tensor hypercontraction. I. GPU-based tensor construction and exploiting sparsity

Energy Technology Data Exchange (ETDEWEB)

Song, Chenchen; Martínez, Todd J. [Department of Chemistry and the PULSE Institute, Stanford University, Stanford, California 94305 (United States); SLAC National Accelerator Laboratory, Menlo Park, California 94025 (United States)

2016-05-07

We present a tensor hypercontracted (THC) scaled opposite spin second order Møller-Plesset perturbation theory (SOS-MP2) method. By using THC, we reduce the formal scaling of SOS-MP2 with respect to molecular size from quartic to cubic. We achieve further efficiency by exploiting sparsity in the atomic orbitals and using graphical processing units (GPUs) to accelerate integral construction and matrix multiplication. The practical scaling of GPU-accelerated atomic orbital-based THC-SOS-MP2 calculations is found to be N^2.6 for reference data sets of water clusters and alanine polypeptides containing up to 1600 basis functions. The errors in correlation energy with respect to density-fitting-SOS-MP2 are less than 0.5 kcal/mol for all systems tested (up to 162 atoms).
Atomic orbital-based SOS-MP2 with tensor hypercontraction. I. GPU-based tensor construction and exploiting sparsity.

Science.gov (United States)

Song, Chenchen; Martínez, Todd J

2016-05-07

We present a tensor hypercontracted (THC) scaled opposite spin second order Møller-Plesset perturbation theory (SOS-MP2) method. By using THC, we reduce the formal scaling of SOS-MP2 with respect to molecular size from quartic to cubic. We achieve further efficiency by exploiting sparsity in the atomic orbitals and using graphical processing units (GPUs) to accelerate integral construction and matrix multiplication. The practical scaling of GPU-accelerated atomic orbital-based THC-SOS-MP2 calculations is found to be N^2.6 for reference data sets of water clusters and alanine polypeptides containing up to 1600 basis functions. The errors in correlation energy with respect to density-fitting-SOS-MP2 are less than 0.5 kcal/mol for all systems tested (up to 162 atoms).
Diffusion-weighted whole-body MR imaging with background body signal suppression: a feasibility study at 3.0 Tesla

International Nuclear Information System (INIS)

Muertz, Petra; Krautmacher, Carsten; Traeber, Frank; Schild, Hans H.; Willinek, Winfried A.; Gieseke, Juergen

2007-01-01

The purpose was to provide a diffusion-weighted whole-body magnetic resonance (MR) imaging sequence with background body signal suppression (DWIBS) at 3.0 Tesla. A diffusion-weighted spin-echo echo-planar imaging sequence was combined with the following methods of fat suppression: short TI inversion recovery (STIR), spectral attenuated inversion recovery (SPAIR), and spectral presaturation by inversion recovery (SPIR). Optimized sequences were implemented on a 3.0- and a 1.5-Tesla system and evaluated in three healthy volunteers and six patients with various lesions in the neck, chest, and abdomen on the basis of reconstructed maximum intensity projection images. In one patient with metastases of malignant melanoma, DWIBS was compared with 18F-fluorodeoxyglucose positron emission tomography (FDG-PET). Good fat suppression for all regions and diagnostic image quality in all cases could be obtained at 3.0 Tesla with the STIR method. In comparison with 1.5 Tesla, DWIBS images at 3.0 Tesla were judged to provide a better lesion-to-bone tissue contrast. However, larger susceptibility-induced image distortions and signal intensity losses, stronger blurring artifacts, and more pronounced motion artifacts degraded the image quality at 3.0 Tesla. A good correlation was found between the metastases as depicted by DWIBS and those as visualized by FDG-PET. DWIBS is feasible at 3.0 Tesla with diagnostic image quality. (orig.)
Convolution of large 3D images on GPU and its decomposition

Science.gov (United States)

Karas, Pavel; Svoboda, David

2011-12-01

In this article, we propose a method for computing the convolution of large 3D images. The convolution is performed in the frequency domain using the convolution theorem. The algorithm is accelerated on a graphics card by means of the CUDA parallel computing model. The convolution is decomposed in the frequency domain using the decimation-in-frequency algorithm. We pay attention to keeping our approach efficient in terms of both time and memory consumption, and also in terms of memory transfers between CPU and GPU, which have a significant influence on the overall computation time. We also study the implementation on multiple GPUs and compare the results of the multi-GPU and multi-CPU implementations.
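The decomposition described in the abstract above can be sketched in one dimension. This is an illustration under stated assumptions, not the authors' CUDA implementation: it uses a naive pure-Python DFT, circular convolution of even length, and hypothetical function names. One decimation-in-frequency step splits a length-N convolution into two independent length-N/2 convolutions, which is what lets each part be processed separately (for example, on a device with limited memory).

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (for illustration only)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out


def circular_convolve_direct(x, h):
    """Reference result computed straight from the definition."""
    n = len(x)
    return [sum(x[m] * h[(t - m) % n] for m in range(n)) for t in range(n)]


def circular_convolve_fft(x, h):
    """Convolution theorem: multiply the spectra, then transform back."""
    spectrum = [a * b for a, b in zip(dft(x), dft(h))]
    return dft(spectrum, inverse=True)


def dif_split(x):
    """One decimation-in-frequency step: two half-length signals whose DFTs
    are the even-indexed and odd-indexed bins of the full DFT of x."""
    n = len(x)
    half = n // 2
    even = [x[t] + x[t + half] for t in range(half)]
    odd = [(x[t] - x[t + half]) * cmath.exp(-2j * cmath.pi * t / n) for t in range(half)]
    return even, odd


def circular_convolve_dif(x, h):
    """Convolve via two independent half-size convolutions (len(x) must be even)."""
    n = len(x)
    half = n // 2
    xe, xo = dif_split(x)
    he, ho = dif_split(h)
    ye = circular_convolve_fft(xe, he)  # equals y[t] + y[t + half]
    yo = circular_convolve_fft(xo, ho)  # equals (y[t] - y[t + half]) * w^t
    y = [0j] * n
    for t in range(half):
        untwisted = yo[t] * cmath.exp(2j * cmath.pi * t / n)  # undo the twiddle factor
        y[t] = (ye[t] + untwisted) / 2
        y[t + half] = (ye[t] - untwisted) / 2
    return y


signal = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, -1.0, 0.0, 2.0]
reference = circular_convolve_direct(signal, kernel)
decomposed = circular_convolve_dif(signal, kernel)
```

The two half-size convolutions share no data, so they could be computed one after another in a smaller memory footprint or in parallel on separate devices; the split can also be applied recursively.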
