The ATLAS collaboration
2016-01-01
The ATLAS RunTimeTester is a job-based software test system. The RunTimeTester runs jobs, and optional tests on the job outputs. Job and test results are reported via a web site. The system currently runs ≈ 8000 jobs daily, and the web site receives ≈ 25K hits a week. This note provides an overview of the system.
Energy Technology Data Exchange (ETDEWEB)
Hong, Tianzhen; Buhl, Fred; Haves, Philip
2008-09-20
EnergyPlus is a new generation building performance simulation program offering many new modeling capabilities and more accurate performance calculations, integrating building components in sub-hourly time steps. However, EnergyPlus runs much slower than the current generation of simulation programs. This has become a major barrier to its widespread adoption by the industry. This paper analyzed EnergyPlus run time from comprehensive perspectives to identify key issues and challenges of speeding up EnergyPlus: studying the historical trends of EnergyPlus run time based on the advancement of computers and code improvements to EnergyPlus, comparing EnergyPlus with DOE-2 to understand and quantify the run time differences, identifying key simulation settings and model features that have significant impacts on run time, and performing code profiling to identify which EnergyPlus subroutines consume the most run time. This paper provides recommendations to improve EnergyPlus run time from the modeler's perspective and suggests adequate computing platforms. Suggestions of software code and architecture changes to improve EnergyPlus run time based on the code profiling results are also discussed.
Institute of Scientific and Technical Information of China (English)
李剑慧; 臧斌宇; 吴蓉; 朱传琪
2002-01-01
Parallelizing compilers have made great progress in recent years. However, there still remains a gap between the current ability of parallelizing compilers and their final goals. In order to achieve the maximum parallelism, run-time techniques have been used in parallelizing compilers during the last few years. First, this paper presents a basic run-time privatization method. The definition of run-time dead code is given and its side effect is discussed. To eliminate the imprecision caused by the run-time dead code, backward data-flow information must be used. Proteus Test, which can use backward information at run time, is then presented to exploit more dynamic parallelism. Also, a variation of Proteus Test, the Advanced Proteus Test, is offered to achieve partial parallelism. Proteus Test was implemented on the parallelizing compiler AFT. At the end of this paper the program fpppp.f of the Spec95fp Benchmark is taken as an example to show the effectiveness of Proteus Test.
An Ada run-time control architecture for telerobots
Balaram, J.; Rodriguez, G.
1987-01-01
This paper describes the architecture and Ada language implementation of a process-level run-time control subsystem for the Jet Propulsion Laboratory (JPL) telerobot system. The concept of run-time control in a combined robot-teleoperation environment is examined and the telerobot system at JPL is described. An Ada language implementation of the JPL Telerobot Run-Time Controller (RTC) is described by highlighting the functional behavior of the subsystem, defining the internal modules, and providing a functional flow time sequence of internal module activity.
Towards Run-time Assurance of Advanced Propulsion Algorithms
Wong, Edmond; Schierman, John D.; Schlapkohl, Thomas; Chicatelli, Amy
2014-01-01
This paper covers the motivation and rationale for investigating the application of run-time assurance methods as a potential means of providing safety assurance for advanced propulsion control systems. Certification is becoming increasingly infeasible for such systems using current verification practices. Run-time assurance systems hold the promise of certifying these advanced systems by continuously monitoring the state of the feedback system during operation and reverting to a simpler, certified system if anomalous behavior is detected. The discussion will also cover initial efforts underway to apply a run-time assurance framework to NASA's model-based engine control approach. Preliminary experimental results are presented and discussed.
Collaborative Simulation Run-time Management Environment Based on HLA
Institute of Scientific and Technical Information of China (English)
王江云; 柴旭东; 王行仁
2002-01-01
The Collaborative Simulation Run-time Management Environment based on HLA (CSRME) mainly focuses on simulation problems for the system design of the complex distributed simulation. CSRME can integrate all the simulation tools and simulation applications that comply with the well-documented interface standards defined by CSRME. CSRME supports both the interoperability of different simulations and the integration of simulation tools, as well as provides simulation run-time management, simulation time management and simulation data management. Finally, the distributed command training system is analyzed and realized to validate the theories of CSRME.
Combining monitoring with run-time assertion checking
Gouw, Stijn de
2013-01-01
We develop a new technique for Run-time Checking for two object-oriented languages: Java and the Abstract Behavioral Specification language ABS. In object-oriented languages, objects communicate by sending each other messages. Assuming encapsulation, the behavior of objects is completely determined
Run-time mapping: dynamic resource allocation in embedded systems
Braak, ter Timon David
2016-01-01
Many desired features of computing platforms, such as increased fault tolerance, variable quality of service, and improved energy efficiency, can be achieved by postponing resource management decisions from design-time to run-time. While multiprocessing has been widespread in embedded systems for q
GPU-centric resolved-particle disperse two-phase flow simulation using the Physalis method
Sierakowski, Adam J.
2016-10-01
We present work on a new implementation of the Physalis method for resolved-particle disperse two-phase flow simulations. We discuss specifically our GPU-centric programming model that avoids all device-host data communication during the simulation. Summarizing the details underlying the implementation of the Physalis method, we illustrate the application of two GPU-centric parallelization paradigms and record insights on how to best leverage the GPU's prioritization of bandwidth over latency. We perform a comparison of the computational efficiency between the current GPU-centric implementation and a legacy serial-CPU-optimized code and conclude that the GPU hardware accounts for run time improvements up to a factor of 60 by carefully normalizing the run times of both codes.
Designing Run-Time Environments to Have Predefined Global Dynamics
Directory of Open Access Journals (Sweden)
Massimo Monti
2013-06-01
The stability and the predictability of a computer network algorithm's performance are as important as the main functional purpose of networking software. However, asserting or deriving such properties from the finite state machine implementations of protocols is hard and, except for singular cases like TCP, is not done today. In this paper, we propose to design and study run-time environments for networking protocols which inherently enforce desirable, predictable global dynamics. To this end we merge two complementary design approaches: (i) a design-time and bottom-up approach that enables us to engineer algorithms based on an analyzable (reaction flow) model; (ii) a run-time and top-down approach based on an autonomous stack composition framework, which switches among implementation alternatives to find optimal operation configurations. We demonstrate the feasibility of our self-optimizing system in both simulations and real-world Internet setups.
Papaya Tree Detection with UAV Images Using a GPU-Accelerated Scale-Space Filtering Method
Directory of Open Access Journals (Sweden)
Hao Jiang
2017-07-01
The use of unmanned aerial vehicles (UAV) can allow individual tree detection for forest inventories in a cost-effective way. The scale-space filtering (SSF) algorithm is commonly used and has the capability of detecting trees of different crown sizes. In this study, we made two improvements with regard to the existing method and implementations. First, we incorporated SSF with a Lab color transformation to reduce over-detection problems associated with the original luminance image. Second, we ported four of the most time-consuming processes to the graphics processing unit (GPU) to improve computational efficiency. The proposed method was implemented using PyCUDA, which enabled access to NVIDIA’s compute unified device architecture (CUDA) through high-level scripting of the Python language. Our experiments were conducted using two images captured by the DJI Phantom 3 Professional and a recent NVIDIA GTX 1080 GPU. The resulting accuracy was high, with an F-measure larger than 0.94. The speedup achieved by our parallel implementation was 44.77 and 28.54 for the first and second test image, respectively. For each 4000 × 3000 image, the total runtime was less than 1 s, which was sufficient for real-time performance and interactive application.
GPU computing and applications
See, Simon
2015-01-01
This book presents a collection of state-of-the-art research on GPU Computing and Applications. The major part of this book is selected from the work presented at the 2013 Symposium on GPU Computing and Applications held at Nanyang Technological University, Singapore (Oct 9, 2013). Three major domains of GPU application are covered in the book including (1) Engineering design and simulation; (2) Biomedical Sciences; and (3) Interactive & Digital Media. The book also addresses the fundamental issues in GPU computing with a focus on big data processing. Researchers and developers in GPU Computing and Applications will benefit from this book. Training professionals and educators can also benefit from this book to learn the possible application of GPU technology in various areas.
A GPU-based calculation using the three-dimensional FDTD method for electromagnetic field analysis.
Nagaoka, Tomoaki; Watanabe, Soichi
2010-01-01
Numerical simulations with the numerical human model using the finite-difference time domain (FDTD) method have recently been performed frequently in a number of fields in biomedical engineering. However, the FDTD calculation runs too slowly. We focus, therefore, on general purpose programming on the graphics processing unit (GPGPU). The three-dimensional FDTD method was implemented on the GPU using Compute Unified Device Architecture (CUDA). In this study, we used the NVIDIA Tesla C1060 as a GPGPU board. The performance of the GPU is evaluated in comparison with the performance of a conventional CPU and a vector supercomputer. The results indicate that three-dimensional FDTD calculations using a GPU can significantly reduce run time in comparison with that using a conventional CPU, even a native GPU implementation of the three-dimensional FDTD method, while the GPU/CPU speed ratio varies with the calculation domain and thread block size.
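The one-thread-per-grid-point structure that makes FDTD attractive on a GPU can be sketched in a few lines of CUDA. The following is an illustrative sketch only, not the implementation evaluated above; the array layout, the single Ez update, the coefficient cb and the boundary handling are all simplifying assumptions.

```cuda
#include <cuda_runtime.h>

// Flattened 3D indexing for an nx * ny * nz grid.
#define IDX(i, j, k, nx, ny) ((size_t)(k) * (ny) * (nx) + (size_t)(j) * (nx) + (i))

// One Ez update of a Yee grid; cb lumps the dt/(eps*dx) coefficient.
__global__ void update_ez(float* ez, const float* hx, const float* hy,
                          int nx, int ny, int nz, float cb)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i < 1 || j < 1 || i >= nx || j >= ny || k >= nz) return;  // boundary cells skipped

    size_t c = IDX(i, j, k, nx, ny);
    float curl_h = (hy[c] - hy[IDX(i - 1, j, k, nx, ny)])
                 - (hx[c] - hx[IDX(i, j - 1, k, nx, ny)]);
    ez[c] += cb * curl_h;
}

int main()
{
    const int nx = 128, ny = 128, nz = 128;
    const size_t bytes = (size_t)nx * ny * nz * sizeof(float);
    float *d_ez, *d_hx, *d_hy;
    cudaMalloc(&d_ez, bytes); cudaMalloc(&d_hx, bytes); cudaMalloc(&d_hy, bytes);
    cudaMemset(d_ez, 0, bytes); cudaMemset(d_hx, 0, bytes); cudaMemset(d_hy, 0, bytes);

    dim3 block(8, 8, 4);
    dim3 grid((nx + 7) / 8, (ny + 7) / 8, (nz + 3) / 4);
    for (int step = 0; step < 100; ++step)     // H-field updates omitted in this sketch
        update_ez<<<grid, block>>>(d_ez, d_hx, d_hy, nx, ny, nz, 0.5f);

    cudaDeviceSynchronize();
    cudaFree(d_ez); cudaFree(d_hx); cudaFree(d_hy);
    return 0;
}
```

The thread block size chosen here is arbitrary; as the abstract notes, the achieved speedup depends on both the calculation domain and the thread block size.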
Combining Compile-Time and Run-Time Parallelization
Directory of Open Access Journals (Sweden)
Sungdo Moon
1999-01-01
This paper demonstrates that significant improvements to automatic parallelization technology require that existing systems be extended in two ways: (1) they must combine high-quality compile-time analysis with low-cost run-time testing; and (2) they must take control flow into account during analysis. We support this claim with the results of an experiment that measures the safety of parallelization at run time for loops left unparallelized by the Stanford SUIF compiler's automatic parallelization system. We present results of measurements on programs from two benchmark suites – SPECFP95 and NAS sample benchmarks – which identify inherently parallel loops in these programs that are missed by the compiler. We characterize remaining parallelization opportunities, and find that most of the loops require run-time testing, analysis of control flow, or some combination of the two. We present a new compile-time analysis technique that can be used to parallelize most of these remaining loops. This technique is designed to not only improve the results of compile-time parallelization, but also to produce low-cost, directed run-time tests that allow the system to defer binding of parallelization until run time when safety cannot be proven statically. We call this approach predicated array data-flow analysis. We augment array data-flow analysis, which the compiler uses to identify independent and privatizable arrays, by associating predicates with array data-flow values. Predicated array data-flow analysis allows the compiler to derive “optimistic” data-flow values guarded by predicates; these predicates can be used to derive a run-time test guaranteeing the safety of parallelization.
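To make the idea of a directed run-time test concrete, here is a hedged CUDA C++ sketch, not taken from the SUIF system: a shift loop a[i + m] = a[i] + b[i] whose independence depends on a value m known only at run time, so the binding of parallelization is deferred until a cheap test succeeds. The loop, the test and all names are illustrative, and m is assumed non-negative.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Loop body of  a[i + m] = a[i] + b[i]  for i = 0..n-1, one thread per iteration.
__global__ void shifted_add(float* a, const float* b, int n, int m)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i + m] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    int m = 42;                                 // shift known only at run time (m >= 0)
    std::vector<float> a(n + m, 1.0f), b(n, 2.0f);

    // Directed run-time test: the loop carries no dependence iff m == 0 or m >= n.
    bool independent = (m == 0) || (m >= n);
    if (independent) {
        float *d_a, *d_b;
        cudaMalloc(&d_a, (n + m) * sizeof(float));
        cudaMalloc(&d_b, n * sizeof(float));
        cudaMemcpy(d_a, a.data(), (n + m) * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b.data(), n * sizeof(float), cudaMemcpyHostToDevice);
        shifted_add<<<(n + 255) / 256, 256>>>(d_a, d_b, n, m);
        cudaMemcpy(a.data(), d_a, (n + m) * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_a); cudaFree(d_b);
    } else {
        for (int i = 0; i < n; ++i)             // dependence possible: keep sequential order
            a[i + m] = a[i] + b[i];
    }
    return 0;
}
```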
GPU-based Parallel Application Design for Emerging Mobile Devices
Gupta, Kshitij
A revolution is underway in the computing world that is causing a fundamental paradigm shift in device capabilities and form-factor, with a move from well-established legacy desktop/laptop computers to mobile devices in varying sizes and shapes. Amongst all the tasks these devices must support, graphics has emerged as the 'killer app' for providing a fluid user interface and high-fidelity game rendering, effectively making the graphics processor (GPU) one of the key components in (present and future) mobile systems. By utilizing the GPU as a general-purpose parallel processor, this dissertation explores the GPU computing design space from an applications standpoint, in the mobile context, by focusing on key challenges presented by these devices---limited compute, memory bandwidth, and stringent power consumption requirements---while improving the overall application efficiency of the increasingly important speech recognition workload for mobile user interaction. We broadly partition trends in GPU computing into four major categories. We analyze hardware and programming model limitations in current-generation GPUs and detail an alternate programming style called Persistent Threads, identify four use case patterns, and propose minimal modifications that would be required for extending native support. We show how by manually extracting data locality and altering the speech recognition pipeline, we are able to achieve significant savings in memory bandwidth while simultaneously reducing the compute burden on GPU-like parallel processors. As we foresee GPU computing to evolve from its current 'co-processor' model into an independent 'applications processor' that is capable of executing complex work independently, we create an alternate application framework that enables the GPU to handle all control-flow dependencies autonomously at run-time while minimizing host involvement to just issuing commands, that facilitates an efficient application implementation. Finally, as
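The Persistent Threads style mentioned above can be illustrated with a small CUDA sketch; this is not the dissertation's code, and the work-queue layout, kernel and sizes are assumptions. A fixed number of resident blocks stays alive and repeatedly pulls work items from a global counter, instead of launching one thread per item.

```cuda
#include <cuda_runtime.h>

// Persistent-threads pattern: resident blocks loop over a global work queue.
__global__ void persistent_worker(const float* in, float* out, int n,
                                  int n_items, int* next_item)
{
    __shared__ int item;
    for (;;) {
        if (threadIdx.x == 0)
            item = atomicAdd(next_item, 1);      // grab the next work item
        __syncthreads();
        if (item >= n_items) return;             // queue drained: the block retires
        int i = item * blockDim.x + threadIdx.x; // element covered by this thread
        if (i < n) out[i] = 2.0f * in[i];        // the actual per-item work
        __syncthreads();                         // keep `item` stable until all threads used it
    }
}

int main()
{
    const int n = 1 << 22, block = 256;
    const int n_items = (n + block - 1) / block;

    float *d_in, *d_out; int *d_next;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMalloc(&d_next, sizeof(int));
    cudaMemset(d_in, 0, n * sizeof(float));
    cudaMemset(d_next, 0, sizeof(int));

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int resident_blocks = prop.multiProcessorCount * 2;   // just enough to fill the chip

    persistent_worker<<<resident_blocks, block>>>(d_in, d_out, n, n_items, d_next);
    cudaDeviceSynchronize();
    cudaFree(d_in); cudaFree(d_out); cudaFree(d_next);
    return 0;
}
```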
An approach to quantifying the run-time behaviour of Java GUI applications
Mitchell, Aine; Power, James F.
2004-01-01
This paper outlines a new technique for collecting dynamic trace information from Java GUI programs. The problems of collecting run-time information from such interactive applications, in comparison with traditional batch-style execution benchmark programs, are outlined. The possible utility of such run-time information is discussed and from this a number of simple run-time metrics are suggested. The metrics results for a small CelsiusConverter Java GUI program are illustrated to ...
Monte Carlo integration on GPU
Kanzaki, J.
2010-01-01
We use a graphics processing unit (GPU) for fast computations of Monte Carlo integrations. Two widely used Monte Carlo integration programs, VEGAS and BASES, are parallelized on GPU. By using $W^{+}$ plus multi-gluon production processes at LHC, we test integrated cross sections and execution time for programs in FORTRAN and C on CPU and those on GPU. Integrated results agree with each other within statistical errors. Programs on the GPU run about 50 times faster than those in C...
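As a minimal, self-contained illustration of GPU Monte Carlo integration, unrelated to the VEGAS/BASES ports themselves, the CUDA sketch below has each thread evaluate a simple integrand at its own random points; the integrand, sample counts and seed are arbitrary choices.

```cuda
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <cstdio>
#include <vector>

// Each thread integrates f(x) = 4 / (1 + x^2) over [0,1] at `samples` random
// points and stores its partial mean; the exact value of the integral is pi.
__global__ void mc_partial(float* partial, int samples, unsigned long long seed)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    curandState state;
    curand_init(seed, tid, 0, &state);

    float sum = 0.0f;
    for (int s = 0; s < samples; ++s) {
        float x = curand_uniform(&state);        // uniform in (0, 1]
        sum += 4.0f / (1.0f + x * x);
    }
    partial[tid] = sum / samples;
}

int main()
{
    const int blocks = 256, threads = 256, samples = 4096;
    const int n = blocks * threads;

    float* d_partial;
    cudaMalloc(&d_partial, n * sizeof(float));
    mc_partial<<<blocks, threads>>>(d_partial, samples, 1234ULL);

    std::vector<float> h(n);
    cudaMemcpy(h.data(), d_partial, n * sizeof(float), cudaMemcpyDeviceToHost);

    double estimate = 0.0;
    for (float v : h) estimate += v;             // average the per-thread means
    estimate /= n;
    std::printf("integral ~ %.6f (expected pi)\n", estimate);

    cudaFree(d_partial);
    return 0;
}
```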
Run-time assertion checking of JML annotations in multithreaded applications with e-OpenJML
Kandziora, Jorne; Huisman, Marieke; Bockisch, Christoph; Zaharieva-Stojanovski, Marina; Monahan, R.
2015-01-01
Run-time assertion checking of multithreaded programs is challenging, as assertion evaluation should not interfere with the execution of other threads. This paper describes the prototype implementation of a run-time assertion checker that achieves this by evaluating assertions over snapshots of the
Comparison of Sprint and Run Times with Performance on the Wingate Anaerobic Test.
Tharp, Gerald D.; And Others
1985-01-01
Male volunteers were studied to examine the relationship between the Wingate Anaerobic Test (WAnT) and sprint-run times and to determine the influence of age and weight. Results indicate the WAnT is a moderate predictor of dash and run times but becomes a stronger predictor when adjusted for body weight. (Author/MT)
Strong Normalization by Type-Directed Partial Evaluation and Run-Time Code Generation
DEFF Research Database (Denmark)
Balat, Vincent; Danvy, Olivier
1997-01-01
We investigate the synergy between type-directed partial evaluation and run-time code generation for the Caml dialect of ML. Type-directed partial evaluation maps simply typed, closed Caml values to a representation of their long βη-normal form. Caml uses a virtual machine and has the capability to load byte code at run time. Representing the long βη-normal forms as byte code gives us the ability to strongly normalize higher-order values (i.e., weak head normal forms in ML), to compile the resulting strong normal forms into byte code, and to load this byte code all in one go, at run time. We conclude this note with a preview of our current work on scaling up strong normalization by run-time code generation to the Caml module language.
Strong normalization by type-directed partial evaluation and run-time code generation
DEFF Research Database (Denmark)
Balat, Vincent; Danvy, Olivier
1998-01-01
We investigate the synergy between type-directed partial evaluation and run-time code generation for the Caml dialect of ML. Type-directed partial evaluation maps simply typed, closed Caml values to a representation of their long βη-normal form. Caml uses a virtual machine and has the capability to load byte code at run time. Representing the long βη-normal forms as byte code gives us the ability to strongly normalize higher-order values (i.e., weak head normal forms in ML), to compile the resulting strong normal forms into byte code, and to load this byte code all in one go, at run time. We conclude this note with a preview of our current work on scaling up strong normalization by run-time code generation to the Caml module language.
National Aeronautics and Space Administration — The following background technology is described in Part 5: Run-time Verification (RV), White Box Automatic Test Generation (WBATG). Part 5 also describes how WBATG...
CrystalGPU: Transparent and Efficient Utilization of GPU Power
Gharaibeh, Abdullah; Al-Kiswany, Samer; Ripeanu, Matei
2010-01-01
General-purpose computing on graphics processing units (GPGPU) has recently gained considerable attention in various domains such as bioinformatics, databases and distributed computing. GPGPU is based on using the GPU as a co-processor accelerator to offload computationally-intensive tasks from the CPU. This study starts from the observation that a number of GPU features (such as overlapping communication and computation, short lived buffer reuse, and harnessing multi-GPU systems) can be abst...
gpuPOM: a GPU-based Princeton Ocean Model
Directory of Open Access Journals (Sweden)
S. Xu
2014-11-01
Rapid advances in the performance of the graphics processing unit (GPU) have made the GPU a compelling solution for a series of scientific applications. However, most existing GPU acceleration works for climate models are doing partial code porting for certain hot spots, and can only achieve limited speedup for the entire model. In this work, we take the mpiPOM (a parallel version of the Princeton Ocean Model) as our starting point, and design and implement a GPU-based Princeton Ocean Model. By carefully considering the architectural features of the state-of-the-art GPU devices, we rewrite the full mpiPOM model from the original Fortran version into a new Compute Unified Device Architecture C (CUDA-C) version. We take several accelerating methods to further improve the performance of gpuPOM, including optimizing memory access in a single GPU, overlapping communication and boundary operations among multiple GPUs, and overlapping input/output (I/O) between the hybrid Central Processing Unit (CPU) and the GPU. Our experimental results indicate that the performance of the gpuPOM on a workstation containing 4 GPUs is comparable to a powerful cluster with 408 CPU cores and it reduces the energy consumption by 6.8 times.
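One of the optimizations listed, overlapping communication and boundary operations with interior computation, is typically expressed with CUDA streams. The sketch below is illustrative only: the kernels, sizes and the stubbed halo source are assumptions, and a real multi-GPU ocean model would exchange halos between neighbouring devices rather than from a host buffer.

```cuda
#include <cuda_runtime.h>

// Interior update: every cell not on the first or last row is advanced locally.
__global__ void update_interior(float* f, int nx, int ny)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < nx && j > 0 && j < ny - 1) f[j * nx + i] += 0.1f;
}

// Boundary update: the first row is refreshed from the received halo.
__global__ void update_boundary(float* f, const float* halo, int nx)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nx) f[i] = halo[i];
}

int main()
{
    const int nx = 1024, ny = 1024;
    float *d_f, *d_halo, *h_halo;
    cudaMalloc(&d_f, nx * ny * sizeof(float));
    cudaMalloc(&d_halo, nx * sizeof(float));
    cudaMallocHost(&h_halo, nx * sizeof(float));  // pinned memory enables async copies
    cudaMemset(d_f, 0, nx * ny * sizeof(float));
    for (int i = 0; i < nx; ++i) h_halo[i] = 0.0f;

    cudaStream_t compute, comm;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&comm);

    dim3 b2(16, 16), g2((nx + 15) / 16, (ny + 15) / 16);
    for (int step = 0; step < 100; ++step) {
        // Interior work and the halo transfer proceed concurrently in two streams.
        update_interior<<<g2, b2, 0, compute>>>(d_f, nx, ny);
        cudaMemcpyAsync(d_halo, h_halo, nx * sizeof(float),
                        cudaMemcpyHostToDevice, comm);
        update_boundary<<<(nx + 255) / 256, 256, 0, comm>>>(d_f, d_halo, nx);
        cudaStreamSynchronize(comm);
        cudaStreamSynchronize(compute);
    }

    cudaStreamDestroy(compute); cudaStreamDestroy(comm);
    cudaFreeHost(h_halo); cudaFree(d_halo); cudaFree(d_f);
    return 0;
}
```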
Sailfish: a flexible multi-GPU implementation of the lattice Boltzmann method
Januszewski, Michal
2013-01-01
We present Sailfish, an open source fluid simulation package implementing the lattice Boltzmann method (LBM) on modern Graphics Processing Units (GPUs) using CUDA/OpenCL. We take a novel approach to GPU code implementation and use run-time code generation techniques and a high level programming language (Python) to achieve state of the art performance, while allowing easy experimentation with different LBM models and tuning for various types of hardware. We discuss the general design principles of the code, scaling to multiple GPUs in a distributed environment, as well as the GPU implementation and optimization of many different LBM models, both single component (BGK, MRT, ELBM) and multicomponent (Shan-Chen, free energy). The paper also presents results of performance benchmarks spanning the last three NVIDIA GPU generations (Tesla, Fermi, Kepler), which we hope will be useful for researchers working with this type of hardware and similar codes.
3D data denoising via Nonlocal Means filter by using parallel GPU strategies.
Cuomo, Salvatore; De Michele, Pasquale; Piccialli, Francesco
2014-01-01
The Nonlocal Means (NLM) algorithm is widely considered a state-of-the-art denoising filter in many research fields. Its high computational complexity leads researchers to the development of parallel programming approaches and the use of massively parallel architectures such as GPUs. In recent years, GPU devices have made it possible to achieve reasonable running times by filtering 3D datasets slice-by-slice with a 2D NLM algorithm. In our approach we design and implement a fully 3D NonLocal Means parallel approach, adopting different algorithm mapping strategies on GPU architecture and a multi-GPU framework, in order to demonstrate its high applicability and scalability. The experimental results we obtained encourage the usability of our approach in a large spectrum of applicative scenarios such as magnetic resonance imaging (MRI) or video sequence denoising.
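For orientation, a fully 3D NLM update for a single voxel has roughly the following shape. This CUDA kernel is a simplified illustration only (fixed 3x3x3 patch, small search window, untreated borders, no shared-memory tiling), not the mapping strategies described in the paper.

```cuda
#include <cuda_runtime.h>

// Flattened 3D access into an nx * ny * nz volume.
#define AT(p, i, j, k) ((p)[(size_t)(k) * nx * ny + (size_t)(j) * nx + (i)])

__global__ void nlm3d(const float* in, float* out,
                      int nx, int ny, int nz, float h2)   // h2: filtering parameter squared
{
    const int R = 2, P = 1;                                // search and patch radii
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x < R + P || y < R + P || z < R + P ||
        x >= nx - R - P || y >= ny - R - P || z >= nz - R - P) return;

    float wsum = 0.0f, acc = 0.0f;
    for (int dz = -R; dz <= R; ++dz)
    for (int dy = -R; dy <= R; ++dy)
    for (int dx = -R; dx <= R; ++dx) {
        float d2 = 0.0f;                                   // squared patch distance
        for (int pz = -P; pz <= P; ++pz)
        for (int py = -P; py <= P; ++py)
        for (int px = -P; px <= P; ++px) {
            float diff = AT(in, x + px, y + py, z + pz)
                       - AT(in, x + dx + px, y + dy + py, z + dz + pz);
            d2 += diff * diff;
        }
        float w = __expf(-d2 / h2);                        // patch-similarity weight
        wsum += w;
        acc  += w * AT(in, x + dx, y + dy, z + dz);
    }
    AT(out, x, y, z) = acc / wsum;
}

// Possible launch: dim3 b(8, 8, 4), g((nx + 7) / 8, (ny + 7) / 8, (nz + 3) / 4);
// nlm3d<<<g, b>>>(d_in, d_out, nx, ny, nz, 0.05f);
```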
Methods of Run-Time Error Detection in Distributed Process Control Software
DEFF Research Database (Denmark)
Drejer, N.
In this thesis, methods of run-time error detection in application software for distributed process control are designed. The error detection is based upon a monitoring approach in which application software is monitored by system software during the entire execution. The thesis includes definition of generic run-time error types, design of methods of observing application software behavior during execution, and design of methods of evaluating run-time constraints. In the definition of error types it is attempted to cover all relevant aspects of the application software behavior. Methods of observation and constraint evaluation are designed for the most interesting error types. These include: a) semantical errors in data communicated between application tasks; b) errors in the execution of application tasks; and c) errors in the timing of distributed events emitted by the application software.
Design Flow Instantiation for Run-Time Reconfigurable Systems: A Case Study
Directory of Open Access Journals (Sweden)
Yang Qu
2007-12-01
Reconfigurable systems are a promising alternative to deliver both flexibility and performance at the same time. New reconfigurable technologies and technology-dependent tools have been developed, but a complete overview of the whole design flow for run-time reconfigurable systems is missing. In this work, we present a design flow instantiation for such systems using a real-life application. The design flow is roughly divided into two parts: system level and implementation. At system level, our support for hardware resource estimation and performance evaluation is applied. At implementation level, technology-dependent tools are used to realize the run-time reconfiguration. The design case is part of a WCDMA decoder on a commercially available reconfigurable platform. The results show that using run-time reconfiguration can save over 40% area when compared to a functionally equivalent fixed system and achieve 30 times speedup in processing time when compared to a functionally equivalent pure software design.
The family of c-bisection auctions: efficiency and running time
Grigorieva Elena; Herings P. Jean-Jacques; Müller Rudolf; Vermeulen Dries
2006-01-01
In this paper we analyze the performance of a recently proposed sequential auction, called the c-bisection auction, that can be used for a sale of a single indivisible object. We discuss the running time and the efficiency in the ex-post equilibrium of the auction. We show that by changing the parameter c of the auction we can trade off efficiency against running time. Moreover, we show that the auction that gives the desired level of efficiency in expectation takes the same number of rounds for any...
Adaptive Embedded Systems – Challenges of Run-Time Resource Management
DEFF Research Database (Denmark)
Understanding and efficiently controlling the dynamic behavior of adaptive embedded systems is a challenging endeavor. The challenges come from the often very complicated interplay between the application, the application mapping, and the underlying hardware architecture. With MPSoC, we have the technology to design and fabricate dynamically reconfigurable hardware platforms. However, such platforms will pose new challenges to tools and methods to efficiently explore these platforms at run-time. This talk will address some of the challenges of run-time resource management in adaptive embedded systems.
Run-Time and Compiler Support for Programming in Adaptive Parallel Environments
Directory of Open Access Journals (Sweden)
Guy Edjlali
1997-01-01
For better utilization of computing resources, it is important to consider parallel programming environments in which the number of available processors varies at run-time. In this article, we discuss run-time support for data-parallel programming in such an adaptive environment. Executing programs in an adaptive environment requires redistributing data when the number of processors changes, and also requires determining new loop bounds and communication patterns for the new set of processors. We have developed a run-time library to provide this support. We discuss how the run-time library can be used by compilers of high-performance Fortran (HPF)-like languages to generate code for an adaptive environment. We present performance results for a Navier-Stokes solver and a multigrid template run on a network of workstations and an IBM SP-2. Our experiments show that if the number of processors is not varied frequently, the cost of data redistribution is not significant compared to the time required for the actual computation. Overall, our work establishes the feasibility of compiling HPF for a network of nondedicated workstations, which are likely to be an important resource for parallel programming in the future.
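The abstract notes that changing the processor count requires recomputing loop bounds. Assuming a simple block distribution of iterations (an assumption for illustration; the actual library handles general HPF-style distributions and the accompanying data movement), the recomputation is plain integer arithmetic:

```cuda
#include <cstdio>

// Block distribution of N iterations over P workers. When P changes at run
// time, every worker re-derives its [lo, hi) range from the same formula.
void block_bounds(long long N, int P, int rank, long long* lo, long long* hi)
{
    long long base = N / P, rem = N % P;         // the first `rem` ranks get one extra
    *lo = rank * base + (rank < rem ? rank : rem);
    *hi = *lo + base + (rank < rem ? 1 : 0);
}

int main()
{
    const long long N = 1000;
    for (int P : {4, 6}) {                       // processor count changed at run time
        std::printf("P = %d\n", P);
        for (int r = 0; r < P; ++r) {
            long long lo, hi;
            block_bounds(N, P, r, &lo, &hi);
            std::printf("  rank %d: iterations [%lld, %lld)\n", r, lo, hi);
        }
    }
    return 0;
}
```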
Running time supplements: Energy-efficient train control versus robust timetables
Goverde, R.M.P.; Scheepmaker, G.M.
2015-01-01
Energy-efficient train operation is not yet included in the timetable design process in the Netherlands. Hence, running time supplements are not optimally distributed in the timetable. Therefore research has been conducted on the possibilities to better incorporate energy-efficient train operation i
A Compiler and Run-time System for Network Programming Languages
2012-01-01
Christopher Monsanto (Princeton University); Nate Foster (Cornell University); Rob…
Efficient datapath merging for the overhead reduction of run-time reconfigurable systems
Fazlali, M.; Zakerolhosseini, A.; Gaydadjiev, G.
2010-01-01
High latencies in FPGA reconfiguration are known as a major overhead in run-time reconfigurable systems. This overhead can be reduced by merging multiple data flow graphs representing different kernels of the original program into a single (merged) datapath that will be configured less often
Implementation of a Learning Design Run-Time Environment for the .LRN Learning Management System
del Cid, Jose Pablo Escobedo; de la Fuente Valentin, Luis; Gutierrez, Sergio; Pardo, Abelardo; Kloos, Carlos Delgado
2007-01-01
The IMS Learning Design specification aims at capturing the complete learning flow of courses, without being restricted to a particular pedagogical model. Such flow description for a course, called a Unit of Learning, must be able to be reproduced in different systems using a so called run-time environment. In the last few years there has been…
Masset, Frédéric
2015-09-01
GFARGO is a GPU version of FARGO. It is written in C and C for CUDA and runs only on NVIDIA’s graphics cards. Though it corresponds to the standard, isothermal version of FARGO, not all functionalities of the CPU version have been translated to CUDA. The code is available in single and double precision versions, the latter compatible with FERMI architectures. GFARGO can run on a graphics card connected to the display, allowing the user to see in real time how the fields evolve.
GpuCV : a GPU-accelerated framework for image processing and computer vision
ALLUSSE, Yannick; Horain, Patrick; Agarwal, Ankit; Saipriyadarshan, Cindula
2008-01-01
This paper briefly describes the state of the art of accelerating image processing with graphics hardware (GPU) and discusses some of its caveats. Then it describes GpuCV, an open source multi-platform library for GPU-accelerated image processing and Computer Vision operators and applications. It is meant for computer vision scientists not familiar with GPU technologies. GpuCV is designed to be compatible with the popular OpenCV library by offering GPU-accelera...
Energy Technology Data Exchange (ETDEWEB)
2016-07-15
The Kokkos Clang compiler is a version of the Clang C++ compiler that has been modified to perform targeted code generation for Kokkos constructs, with the goal of generating highly optimized code and providing semantic (domain) awareness of these constructs, such as parallel for and parallel reduce, throughout the compilation toolchain. This approach is taken to explore the possibilities of exposing the developer’s intentions to the underlying compiler infrastructure (e.g. optimization and analysis passes within the middle stages of the compiler) instead of relying solely on the restricted capabilities of C++ template metaprogramming. To date our current activities have focused on correct GPU code generation and thus we have not yet focused on improving overall performance. The compiler is implemented by recognizing specific (syntactic) Kokkos constructs in order to bypass normal template expansion mechanisms and instead use the semantic knowledge of Kokkos to directly generate code in the compiler’s intermediate representation (IR), which is then translated into an NVIDIA-centric GPU program and supporting runtime calls. In addition, capturing and maintaining the higher-level semantics of Kokkos directly within the lower levels of the compiler has the potential to significantly improve the ability of the compiler to communicate with the developer in terms of their original programming model/semantics.
GPU Computing Gems Emerald Edition
Hwu, Wen-mei W
2011-01-01
".the perfect companion to Programming Massively Parallel Processors by Hwu & Kirk." -Nicolas Pinto, Research Scientist at Harvard & MIT, NVIDIA Fellow 2009-2010 Graphics processing units (GPUs) can do much more than render graphics. Scientists and researchers increasingly look to GPUs to improve the efficiency and performance of computationally-intensive experiments across a range of disciplines. GPU Computing Gems: Emerald Edition brings their techniques to you, showcasing GPU-based solutions including: Black hole simulations with CUDA GPU-accelerated computation and interactive display of
GPU-accelerated compressive holography.
Endo, Yutaka; Shimobaba, Tomoyoshi; Kakue, Takashi; Ito, Tomoyoshi
2016-04-18
In this paper, we show fast signal reconstruction for compressive holography using a graphics processing unit (GPU). We implemented a fast iterative shrinkage-thresholding algorithm on a GPU to solve the ℓ1 and total variation (TV) regularized problems that are typically used in compressive holography. Since the algorithm is highly parallel, GPUs can compute it efficiently by data-parallel computing. For better performance, our implementation exploits the structure of the measurement matrix to compute the matrix multiplications. The results show that GPU-based implementation is about 20 times faster than CPU-based implementation.
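The per-element shrinkage (soft-thresholding) step at the core of such iterative algorithms is trivially data-parallel, which is one reason the method maps well to a GPU. A minimal CUDA sketch of that step follows; it is illustrative only, and the gradient-step buffer, threshold value and names are placeholders rather than the paper's implementation.

```cuda
#include <cuda_runtime.h>

// Element-wise soft thresholding: x = sign(v) * max(|v| - thresh, 0),
// where v is the estimate after the data-fidelity gradient step.
__global__ void soft_threshold(float* x, const float* grad_step, float thresh, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = grad_step[i];
    float mag = fabsf(v) - thresh;
    x[i] = (mag > 0.0f) ? copysignf(mag, v) : 0.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_x, *d_g;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_g, n * sizeof(float));
    cudaMemset(d_g, 0, n * sizeof(float));   // placeholder for the gradient-step result
    soft_threshold<<<(n + 255) / 256, 256>>>(d_x, d_g, 0.01f, n);
    cudaDeviceSynchronize();
    cudaFree(d_x); cudaFree(d_g);
    return 0;
}
```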
Engel, Wolfgang
2011-01-01
This book focuses on advanced rendering techniques that run on the DirectX and/or OpenGL run-time with any shader language available. It includes articles on the latest and greatest techniques in real-time rendering, including MLAA, adaptive volumetric shadow maps, light propagation volumes, wrinkle animations, and much more. The book emphasizes techniques for handheld programming to reflect the increased importance of graphics on mobile devices. It covers geometry manipulation, effects in image space, shadows, 3D engine design, GPGPU, and graphics-related tools. Source code and other materials
Ammazzalorso, F.; Bednarz, T.; Jelen, U.
2014-03-01
We demonstrate acceleration on graphic processing units (GPU) of automatic identification of robust particle therapy beam setups, minimizing negative dosimetric effects of Bragg peak displacement caused by treatment-time patient positioning errors. Our particle therapy research toolkit, RobuR, was extended with OpenCL support and used to implement calculation on GPU of the Port Homogeneity Index, a metric scoring irradiation port robustness through analysis of tissue density patterns prior to dose optimization and computation. Results were benchmarked against an independent native CPU implementation. Numerical results were in agreement between the GPU implementation and native CPU implementation. For 10 skull base cases, the GPU-accelerated implementation was employed to select beam setups for proton and carbon ion treatment plans, which proved to be dosimetrically robust, when recomputed in presence of various simulated positioning errors. From the point of view of performance, average running time on the GPU decreased by at least one order of magnitude compared to the CPU, rendering the GPU-accelerated analysis a feasible step in a clinical treatment planning interactive session. In conclusion, selection of robust particle therapy beam setups can be effectively accelerated on a GPU and become an unintrusive part of the particle therapy treatment planning workflow. Additionally, the speed gain opens new usage scenarios, like interactive analysis manipulation (e.g. constraining of some setup) and re-execution. Finally, through OpenCL portable parallelism, the new implementation is suitable also for CPU-only use, taking advantage of multiple cores, and can potentially exploit types of accelerators other than GPUs.
Supporting Multiprocessors in the Icecap Safety-Critical Java Run-Time Environment
DEFF Research Database (Denmark)
Zhao, Shuai; Wellings, Andy; Korsholm, Stephan Erbs
2015-01-01
scheduled. As of yet, there is no official Reference Implementation for SCJ. However, the icecap project has produced a Safety-Critical Java Run-time Environment based on the Hardware-near Virtual Machine (HVM). This supports SCJ at all compliance levels and provides an implementation of the safety-critical Java (javax.safetycritical) package. This is still work-in-progress and lacks certain key features. Among these is the ability to support multiprocessor platforms. In this paper, we explore two possible options for adding multiprocessor support to this environment: the “green thread” and the “native thread” approaches. The “native thread” approach is adopted and the design and implementation of a revised icecap SCJ run-time environment is discussed.
Run-time system based on LinSERCANS and Soft-PLC
Institute of Scientific and Technical Information of China (English)
Cunfeng KANG; Chong WANG; Chunmin MA; Xudong HUANG; Renyuan FEI
2009-01-01
Soft-PLC with open architecture is the direction of development in industrial automation. This paper discusses the method of communication between the interface functions of LinSERCANS under RTLinux and the external library of Soft-PLC under Windows. Based on the API HOOK theory, the communication between the interface functions of LinSERCANS and the external libraries of Soft-PLC is set up and it solves the calling functions of dynamic link libraries in different operation systems. It is able to combine LinSERCANS with the Soft-PLC, and a run-time system is developed based on the interface technology of the serial real-time communication system (SERCOS) and technology of soft-PLC. This run-time system has been used in all electronic injection molding machines and works very well.
Worm-hole run-time reconfigurable processor field programmable gate array (FPGA)
1996-01-01
Higher performance is gained through a new architecture which implements a new method of computational resource allocation, utilization and programming based on the concept of Worm-hole Run-Time Reconfiguration (RTR). A stream-driven Worm-hole RTR methodology extends contemporary data-flow paradigms to utilize the dynamic creation of operators and pathways, based upon stream processing in which parcels of data move through custom created pathways and interact with other parcels to achieve the...
Operating Security System Support for Run-Time Security with a Trusted Execution Environment
DEFF Research Database (Denmark)
Gonzalez, Javier
In this thesis we introduce run-time security primitives that enable a number of trusted services in the context of Linux. These primitives mediate any action involving sensitive data or sensitive assets in order to guarantee their integrity and confidentiality. We introduce a general mechanism to protect … in the Linux operating system. We are in the process of making this driver part of the mainline Linux kernel.
Maximizing run time performance of deployed data flow graphs on a multiprocessor architecture
Tobias, Richard J.; Hunt, Peter D.
1993-10-01
This paper discusses a practical solution for supporting the deployment of data flow graphs onto the Loral/Rolm Computer Systems, Inc. vector processing multi-processor architecture. It outlines the support software (both workstation hosted and target system hosted) that is required to design, debug, and maximize deployed data flow graph performance on the multiprocessor architecture. The deployment process guarantees real-time deadlines, minimizes run time scheduling overhead, and minimizes designer partitioning input. It is known that determining effective run time data flow graph node schedules for multi-processor architectures is an NP-complete class of problem not well suited to real-time systems. Loral/Rolm Computer Systems, Inc.'s vector processing toolset recognizes this problem and this paper discusses a prescheduling and pre-assignment approach for partitioning data flow graphs to available hardware resources. In particular the toolset components (which are based upon an enhanced data flow graph language) of workstation pre-assignment, prescheduling, run time gross allocation and local compute element dispatching are discussed in detail.
The comparison of OPC performance and run time for dense versus sparse solutions
Abdo, Amr; Stobert, Ian; Viswanathan, Ramya; Burns, Ryan; Herold, Klaus; Kallingal, Chidam; Meiring, Jason; Oberschmidt, James; Mansfield, Scott
2008-03-01
The lithographic processes and resolution enhancement techniques (RET) needed to achieve pattern fidelity are becoming more complicated as the required critical dimensions (CDs) shrink. For technology nodes with smaller devices and tolerances, more complex models and proximity corrections are needed and these significantly increase the computational requirements. New simulation techniques are required to address these computational challenges. The new simulation technique we focus on in this work is dense optical proximity correction (OPC). Sparse OPC tools typically require a laborious, manual and time consuming OPC optimization approach. In contrast, dense OPC uses pixel-based simulation that does not need as much manual setup. Dense OPC was introduced because sparse simulation methodology causes run times to explode as the pattern density increases, since the number of simulation sites in a given optical radius increases. In this work, we completed a comparison of the OPC modeling performance and run time for the dense and the sparse solutions. The analysis found the computational run time to be highly design dependent. The result should lead to the improvement of the quality and performance of the OPC solution and shed light on the pros and cons of using dense versus sparse solutions. This will help OPC engineers to decide which solution to apply to their particular situation.
GPU-Powered Coherent Beamforming
Magro, Alessio; Hickish, Jack
2014-01-01
GPU-based beamforming is a relatively unexplored area in radio astronomy, possibly due to the assumption that any such system will be severely limited by the PCIe bandwidth required to transfer data to the GPU. We have developed a CUDA-based GPU implementation of a coherent beamformer, specifically designed and optimised for deployment at the BEST-2 array, which can generate an arbitrary number of synthesized beams for a wide range of parameters. It achieves ~1.3 TFLOPs on an NVIDIA Tesla K20, approximately 10x faster than an optimised, multithreaded CPU implementation. This kernel has been integrated into two real-time, GPU-based time-domain software pipelines deployed at the BEST-2 array in Medicina: a standalone beamforming pipeline and a transient detection pipeline. We present performance benchmarks for the beamforming kernel as well as for the transient detection pipeline with beamforming capabilities, and results of a test observation.
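The heart of any coherent beamformer is a weighted complex sum over antennas for every beam and time sample. The CUDA kernel below is a generic illustration of that inner loop, not the BEST-2 pipeline code; the array shapes, one-thread-per-sample mapping and launch configuration are assumptions.

```cuda
#include <cuda_runtime.h>

// One thread per (beam, time sample): coherently sum antenna voltages with
// per-antenna, per-beam complex weights.
__global__ void beamform(const float2* samples,   // [n_ant][n_time]
                         const float2* weights,   // [n_beam][n_ant]
                         float2* beams,           // [n_beam][n_time]
                         int n_ant, int n_time, int n_beam)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int b = blockIdx.y;
    if (t >= n_time || b >= n_beam) return;

    float2 acc = make_float2(0.0f, 0.0f);
    for (int a = 0; a < n_ant; ++a) {
        float2 w = weights[b * n_ant + a];
        float2 s = samples[a * n_time + t];
        acc.x += w.x * s.x - w.y * s.y;            // complex multiply-accumulate
        acc.y += w.x * s.y + w.y * s.x;
    }
    beams[b * n_time + t] = acc;
}

// Possible launch: dim3 grid((n_time + 255) / 256, n_beam);
// beamform<<<grid, 256>>>(d_samples, d_weights, d_beams, n_ant, n_time, n_beam);
```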
Kantor, Jiří
2013-01-01
This work describes the creation of a simple raytracer for OpenSceneGraph that runs on the graphics card. It covers the changes that had to be made in OpenSceneGraph so that data could be passed to the GPU, as well as several methods for finding ray-triangle intersections, which is a key algorithm in raytracing.
GPU applications for data processing
Energy Technology Data Exchange (ETDEWEB)
Vladymyrov, Mykhailo, E-mail: mykhailo.vladymyrov@cern.ch [LPI - Lebedev Physical Institute of the Russian Academy of Sciences, RUS-119991 Moscow (Russian Federation); Aleksandrov, Andrey [LPI - Lebedev Physical Institute of the Russian Academy of Sciences, RUS-119991 Moscow (Russian Federation); INFN sezione di Napoli, I-80125 Napoli (Italy); Tioukov, Valeri [INFN sezione di Napoli, I-80125 Napoli (Italy)
2015-12-31
Modern experiments that use nuclear photoemulsion require fast and efficient data acquisition from the emulsion. The new approaches in developing scanning systems require real-time processing of large amounts of data. Methods that use Graphical Processing Unit (GPU) computing power for emulsion data processing are presented here. It is shown how the GPU-accelerated emulsion processing helped us to raise the scanning speed by a factor of nine.
A Versatile and Efficient GPU Data Structure for Spatial Indexing
Schneider, Jens
2016-08-10
In this paper we present a novel GPU-based data structure for spatial indexing. Based on Fenwick trees—a special type of binary indexed trees—our data structure allows construction in linear time. Updates and prefixes can be computed in logarithmic time, whereas point queries require only constant time on average. Unlike competing data structures such as summed-area tables and spatial hashing, our data structure requires a constant amount of bits for each data element, and it offers unconstrained point queries. This property makes our data structure ideally suited for applications requiring unconstrained indexing of large data, such as block-storage of large and block-sparse volumes. Finally, we provide asymptotic bounds on both run-time and memory requirements, and we show applications for which our new data structure is useful.
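For readers unfamiliar with Fenwick (binary indexed) trees, the logarithmic update and prefix-sum walks look as follows. This is a generic CUDA sketch under simplifying assumptions, not the paper's data structure (which additionally offers constant-time point queries on average and a constant number of bits per element); atomics make concurrent point updates safe, and queries here run only after the update kernel has finished.

```cuda
#include <cuda_runtime.h>
#include <vector>

// 1-based Fenwick tree over n counters, stored in tree[1..n] (tree[0] unused).

__device__ void fenwick_add(unsigned int* tree, int n, int i, unsigned int delta)
{
    for (++i; i <= n; i += i & (-i))       // climb to the parents of element i
        atomicAdd(&tree[i], delta);
}

__device__ unsigned int fenwick_prefix(const unsigned int* tree, int i)
{
    unsigned int s = 0;
    for (++i; i > 0; i -= i & (-i))        // walk towards the root
        s += tree[i];
    return s;                              // sum of elements 0..i
}

__global__ void scatter_updates(unsigned int* tree, int n, const int* idx, int m)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < m) fenwick_add(tree, n, idx[t], 1u);
}

__global__ void query_prefix(const unsigned int* tree, int i, unsigned int* out)
{
    if (threadIdx.x == 0 && blockIdx.x == 0) *out = fenwick_prefix(tree, i);
}

int main()
{
    const int n = 1 << 16, m = 1 << 20;
    std::vector<int> idx(m);
    for (int t = 0; t < m; ++t) idx[t] = t % n;          // toy update pattern

    unsigned int *d_tree, *d_out; int* d_idx;
    cudaMalloc(&d_tree, (n + 1) * sizeof(unsigned int));
    cudaMalloc(&d_idx, m * sizeof(int));
    cudaMalloc(&d_out, sizeof(unsigned int));
    cudaMemset(d_tree, 0, (n + 1) * sizeof(unsigned int));
    cudaMemcpy(d_idx, idx.data(), m * sizeof(int), cudaMemcpyHostToDevice);

    scatter_updates<<<(m + 255) / 256, 256>>>(d_tree, n, d_idx, m);
    query_prefix<<<1, 1>>>(d_tree, n / 2, d_out);        // count of updates at indices <= n/2
    cudaDeviceSynchronize();

    cudaFree(d_tree); cudaFree(d_idx); cudaFree(d_out);
    return 0;
}
```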
Holistic Context-Sensitivity for Run-Time Optimization of Flexible Manufacturing Systems
Directory of Open Access Journals (Sweden)
Sebastian Scholze
2017-02-01
Highly flexible manufacturing systems require continuous run-time (self-) optimization of processes with respect to diverse parameters, e.g., efficiency, availability, energy consumption etc. A promising approach for achieving (self-) optimization in manufacturing systems is the usage of the context sensitivity approach based on data streaming from a high number of sensors and other data sources. Cyber-physical systems play an important role as sources of information to achieve context sensitivity. Cyber-physical systems can be seen as complex intelligent sensors providing data needed to identify the current context under which the manufacturing system is operating. In this paper, it is demonstrated how context sensitivity can be used to realize a holistic solution for (self-) optimization of discrete flexible manufacturing systems, by making use of cyber-physical systems integrated in manufacturing systems/processes. A generic approach for context sensitivity, based on self-learning algorithms, is proposed, aiming at various manufacturing systems. The new solution encompasses a run-time context extractor and optimizer. Based on the self-learning module, both context extraction and optimizer are continuously learning and improving their performance. The solution follows Service Oriented Architecture principles. The generic solution is developed and then applied to two very different manufacturing processes.
Energy Efficient Run-Time Incremental Mapping for 3-D Networks-on-Chip
Institute of Scientific and Technical Information of China (English)
Xiao-Hang Wang; Peng Liu; Mei Yang; Maurizio Palesi; Ying-Tao Jiang; Michael C Huang
2013-01-01
3-D Networks-on-Chip (NoC) emerge as a potent solution to address both the interconnection and design complexity problems facing future Multiprocessor System-on-Chips (MPSoCs). Effective run-time mapping on such 3-D NoC-based MPSoCs can be quite challenging, as the arrival order and task graphs of the target applications are typically not known a priori, which can be further complicated by stringent energy requirements for NoC systems. This paper thus presents an energy-aware run-time incremental mapping algorithm (ERIM) for 3-D NoC which can minimize the energy consumption due to the data communications among processor cores, while reducing the fragmentation effect on the incoming applications to be mapped, and simultaneously satisfying the thermal constraints imposed on each incoming application. Specifically, incoming applications are mapped to cuboid tile regions for lower energy consumption of communication and the minimal routing. Fragment tiles due to system fragmentation can be gleaned for better resource utilization. Extensive experiments have been conducted to evaluate the performance of the proposed algorithm ERIM, and the results are compared against the optimal mapping algorithm (branch-and-bound) and two heuristic algorithms (TB and TL). The experiments show that ERIM outperforms TB and TL methods with significant energy saving (more than 10%), much reduced average response time, and improved system utilization.
Holistic Context-Sensitivity for Run-Time Optimization of Flexible Manufacturing Systems.
Scholze, Sebastian; Barata, Jose; Stokic, Dragan
2017-02-24
Highly flexible manufacturing systems require continuous run-time (self-) optimization of processes with respect to diverse parameters, e.g., efficiency, availability, energy consumption etc. A promising approach for achieving (self-) optimization in manufacturing systems is the usage of the context sensitivity approach based on data streaming from a high number of sensors and other data sources. Cyber-physical systems play an important role as sources of information to achieve context sensitivity. Cyber-physical systems can be seen as complex intelligent sensors providing data needed to identify the current context under which the manufacturing system is operating. In this paper, it is demonstrated how context sensitivity can be used to realize a holistic solution for (self-) optimization of discrete flexible manufacturing systems, by making use of cyber-physical systems integrated in manufacturing systems/processes. A generic approach for context sensitivity, based on self-learning algorithms, is proposed, aiming at various manufacturing systems. The new solution encompasses a run-time context extractor and optimizer. Based on the self-learning module, both context extraction and optimizer are continuously learning and improving their performance. The solution follows Service Oriented Architecture principles. The generic solution is developed and then applied to two very different manufacturing processes.
Jeong, Wonki
2011-01-01
This chapter introduces a GPU-accelerated interactive, semiautomatic axon segmentation and visualization system. Two challenging problems have been addressed: the interactive 3D axon segmentation and the interactive 3D image filtering and rendering of implicit surfaces. The reconstruction of neural connections to understand the function of the brain is an emerging and active research area in neuroscience. With the advent of high-resolution scanning technologies, such as 3D light microscopy and electron microscopy (EM), reconstruction of complex 3D neural circuits from large volumes of neural tissues has become feasible. Among them, only EM data can provide sufficient resolution to identify synapses and to resolve extremely narrow neural processes. These high-resolution, large-scale datasets pose challenging problems, for example, how to process and manipulate large datasets to extract scientifically meaningful information using a compact representation in a reasonable processing time. The running time of the multiphase level set segmentation method has been measured on the CPU and GPU. The CPU version is implemented using the ITK image class and the ITK distance transform filter. The numerical part of the CPU implementation is similar to the GPU implementation for fair comparison. The main focus of this chapter is introducing the GPU algorithms and their implementation details, which are the core components of the interactive segmentation and visualization system.
Randomized selection on the GPU
Energy Technology Data Exchange (ETDEWEB)
Monroe, Laura Marie [Los Alamos National Laboratory; Wendelberger, Joanne R [Los Alamos National Laboratory; Michalak, Sarah E [Los Alamos National Laboratory
2011-01-13
We implement here a fast and memory-sparing probabilistic top N selection algorithm on the GPU. To our knowledge, this is the first direct selection in the literature for the GPU. The algorithm proceeds via a probabilistic guess-and-check process searching for the Nth element. It always gives a correct result and always terminates. The use of randomization reduces the amount of data that needs heavy processing, and so reduces the average time required for the algorithm. Probabilistic Las Vegas algorithms of this kind are a form of stochastic optimization and can be well suited to more general parallel processors with limited amounts of fast memory.
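A hedged sketch of the guess-and-check idea follows; it is not the authors' algorithm (which draws its guesses from a random sample and bounds the retry probability), but it shows the shape of the check: guess a pivot, count how many elements reach it on the GPU, and only hand the small surviving candidate set to exact selection. Written here with Thrust:

```cuda
#include <thrust/device_vector.h>
#include <thrust/count.h>
#include <thrust/copy.h>
#include <vector>
#include <cstdio>
#include <cstdlib>

// Predicate: "is at least the pivot value".
struct at_least {
    float pivot;
    __host__ __device__ bool operator()(float x) const { return x >= pivot; }
};

int main()
{
    const int n = 1 << 22, N = 1000;               // find the N largest of n values
    std::vector<float> h(n);
    for (int i = 0; i < n; ++i) h[i] = (float)std::rand() / RAND_MAX;  // toy input
    thrust::device_vector<float> data(h.begin(), h.end());

    // Guess a pivot and count survivors; if too few, relax the guess and retry.
    float pivot = 0.999f;
    long long count = thrust::count_if(data.begin(), data.end(), at_least{pivot});
    while (count < N) {
        pivot -= 0.001f;
        count = thrust::count_if(data.begin(), data.end(), at_least{pivot});
    }

    // Only the survivors need heavy processing (e.g., an exact sort/selection).
    thrust::device_vector<float> candidates(count);
    thrust::copy_if(data.begin(), data.end(), candidates.begin(), at_least{pivot});
    std::printf("pivot %.4f keeps %lld candidates for exact top-%d selection\n",
                pivot, count, N);
    return 0;
}
```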
Distributed GPU Computing in GIScience
Jiang, Y.; Yang, C.; Huang, Q.; Li, J.; Sun, M.
2013-12-01
Geoscientists strive to discover potential principles and patterns hidden inside ever-growing Big Data for scientific discoveries. To better achieve this objective, more capable computing resources are required to process, analyze and visualize Big Data (Ferreira et al., 2003; Li et al., 2013). Current CPU-based computing techniques cannot promptly meet the computing challenges caused by the increasing amount of datasets from different domains, such as social media, earth observation, environmental sensing (Li et al., 2013). Meanwhile, CPU-based computing resources structured as clusters or supercomputers are costly. In the past several years, as GPU-based technology has matured in both capability and performance, GPU-based computing has emerged as a new computing paradigm. Compared to the traditional CPU microprocessor, the modern GPU, as a compelling alternative microprocessor, has outstanding parallel processing capability with cost-effectiveness and efficiency (Owens et al., 2008), although it was initially designed for graphical rendering in the visualization pipeline. This presentation reports a distributed GPU computing framework for integrating GPU-based computing within a distributed environment. Within this framework, 1) for each single computer, computing resources of both GPU-based and CPU-based can be fully utilized to improve the performance of visualizing and processing Big Data; 2) within a network environment, a variety of computers can be used to build up a virtual super computer to support CPU-based and GPU-based computing in a distributed computing environment; 3) GPUs, as a specific graphic targeted device, are used to greatly improve the rendering efficiency in distributed geo-visualization, especially for 3D/4D visualization. Key words: Geovisualization, GIScience, Spatiotemporal Studies Reference: 1. Ferreira de Oliveira, M. C., & Levkowitz, H. (2003). From visual data exploration to visual data mining: A survey. Visualization and Computer Graphics, IEEE
Supporting Multiprocessors in the Icecap Safety-Critical Java Run-Time Environment
DEFF Research Database (Denmark)
Zhao, Shuai; Wellings, Andy; Korsholm, Stephan Erbs
2015-01-01
The current version of the Safety Critical Java (SCJ) specification defines three compliance levels. Level 0 targets single processor programs while Level 1 and 2 can support multiprocessor platforms. Level 1 programs must be fully partitioned but Level 2 programs can also be more globally scheduled. As of yet, there is no official Reference Implementation for SCJ. However, the icecap project has produced a Safety-Critical Java Run-time Environment based on the Hardware-near Virtual Machine (HVM). This supports SCJ at all compliance levels and provides an implementation of the safety-critical Java (javax.safetycritical) package. This is still work-in-progress and lacks certain key features. Among these is the ability to support multiprocessor platforms. In this paper, we explore two possible options for adding multiprocessor support to this environment: the “green thread” and the “native...
The Trick Simulation Toolkit: A NASA/Opensource Framework for Running Time Based Physics Models
Penn, John M.
2016-01-01
The Trick Simulation Toolkit is a simulation development environment used to create high fidelity training and engineering simulations at the NASA Johnson Space Center and many other NASA facilities. Its purpose is to generate a simulation executable from a collection of user-supplied models and a simulation definition file. For each Trick-based simulation, Trick automatically provides job scheduling, numerical integration, the ability to write and restore human readable checkpoints, data recording, interactive variable manipulation, a run-time interpreter, and many other commonly needed capabilities. This allows simulation developers to concentrate on their domain expertise and the algorithms and equations of their models. Also included in Trick are tools for plotting recorded data and various other supporting utilities and libraries. Trick is written in C/C++ and Java and supports both Linux and MacOSX computer operating systems. This paper describes Trick's design and use at NASA Johnson Space Center.
Radionuclide inventories for short run-time space nuclear reactor systems
Coats, Richard L.
1993-01-01
Space Nuclear Reactor Systems, especially those used for propulsion, often have expected operation run times much shorter than those for land-based nuclear power plants. This produces substantially different radionuclide inventories to be considered in the safety analyses of space nuclear systems. This presentation describes an analysis utilizing ORIGEN2 and DKPOWER to provide comparisons among representative land-based and space systems. These comparisons enable early, conceptual considerations of safety issues and features in the preliminary design phases of operational systems, test facilities, and operations by identifying differences between the requirements for space systems and the established practice for land-based power systems. Early indications are that separation distance is much more effective as a safety measure for space nuclear systems than for power reactors because greater decay of the radionuclide activity occurs during the time to transport the inventory a given distance. In addition, the inventories of long-lived actinides are very low for space reactor systems.
Long-Running-Time (T = 0.45 K) Germanium Bolometer for Far Infrared Spectroscopy
Satoh, Naoki; Tanaka, Yasumoto; Nagasaka, Keigo
1990-01-01
A long-running-time (T = 0.45 K) germanium bolometer which has a compact charcoal adsorption pump with a novel ³He gas condenser has been constructed. The refrigerator provides continuous cooling of the bolometer element at 0.45 K for 24-hour measurements of spectra in the range 2 to 40 cm⁻¹. Utilizing this bolometer system, transmission spectroscopy has been carried out successively, maintaining the temperature of the sample below 40 K and that of the bolometer element below 1.5 K without a thermal cycle. This experimental setup is essential for obtaining a reproducible spectrum of MEM(TCNQ)2. Thus, each resultant spectrum has good reproducibility even after one-week-long experiments.
An efficient UNICOS run-time system for functional programs on a CRAY X-MP
Energy Technology Data Exchange (ETDEWEB)
Hammes, J.P.; Yantis, B.; Michelsen, R.; Fasel, J.H. III; Fasel, P.K. (Los Alamos National Lab., NM (USA))
1989-01-01
The ParaGraph (Parallel Graph reduction) project at Los Alamos National Laboratory is an investigation of the use of functional languages for scientific applications; efficient implementation strategies are a primary concern. The initial version of the ParaGraph run-time system, ParaGraph-RTS 1.0, was not targeted for the X-MP architecture specifically. We examine the new Cray X-MP UNICOS single processor implementation, ParaGraph-RTS 2.0, and describe the CRAY specific optimizations employed. Optimizations include novel use of the CRAY X-MP register set and the generation of optimized assembly code. Performance improvements over ParaGraph-RTS 1.0 have been significant; initial measurements are presented. 8 refs., 4 figs.
Cadence selection affects metabolic responses during cycling and subsequent running time to fatigue.
Vercruyssen, F; Suriano, R; Bishop, D; Hausswirth, C; Brisswalter, J
2005-05-01
To investigate the effect of cadence selection during the final minutes of cycling on metabolic responses, stride pattern, and subsequent running time to fatigue. Eight triathletes performed, in a laboratory setting, two incremental tests (running and cycling) to determine peak oxygen uptake (VO2PEAK) and the lactate threshold (LT), and three cycle-run combinations. During the cycle-run sessions, subjects completed a 30 minute cycling bout (90% of LT) at (a) the freely chosen cadence (FCC, 94 (5) rpm), (b) the FCC during the first 20 minutes and FCC-20% during the last 10 minutes (FCC-20%, 74 (3) rpm), or (c) the FCC during the first 20 minutes and FCC+20% during the last 10 minutes (FCC+20%, 109 (5) rpm). After each cycling bout, running time to fatigue (Tmax) was determined at 85% of maximal velocity. A significant increase in Tmax was found after FCC-20% (894 (199) seconds) compared with FCC and FCC+20% (651 (212) and 624 (214) seconds respectively). VO2, ventilation, heart rate, and blood lactate concentrations were significantly reduced after 30 minutes of cycling at FCC-20% compared with FCC+20%. A significant increase in VO2 was reported between the 3rd and 10th minute of all Tmax sessions, without any significant differences between sessions. Stride pattern and metabolic variables were not significantly different between Tmax sessions. The increase in Tmax after FCC-20% may be associated with the lower metabolic load during the final minutes of cycling compared with the other sessions. However, the lack of significant differences in metabolic responses and stride pattern between the run sessions suggests that other mechanisms, such as changes in muscular activity, probably contribute to the effects of cadence variation on Tmax.
DEFF Research Database (Denmark)
Kristensen, C.H.
This thesis advocates a formal approach to run-time evaluation of real-time behaviour in distributed process control systems, motivated by a growing interest in applying the increasingly popular formal methods in the application area of distributed process control systems. We propose to evaluate the various models underlying every formal method by declaring the design assumptions as a number of features or constraints, stated in the formal specification of system requirements, to be evaluated at run-time. It is assumed that if these constraints are fulfilled at run-time then it is fair to have a higher confidence in the system behaviour. We have proposed a combination of formal methods and supplemental fault-detection techniques which we call the Complementary Run-Time Evaluation Model. The basic idea in this model is to use the means of verification given by formal methods, to prove...
Considerations for GPU SEE Testing
Wyrwas, Edward J.
2017-01-01
This presentation will discuss the considerations an engineer should take to perform Single Event Effects (SEE) testing on GPU devices. Notable topics will include setup complexity, architecture insight which permits cross platform normalization, acquiring a reasonable detail of information from the test suite, and a few lessons learned from preliminary testing.
Travel Software using GPU Hardware
Szalwinski, Chris M; Dimov, Veliko Atanasov; CERN. Geneva. ATS Department
2015-01-01
Travel is the main multi-particle tracking code being used at CERN for beam dynamics calculations through hadron and ion linear accelerators. It uses two routines for the calculation of space charge forces, namely, rings of charges and point-to-point. This report presents studies to improve the performance of Travel using GPU hardware. The studies showed that the performance of Travel with the point-to-point simulation of space-charge effects can be sped up at least 72 times using current GPU hardware. Simple recompilation of the source code using an Intel compiler can improve performance at least 4 times without GPU support. The limited memory of the GPU is the bottleneck. Two algorithms were investigated on this point: repeated computation and tiling. The repeated-computation algorithm is simpler and is the currently recommended solution. The tiling algorithm was more complicated and degraded performance. Both build and test instructions for the parallelized version of the software are inclu...
GASPRNG: GPU accelerated scalable parallel random number generator library
Gao, Shuang; Peterson, Gregory D.
2013-04-01
Computer: workstation with NVIDIA GPU (Tested on Fermi GTX480, Tesla C1060, Tesla M2070). Operating system: Linux with CUDA version 4.0 or later. Should also run on MacOS, Windows, or UNIX. Has the code been vectorized or parallelized?: Yes. Parallelized using MPI directives. RAM: 512 MB to 732 MB (main memory on host CPU, depending on the data type of random numbers) / 512 MB (GPU global memory). Classification: 4.13, 6.5. Nature of problem: Many computational science applications are able to consume large numbers of random numbers. For example, Monte Carlo simulations are able to consume limitless random numbers for the computation as long as resources for the computing are supported. Moreover, parallel computational science applications require independent streams of random numbers to attain statistically significant results. The SPRNG library provides this capability, but at a significant computational cost. The GASPRNG library presented here accelerates the generators of independent streams of random numbers using graphical processing units (GPUs). Solution method: Multiple copies of random number generators in GPUs allow a computational science application to consume large numbers of random numbers from independent, parallel streams. GASPRNG is a random number generator library that allows a computational science application to employ multiple copies of random number generators to boost performance. Users can interface GASPRNG with software code executing on microprocessors and/or GPUs. Running time: The tests provided take a few minutes to run.
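The record does not reproduce GASPRNG's interface, so the sketch below illustrates only the underlying idea, independent per-thread random streams feeding a Monte Carlo consumer, using NVIDIA's stock cuRAND device API rather than GASPRNG itself; the π-estimation kernels and all identifiers are illustrative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Each thread owns one generator state; distinct 'sequence' arguments give
// statistically independent streams derived from the same seed.
__global__ void init_states(curandState* states, unsigned long long seed) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    curand_init(seed, /*sequence=*/tid, /*offset=*/0, &states[tid]);
}

// Simple Monte Carlo consumer: estimate pi by sampling points in the unit square.
__global__ void sample_pi(curandState* states, int samples_per_thread,
                          unsigned long long* d_hits) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    curandState local = states[tid];          // work on a register copy
    unsigned long long hits = 0;
    for (int i = 0; i < samples_per_thread; ++i) {
        float x = curand_uniform(&local);
        float y = curand_uniform(&local);
        if (x * x + y * y <= 1.0f) ++hits;
    }
    states[tid] = local;                      // save state for later reuse
    atomicAdd(d_hits, hits);
}

int main() {
    const int threads = 256, blocks = 64, per_thread = 10000;
    curandState* states;
    unsigned long long *d_hits, h_hits = 0;
    cudaMalloc(&states, threads * blocks * sizeof(curandState));
    cudaMalloc(&d_hits, sizeof(unsigned long long));
    cudaMemcpy(d_hits, &h_hits, sizeof(h_hits), cudaMemcpyHostToDevice);

    init_states<<<blocks, threads>>>(states, 1234ULL);
    sample_pi<<<blocks, threads>>>(states, per_thread, d_hits);
    cudaMemcpy(&h_hits, d_hits, sizeof(h_hits), cudaMemcpyDeviceToHost);

    double total = (double)threads * blocks * per_thread;
    printf("pi ~= %f\n", 4.0 * h_hits / total);
    cudaFree(states); cudaFree(d_hits);
    return 0;
}
```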
Estimation Accuracy on Execution Time of Run-Time Tasks in a Heterogeneous Distributed Environment
Directory of Open Access Journals (Sweden)
Qi Liu
2016-08-01
Distributed Computing has achieved tremendous development since cloud computing was proposed in 2006, and played a vital role promoting rapid growth of data collecting and analysis models, e.g., Internet of things, Cyber-Physical Systems, Big Data Analytics, etc. Hadoop has become a data convergence platform for sensor networks. As one of the core components, MapReduce facilitates allocating, processing and mining of collected large-scale data, where speculative execution strategies help solve straggler problems. However, there is still no efficient solution for accurate estimation on execution time of run-time tasks, which can affect task allocation and distribution in MapReduce. In this paper, task execution data have been collected and employed for the estimation. A two-phase regression (TPR) method is proposed to predict the finishing time of each task accurately. Detailed data of each task have drawn interests with detailed analysis report being made. According to the results, the prediction accuracy of concurrent tasks' execution time can be improved, in particular for some regular jobs.
Estimation Accuracy on Execution Time of Run-Time Tasks in a Heterogeneous Distributed Environment
Liu, Qi; Cai, Weidong; Jin, Dandan; Shen, Jian; Fu, Zhangjie; Liu, Xiaodong; Linge, Nigel
2016-01-01
Distributed Computing has achieved tremendous development since cloud computing was proposed in 2006, and played a vital role promoting rapid growth of data collecting and analysis models, e.g., Internet of things, Cyber-Physical Systems, Big Data Analytics, etc. Hadoop has become a data convergence platform for sensor networks. As one of the core components, MapReduce facilitates allocating, processing and mining of collected large-scale data, where speculative execution strategies help solve straggler problems. However, there is still no efficient solution for accurate estimation on execution time of run-time tasks, which can affect task allocation and distribution in MapReduce. In this paper, task execution data have been collected and employed for the estimation. A two-phase regression (TPR) method is proposed to predict the finishing time of each task accurately. Detailed data of each task have drawn interests with detailed analysis report being made. According to the results, the prediction accuracy of concurrent tasks’ execution time can be improved, in particular for some regular jobs. PMID:27589753
Estimation Accuracy on Execution Time of Run-Time Tasks in a Heterogeneous Distributed Environment.
Liu, Qi; Cai, Weidong; Jin, Dandan; Shen, Jian; Fu, Zhangjie; Liu, Xiaodong; Linge, Nigel
2016-08-30
Distributed Computing has achieved tremendous development since cloud computing was proposed in 2006, and played a vital role promoting rapid growth of data collecting and analysis models, e.g., Internet of things, Cyber-Physical Systems, Big Data Analytics, etc. Hadoop has become a data convergence platform for sensor networks. As one of the core components, MapReduce facilitates allocating, processing and mining of collected large-scale data, where speculative execution strategies help solve straggler problems. However, there is still no efficient solution for accurate estimation on execution time of run-time tasks, which can affect task allocation and distribution in MapReduce. In this paper, task execution data have been collected and employed for the estimation. A two-phase regression (TPR) method is proposed to predict the finishing time of each task accurately. Detailed data of each task have drawn interests with detailed analysis report being made. According to the results, the prediction accuracy of concurrent tasks' execution time can be improved, in particular for some regular jobs.
A Flexible System Level Design Methodology Targeting Run-Time Reconfigurable FPGAs
Directory of Open Access Journals (Sweden)
Houzet Dominique
2007-01-01
Reconfigurable computing is certainly one of the most important emerging research topics on digital processing architectures over the last few years. The introduction of run-time reconfiguration (RTR) on FPGAs requires appropriate design flows and methodologies to fully exploit this new functionality. For that purpose, we present an automatic design generation methodology for heterogeneous architectures based on DSPs and FPGAs that eases and speeds RTR implementation. We focus on how to take into account specificities of partially reconfigurable components from a high-level specification during the design generation steps. This method automatically generates designs for both fixed and partially reconfigurable parts of an FPGA with automatic management of the reconfiguration process. Furthermore, this automatic design generation enables a reconfiguration prefetching technique to minimize reconfiguration latency and buffer-merging techniques to minimize memory requirements of the generated design. This concept has been applied to different wireless access schemes, based on a combination of OFDM and CDMA techniques. This implementation example illustrates the benefits of the proposed design methodology.
Determination of production run time and warranty length under system maintenance and trade credits
Tsao, Yu-Chung
2012-12-01
Manufacturers offer a warranty period within which they will fix failed products at no cost to customers. Manufacturers also perform system maintenance when a system is in an out-of-control state. Suppliers provide a credit period to settle the payment to manufacturers. This study considers a manufacturer's production and warranty decisions for an imperfect production system under system maintenance and trade credit. Specifically, this study uses the economic production quantity to model the decisions under system maintenance and trade credit. These decisions involve how long the production run time and warranty length should be to maximise total profit. This study provides lemmas for the conditions of optimality and develops a theorem and an algorithm for solving the problems described. Numerical examples illustrate the solution procedures and provide a variety of managerial implications. Results show that simultaneously determining production and warranty decisions is superior to determining the production decision alone. This study also discusses the effects of the related parameters on the manufacturer's decisions and profits. The results of this study are a useful reference for managerial decision-making and administration.
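For context, the classical economic production quantity that the study builds on gives the textbook baseline below, with demand rate D, production rate P > D, setup cost K, and unit holding cost h; the paper's actual model extends this with warranty length, maintenance and trade-credit terms, which are not reproduced here.

```latex
% Classical EPQ baseline (textbook result; not the paper's extended model).
\[
  Q^{*} \;=\; \sqrt{\frac{2 K D}{h\,\bigl(1 - D/P\bigr)}},
  \qquad
  t^{*} \;=\; \frac{Q^{*}}{P},
\]
% where t^{*} is the corresponding production-run time per cycle.
```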
An MPSoC-Based QAM Modulation Architecture with Run-Time Load-Balancing
Directory of Open Access Journals (Sweden)
Doumenis Demosthenes
2011-01-01
QAM is a widely used multilevel modulation technique, with a variety of applications in data radio communication systems. Most existing implementations of QAM-based systems use high levels of modulation in order to meet the high data rate constraints of emerging applications. This work presents the architecture of a highly parallel QAM modulator, using MPSoC-based design flow and design methodology, which offers multirate modulation. The proposed MPSoC architecture is modular and provides dynamic reconfiguration of the QAM utilizing on-chip interconnection networks, offering high data rates (more than 1 Gbps), even at low modulation levels (16-QAM). Furthermore, the proposed QAM implementation integrates a hardware-based resource allocation algorithm that can provide better throughput and fault tolerance, depending on the on-chip interconnection network congestion and run-time faults. Preliminary results from this work have been published in the Proceedings of the 18th IEEE/IFIP International Conference on VLSI and System-on-Chip (VLSI-SoC 2010). The current version of the work includes a detailed description of the proposed system architecture, extends the results significantly using more test cases, and investigates the impact of various design parameters. Furthermore, this work investigates the use of the hardware resource allocation algorithm as a graceful degradation mechanism, providing simulation results about the performance of the QAM in the presence of faulty components.
The optimal production-run time for a stock-dependent imperfect production process
Directory of Open Access Journals (Sweden)
Jain Divya
2013-01-01
This paper develops an inventory model for a hypothesized volume flexible manufacturing system in which the production rate is stock-dependent and the system produces both perfect and imperfect quality items. The demand rate of perfect quality items is known and constant, whereas the demand rate of imperfect (non-conforming to specifications) quality items is a function of the discount offered in the selling price. In this paper, we determine an optimal production-run time and the optimal discount that should be offered in the selling price to influence the sale of imperfect quality items produced by the manufacturing system. The considered model aims to maximize the net profit obtained through the sales of both perfect and imperfect quality items subject to certain constraints of the system. The solution procedure suggests the use of the ‘Interior Penalty Function Method’ to solve the associated constrained maximization problem. Finally, a numerical example demonstrating the applicability of the proposed model has been included.
Parallelization of MODFLOW using a GPU library.
Ji, Xiaohui; Li, Dandan; Cheng, Tangpei; Wang, Xu-Sheng; Wang, Qun
2014-01-01
A new method based on a graphics processing unit (GPU) library is proposed in the paper to parallelize MODFLOW. Two programs, GetAb_CG and CG_GPU, have been developed to reorganize the equations in MODFLOW and solve them with the GPU library. Experimental tests using the NVIDIA Tesla C1060 show that a 1.6- to 10.6-fold speedup can be achieved for models with more than 10⁵ cells. The efficiency can be further improved by using up-to-date GPU devices.
GPU PRO 3 Advanced rendering techniques
Engel, Wolfgang
2012-01-01
GPU Pro3, the third volume in the GPU Pro book series, offers practical tips and techniques for creating real-time graphics that are useful to beginners and seasoned game and graphics programmers alike. Section editors Wolfgang Engel, Christopher Oat, Carsten Dachsbacher, Wessam Bahnassi, and Sebastien St-Laurent have once again brought together a high-quality collection of cutting-edge techniques for advanced GPU programming. With contributions by more than 50 experts, GPU Pro3: Advanced Rendering Techniques covers battle-tested tips and tricks for creating interesting geometry, realistic sha
Cosmological Calculations on the GPU
Bard, Deborah; Allen, Mark T; Yepremyan, Hasmik; Kratochvil, Jan M
2012-01-01
Cosmological measurements require the calculation of nontrivial quantities over large datasets. The next generation of survey telescopes (such as DES, PanSTARRS, and LSST) will yield measurements of billions of galaxies. The scale of these datasets, and the nature of the calculations involved, make cosmological calculations ideal models for implementation on graphics processing units (GPUs). We consider two cosmological calculations, the two-point angular correlation function and the aperture mass statistic, and aim to improve the calculation time by constructing code for calculating them on the GPU. Using CUDA, we implement the two algorithms on the GPU and compare the calculation speeds to comparable code run on the CPU. We obtain a speed-up of between 10 and 180 times compared to performing the same calculation on the CPU. The code has been made publicly available.
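The released code is not reproduced in the record, so the kernel below is only a hedged sketch of the brute-force core of a two-point angular correlation calculation: each thread takes one galaxy and bins its angular separations to all later galaxies into a shared histogram. The estimator itself (e.g., the Landy–Szalay combination of DD, DR and RR counts) would be formed on the host afterwards; names and the histogram layout are illustrative.

```cuda
#include <cmath>
#include <cuda_runtime.h>

// Angular separation (radians) between two points given as (RA, Dec) in radians.
__device__ float ang_sep(float ra1, float dec1, float ra2, float dec2) {
    float s = sinf(dec1) * sinf(dec2) +
              cosf(dec1) * cosf(dec2) * cosf(ra1 - ra2);
    return acosf(fminf(fmaxf(s, -1.0f), 1.0f));   // clamp for numerical safety
}

// Brute-force DD pair counting: thread i loops over all j > i and bins the
// angular separation into a global histogram of 'nbins' bins up to 'max_sep'.
__global__ void pair_count(const float* ra, const float* dec, int n,
                           unsigned long long* hist, int nbins, float max_sep) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float ra_i = ra[i], dec_i = dec[i];
    for (int j = i + 1; j < n; ++j) {
        float t = ang_sep(ra_i, dec_i, ra[j], dec[j]);
        if (t < max_sep) {
            int bin = (int)(t / max_sep * nbins);
            bin = min(bin, nbins - 1);            // guard against rounding
            atomicAdd(&hist[bin], 1ULL);
        }
    }
}
```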
Mehrenberger, M; Marradi, L; Crouseilles, N; Sonnendrucker, E; Afeyan, B
2013-01-01
This work concerns the numerical simulation of the Vlasov-Poisson set of equations using semi-Lagrangian methods on Graphical Processing Units (GPU). To accomplish this goal, modifications to traditional methods had to be implemented. First and foremost, a reformulation of semi-Lagrangian methods is performed, which enables us to rewrite the governing equations as a circulant matrix operating on the vector of unknowns. This product calculation can be performed efficiently using FFT routines. Second, to overcome the limitation of single precision inherent in GPU, a δf type method is adopted which only needs refinement in specialized areas of phase space but not throughout. Thus, a GPU Vlasov-Poisson solver can indeed perform high precision simulations (since it uses very high order reconstruction methods and a large number of grid points in phase space). We show results for rather academic test cases on Landau damping and also for physically relevant phenomena such as the bump on tail instability and t...
GPU Accelerated Vector Median Filter
Aras, Rifat; Shen, Yuzhong
2011-01-01
Noise reduction is an important step for most image processing tasks. For three-channel color images, a widely used technique is the vector median filter, in which color values of pixels are treated as 3-component vectors. Vector median filters are computationally expensive; for a window size of n x n, each of the n² vectors has to be compared with the other n² - 1 vectors in distances. General purpose computation on graphics processing units (GPUs) is the paradigm of utilizing high-performance many-core GPU architectures for computation tasks that are normally handled by CPUs. In this work, NVIDIA's Compute Unified Device Architecture (CUDA) paradigm is used to accelerate vector median filtering, which has, to the best of our knowledge, never been done before. The performance of the GPU accelerated vector median filter is compared to that of the CPU and MPI-based versions for different image and window sizes. Initial findings of the study showed a 100x improvement of performance of the vector median filter implementation on GPUs over CPU implementations, and further speed-up is expected after more extensive optimizations of the GPU algorithm.
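As a concrete illustration of the filter described above (not the authors' code), the kernel below assigns one thread per pixel, gathers the window, and outputs the window vector with the smallest summed distance to all others; the L1 distance is used here for simplicity, and the fixed window bound is an assumption of the sketch.

```cuda
#include <cuda_runtime.h>

// Vector median filter for an RGB image stored as interleaved uchar3 pixels.
// For each pixel, the output is the window vector whose summed L1 distance to
// all other vectors in the (2r+1)x(2r+1) window is minimal.
__global__ void vector_median(const uchar3* in, uchar3* out,
                              int width, int height, int r) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    const int MAX_W = 7 * 7;           // this sketch supports radius r <= 3
    if (r > 3) return;
    uchar3 win[MAX_W];
    int count = 0;
    for (int dy = -r; dy <= r; ++dy)
        for (int dx = -r; dx <= r; ++dx) {
            int xi = min(max(x + dx, 0), width - 1);   // clamp at borders
            int yi = min(max(y + dy, 0), height - 1);
            win[count++] = in[yi * width + xi];
        }

    int best = 0;
    unsigned int best_sum = 0xFFFFFFFFu;
    for (int i = 0; i < count; ++i) {
        unsigned int sum = 0;
        for (int j = 0; j < count; ++j) {
            sum += abs((int)win[i].x - (int)win[j].x)
                 + abs((int)win[i].y - (int)win[j].y)
                 + abs((int)win[i].z - (int)win[j].z);
        }
        if (sum < best_sum) { best_sum = sum; best = i; }
    }
    out[y * width + x] = win[best];
}
```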
DEFF Research Database (Denmark)
Debois, Søren; Hildebrandt, Thomas; Slaats, Tijs
2015-01-01
We study modularity, run-time adaptation and refinement under safety and liveness constraints in event-based process models with dynamic sub-process instantiation. The study is part of a larger programme to provide semantically well-founded technologies for modelling, implementation and verification of flexible, run-time adaptable process-aware information systems, moved into practice via the Dynamic Condition Response (DCR) Graphs notation co-developed with our industrial partner. Our key contributions are: (1) a formal theory of dynamic sub-process instantiation for declarative, event... ...responding to DCR Graphs (without dynamic sub-process instantiation) characterises exactly the languages that are the union of a regular and an omega-regular language; (3) a formalisation of run-time refinement and adaptation by composition for DCR* processes and a proof that such refinement is undecidable...
DEFF Research Database (Denmark)
Debois, Søren; Hildebrandt, Thomas; Slaats, Tijs
2015-01-01
...and verification of flexible, run-time adaptable process-aware information systems, moved into practice via the Dynamic Condition Response (DCR) Graphs notation co-developed with our industrial partner. Our key contributions are: (1) a formal theory of dynamic sub-process instantiation for declarative, event... ...in general; and finally (4) a decidable and practically useful sub-class of run-time refinements. Our results are illustrated by a running example inspired by a recent Electronic Case Management solution based on DCR Graphs and delivered by our industrial partner. An online prototype implementation...
Developing a run-time coupling between ESP-r and TRNSYS
Jost, Romain
Rigorous modeling is essential to design buildings and deliver the next advances in energy efficiency and on-site renewable energy production. A great variety of energy simulation programs exists, but they are, for the most part, specialized in one particular domain and do not allow a complete analysis. Because all domains (heating, cooling, ventilation, lighting, acoustics) are interconnected and no existing global simulation environment covers all of the system particularities with the same flexibility, it is often appropriate to proceed with software combination and/or coupling. This Master thesis describes the implementation of a run-time coupling between TRNSYS and ESP-r. In order to minimize the modifications to the source codes and create a tool able to support future development of each program, new components that receive and pass data to the other program were implemented in the two software programs. A multi-DLL structure enables the coupling and exchange of information. A third piece of software, the Harmonizer, launches the TRNSYS and ESP-r DLLs and manages the exchange of data. It is also responsible for convergence handling and ensures that both programs march through time together, time step after time step. A new category of components, the Data Exchanger Types, was implemented in TRNSYS. These components can work as standard TRNSYS Types and exchange data through their inputs and outputs, but they can also force the solver to continue iterating. This capability is essential to force TRNSYS to do more calculations at a specific time step when it has converged but co-simulation convergence requires more iterations. A component of this new category, Type 130, was created specifically for the coupling with ESP-r. Type 130 exchanges data with the Harmonizer on one side and with the TRNSYS network of Types on the other side. Testing of basic data exchange validates the data exchange method and the coupling. The co-simulator is able to
GPU computing of compressible flow problems by a meshless method with space-filling curves
Ma, Z. H.; Wang, H.; Pu, S. H.
2014-04-01
A graphic processing unit (GPU) implementation of a meshless method for solving compressible flow problems is presented in this paper. A least-squares fit is used to discretize the spatial derivatives of the Euler equations and an upwind scheme is applied to estimate the flux terms. The compute unified device architecture (CUDA) C programming model is employed to efficiently and flexibly port the meshless solver from the CPU to the GPU. Considering the data locality of randomly distributed points, space-filling curves are adopted to renumber the points in order to improve the memory performance. Detailed evaluations are first carried out to assess the accuracy and conservation property of the underlying numerical method. Then the GPU accelerated flow solver is used to solve external steady flows over aerodynamic configurations. Representative results are validated through extensive comparisons with the experimental, finite volume or other available reference solutions. Performance analysis reveals that the running time cost of simulations is significantly reduced while impressive (more than an order of magnitude) speedups are achieved.
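The record mentions space-filling-curve renumbering without giving details; a common way to do this, sketched below with Thrust, is to quantize each point's coordinates, interleave the bits into a 2D Morton (Z-order) code, and sort the point indices by that code so spatially close points become close in memory. This is a generic sketch, not the paper's implementation; all names are illustrative.

```cuda
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/transform.h>
#include <thrust/sort.h>

// Interleave the lower 16 bits of v with zeros (helper for 2D Morton codes).
__host__ __device__ unsigned int expand_bits(unsigned int v) {
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

// 2D Morton (Z-order) code from coordinates normalised to [0,1).
__host__ __device__ unsigned int morton2d(float x, float y) {
    unsigned int xi = (unsigned int)(x * 65535.0f);
    unsigned int yi = (unsigned int)(y * 65535.0f);
    return (expand_bits(yi) << 1) | expand_bits(xi);
}

struct MortonKey {
    const float *x, *y;
    MortonKey(const float* x_, const float* y_) : x(x_), y(y_) {}
    __device__ unsigned int operator()(int i) const { return morton2d(x[i], y[i]); }
};

// Produce a permutation 'order' that renumbers points along the Z-order curve.
void zorder_renumber(const thrust::device_vector<float>& x,
                     const thrust::device_vector<float>& y,
                     thrust::device_vector<int>& order) {
    int n = (int)x.size();
    thrust::device_vector<unsigned int> keys(n);
    order.resize(n);
    thrust::sequence(order.begin(), order.end());
    thrust::transform(order.begin(), order.end(), keys.begin(),
                      MortonKey(thrust::raw_pointer_cast(x.data()),
                                thrust::raw_pointer_cast(y.data())));
    thrust::sort_by_key(keys.begin(), keys.end(), order.begin());
}
```

The resulting permutation can then be used to reorder the point arrays once before the flow solver runs, so neighbouring points are likely to share cache lines during the least-squares stencil evaluation.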
Franschman, G.; Verburg, N.; Brens-Heldens, V.; Andriessen, T. M. J. C.; Van der Naalt, J.; Peerdeman, S. M.; Hoogerwerf, N.; Greuters, S.; Schober, P.; Vos, P. E.; Christiaans, H. M. T.; Boer, C.; Valk, J.P.
2012-01-01
Introduction: Prehospital care by physician-based helicopter emergency medical services (P-HEMS) may prolong total prehospital run time. This has raised an issue of debate about the benefits of these services in traumatic brain injury (TBI). We therefore investigated the effects of P-HEMS dispatch o
Franschman, G.; Verburg, N.; Brens-Heldens, V.; Andriessen, T.M.J.C.; Naalt, J. van der; Peerdeman, S.M.; Valk, J.P.M. van der; Hoogerwerf, N.; Greuters, S.; Schober, P.; Vos, P.E.; Christiaans, H.M.; Boer, C.
2012-01-01
INTRODUCTION: Prehospital care by physician-based helicopter emergency medical services (P-HEMS) may prolong total prehospital run time. This has raised an issue of debate about the benefits of these services in traumatic brain injury (TBI). We therefore investigated the effects of P-HEMS dispatch o
Directory of Open Access Journals (Sweden)
Zhiwen Liu
2012-07-01
Performance degradation assessment based on condition monitoring plays an important role in ensuring reliable operation of equipment, reducing production downtime and saving maintenance costs. Yet performance degradation has strong fuzziness, and the dynamic information is random and fuzzy, making it challenging to assess fuzzy bearing performance degradation. This study proposes a monotonic degradation assessment index of rolling bearings using fuzzy support vector data description (FSVDD) and running time. FSVDD constructs the fuzzy-monitoring coefficient, which is sensitive to the initial defect and stably increases as faults develop. Moreover, the parameter describes the accelerating relationships between the damage development and running time. However, an index with an oscillating trend disagrees with the irreversible damage development. The running time is therefore introduced to form a monotonic index, namely the damage severity index (DSI). DSI inherits all the advantages of the fuzzy-monitoring coefficient and overcomes its disadvantage. A run-to-failure test is carried out to validate the performance of the proposed method. The results show that DSI reflects the growth of the damage with running time perfectly.
Pop, V.; Bergveld, H.J.; Notten, P.H.L.; Op het Veld, J.H.G.; Regtien, Paulus P.L.
2008-01-01
This paper describes the various error sources in a real-time State-of-Charge (SoC) evaluation system and their effects on the overall accuracy in the calculation of the remaining run-time of a battery-operated system. The SoC algorithm for Li-ion batteries studied in this paper combines direct meas
Gouw, C.P.T. de; Boer, F.S. de; Johnsen, E.B.; Kohn, A.; Wong, P.Y.H.
2014-01-01
Run-time assertion checking is one of the useful techniques for detecting faults, and can be applied during any program execution context, including debugging, testing, and production. In general, however, it is limited to checking state-based properties. We introduce SAGA, a general framework that
On the Interplay between the Semantics of Java's Finally Clauses and the JML Run-Time Checker
Huisman, M.; Banerjee, A.
2009-01-01
This paper discusses how a subtle interaction between the semantics of Java and the implementation of the JML run-time checker can cause the latter to fail to report errors. This problem is due to the well-known capability of finally clauses to implicitly override exceptions. We give some simple ex
Pop, V.; Bergveld, H.J.; Notten, P.H.L.; Op het Veld, J.H.G.; Regtien, Paulus P.L.
2008-01-01
This paper describes the various error sources in a real-time State-of-Charge (SoC) evaluation system and their effects on the overall accuracy in the calculation of the remaining run-time of a battery-operated system. The SoC algorithm for Li-ion batteries studied in this paper combines direct
The experience of GPU calculations at Lunarc
Sjöström, Anders; Lindemann, Jonas; Church, Ross
2011-09-01
To meet the ever increasing demand for computational speed and the use of ever larger datasets, multi-GPU installations look very tempting. Lunarc and the Theoretical Astrophysics group at Lund Observatory collaborate on a pilot project to evaluate and utilize multi-GPU architectures for scientific calculations. Starting with a small workshop in 2009, continued investigations eventually led to the procurement of the GPU resource Timaeus, which is a four-node, eight-GPU cluster with two Nvidia M2050 GPU cards per node. The resource is housed within the larger cluster Platon and shares disk, network and system resources with that cluster. The inauguration of Timaeus coincided with the meeting "Computational Physics with GPUs" in November 2010, hosted by the Theoretical Astrophysics group at Lund Observatory. The meeting comprised a two-day workshop on GPU computing and a two-day science meeting on using GPUs as a tool for computational physics research, with a particular focus on astrophysics and computational biology. Today Timaeus is used by research groups from Lund, Stockholm and Luleå in fields ranging from astrophysics to molecular chemistry. We are investigating the use of GPUs with commercial software packages and user-supplied MPI-enabled codes. Looking ahead, Lunarc will be installing a new cluster during the summer of 2011 which will have a small number of GPU-enabled nodes that will enable us to continue working with the combination of parallel codes and GPU computing. It is clear that the combination of GPUs/CPUs is becoming an important part of high performance computing, and here we describe what has been done at Lunarc regarding GPU computations and how we will continue to investigate the new and coming multi-GPU servers and how they can be utilized in our environment.
GPU Implementation for Solving Eigenvalues of a Matrix
Institute of Scientific and Technical Information of China (English)
夏健明; 魏德敏
2008-01-01
A GPU (graphics processing unit) implementation was presented for solving eigenvalues of a matrix. The power method and the QR method based on the GPU are used to solve the largest eigenvalue and all eigenvalues of a given matrix. The computations are compared with those by the CPU, and it is found that the computation accuracy is good, and the running time on the GPU is faster than that on the CPU by a factor of 2.7 to 7.6.
Memory-Scalable GPU Spatial Hierarchy Construction.
Qiming Hou; Xin Sun; Kun Zhou; Lauterbach, C; Manocha, D
2011-04-01
Recent GPU algorithms for constructing spatial hierarchies have achieved promising performance for moderately complex models by using the breadth-first search (BFS) construction order. While being able to exploit the massive parallelism on the GPU, the BFS order also consumes excessive GPU memory, which becomes a serious issue for interactive applications involving very complex models with more than a few million triangles. In this paper, we propose to use the partial breadth-first search (PBFS) construction order to control memory consumption while maximizing performance. We apply the PBFS order to two hierarchy construction algorithms. The first algorithm is for kd-trees that automatically balances between the level of parallelism and intermediate memory usage. With PBFS, peak memory consumption during construction can be efficiently controlled without costly CPU-GPU data transfer. We also develop memory allocation strategies to effectively limit memory fragmentation. The resulting algorithm scales well with GPU memory and constructs kd-trees of models with millions of triangles at interactive rates on GPUs with 1 GB memory. Compared with existing algorithms, our algorithm is an order of magnitude more scalable for a given GPU memory bound. The second algorithm is for out-of-core bounding volume hierarchy (BVH) construction for very large scenes based on the PBFS construction order. At each iteration, all constructed nodes are dumped to the CPU memory, and the GPU memory is freed for the next iteration's use. In this way, the algorithm is able to build trees that are too large to be stored in the GPU memory. Experiments show that our algorithm can construct BVHs for scenes with up to 20 M triangles, several times larger than previous GPU algorithms.
Fast quantum Monte Carlo on a GPU
Lutsyshyn, Y
2013-01-01
We present a scheme for the parallelization of quantum Monte Carlo on graphical processing units, focusing on bosonic systems and variational Monte Carlo. We use asynchronous execution schemes with shared memory persistence, and obtain an excellent acceleration. Compared with single-core execution, GPU-accelerated code runs over 100× faster. The CUDA code is provided along with the package that is necessary to execute variational Monte Carlo for a system representing liquid helium-4. The program was benchmarked on several models of Nvidia GPU, including Fermi GTX560 and M2090, and the latest Kepler architecture K20 GPU. Kepler-specific optimization is discussed.
CMSA: a heterogeneous CPU/GPU computing system for multiple similar RNA/DNA sequence alignment.
Chen, Xi; Wang, Chen; Tang, Shanjiang; Yu, Ce; Zou, Quan
2017-06-24
The multiple sequence alignment (MSA) is a classic and powerful technique for sequence analysis in bioinformatics. With the rapid growth of biological datasets, MSA parallelization becomes necessary to keep its running time at an acceptable level. Although there is a lot of work on MSA problems, the approaches are either insufficient or contain some implicit assumptions that limit the generality of usage. First, the information of users' sequences, including the sizes of datasets and the lengths of sequences, can be of arbitrary values and is generally unknown before submission, which is unfortunately ignored by previous work. Second, the center star strategy is suited for aligning similar sequences. But its first stage, center sequence selection, is highly time-consuming and requires further optimization. Moreover, given the heterogeneous CPU/GPU platform, prior studies consider the MSA parallelization on GPU devices only, making the CPUs idle during the computation. Co-run computation, however, can maximize the utilization of the computing resources by enabling the workload computation on both CPU and GPU simultaneously. This paper presents CMSA, a robust and efficient MSA system for large-scale datasets on the heterogeneous CPU/GPU platform. It performs and optimizes multiple sequence alignment automatically for users' submitted sequences without any assumptions. CMSA adopts the co-run computation model so that both CPU and GPU devices are fully utilized. Moreover, CMSA proposes an improved center star strategy that reduces the time complexity of its center sequence selection process from O(mn²) to O(mn). The experimental results show that CMSA achieves up to an 11× speedup and outperforms the state-of-the-art software. CMSA focuses on the multiple similar RNA/DNA sequence alignment and proposes a novel bitmap based algorithm to improve the center star strategy. We can conclude that harvesting the high performance of modern GPU is a promising approach to
Sailfish: A flexible multi-GPU implementation of the lattice Boltzmann method
Januszewski, M.; Kostur, M.
2014-09-01
We present Sailfish, an open source fluid simulation package implementing the lattice Boltzmann method (LBM) on modern Graphics Processing Units (GPUs) using CUDA/OpenCL. We take a novel approach to GPU code implementation and use run-time code generation techniques and a high level programming language (Python) to achieve state of the art performance, while allowing easy experimentation with different LBM models and tuning for various types of hardware. We discuss the general design principles of the code, scaling to multiple GPUs in a distributed environment, as well as the GPU implementation and optimization of many different LBM models, both single component (BGK, MRT, ELBM) and multicomponent (Shan-Chen, free energy). The paper also presents results of performance benchmarks spanning the last three NVIDIA GPU generations (Tesla, Fermi, Kepler), which we hope will be useful for researchers working with this type of hardware and similar codes. Catalogue identifier: AETA_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AETA_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: GNU Lesser General Public License, version 3 No. of lines in distributed program, including test data, etc.: 225864 No. of bytes in distributed program, including test data, etc.: 46861049 Distribution format: tar.gz Programming language: Python, CUDA C, OpenCL. Computer: Any with an OpenCL or CUDA-compliant GPU. Operating system: No limits (tested on Linux and Mac OS X). RAM: Hundreds of megabytes to tens of gigabytes for typical cases. Classification: 12, 6.5. External routines: PyCUDA/PyOpenCL, Numpy, Mako, ZeroMQ (for multi-GPU simulations), scipy, sympy Nature of problem: GPU-accelerated simulation of single- and multi-component fluid flows. Solution method: A wide range of relaxation models (LBGK, MRT, regularized LB, ELBM, Shan-Chen, free energy, free surface) and boundary conditions within the lattice
Plain Polynomial Arithmetic on GPU
Anisul Haque, Sardar; Moreno Maza, Marc
2012-10-01
As for serial code on CPUs, parallel code on GPUs for dense polynomial arithmetic relies on a combination of asymptotically fast and plain algorithms. Those are employed for data of large and small size, respectively. Parallelizing both types of algorithms is required in order to achieve peak performance. In this paper, we show that plain dense polynomial multiplication can be efficiently parallelized on GPUs. Remarkably, it outperforms (highly optimized) FFT-based multiplication up to degree 2¹², while on CPU the same threshold is usually at 2⁶. We also report on a GPU implementation of the Euclidean Algorithm which is both work-efficient and runs in linear time for input polynomials up to degree 2¹⁸, thus showing the performance of the GCD algorithm based on systolic arrays.
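A minimal CUDA sketch of the plain (schoolbook) multiplication discussed above: one thread per output coefficient, each accumulating its part of the convolution sum. The paper targets finite-field coefficients and highly tuned kernels; double coefficients are used here only to keep the sketch self-contained.

```cuda
#include <cuda_runtime.h>

// Plain (schoolbook) multiplication of two dense polynomials a and b of
// degrees da and db: one thread per output coefficient,
//   c[k] = sum_{i+j=k} a[i] * b[j],   k = 0 .. da+db.
__global__ void poly_mul_plain(const double* a, int da,
                               const double* b, int db, double* c) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    int dc = da + db;
    if (k > dc) return;
    double acc = 0.0;
    int lo = max(0, k - db);        // smallest i such that j = k - i <= db
    int hi = min(k, da);            // largest  i such that i <= da
    for (int i = lo; i <= hi; ++i)
        acc += a[i] * b[k - i];
    c[k] = acc;
}
```

For small degrees the quadratic work is cheap and every thread follows the same simple loop, which is why the plain kernel can stay ahead of FFT-based multiplication until the crossover degree mentioned above.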
CULA: hybrid GPU accelerated linear algebra routines
Humphrey, John R.; Price, Daniel K.; Spagnoli, Kyle E.; Paolini, Aaron L.; Kelmelis, Eric J.
2010-04-01
The modern graphics processing unit (GPU) found in many standard personal computers is a highly parallel math processor capable of nearly 1 TFLOPS peak throughput at a cost similar to a high-end CPU and an excellent FLOPS/watt ratio. High-level linear algebra operations are computationally intense, often requiring O(N³) operations, and would seem a natural fit for the processing power of the GPU. Our work is on CULA, a GPU accelerated implementation of linear algebra routines. We present results from factorizations such as LU decomposition, singular value decomposition and QR decomposition along with applications like system solution and least squares. The GPU execution model featured by NVIDIA GPUs based on CUDA demands very strong parallelism, requiring between hundreds and thousands of simultaneous operations to achieve high performance. Some constructs from linear algebra map extremely well to the GPU and others map poorly. CPUs, on the other hand, do well at smaller order parallelism and perform acceptably during low-parallelism code segments. Our work addresses this via a hybrid processing model, in which the CPU and GPU work simultaneously to produce results. In many cases, this is accomplished by allowing each platform to do the work it performs most naturally.
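CULA's internal scheduling is not described in the record, so the following is only a generic sketch of the hybrid idea: part of a data-parallel operation is dispatched to the GPU asynchronously while the CPU processes the remainder, and the final copy-back synchronizes the two. The scale_hybrid function, the split fraction, and the toy kernel are illustrative, not CULA's API.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Toy element-wise operation used to illustrate a hybrid CPU/GPU split.
__global__ void scale_gpu(float* data, int n, float alpha) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= alpha;
}

// Split the array: the first 'gpu_n' elements are processed on the GPU while
// the host simultaneously processes the remainder, then results are merged.
// Assumes 0 < gpu_fraction < 1; pinned host memory would be needed for fully
// asynchronous transfers, which this sketch omits for brevity.
void scale_hybrid(std::vector<float>& data, float alpha, float gpu_fraction) {
    int n = (int)data.size();
    int gpu_n = (int)(n * gpu_fraction);

    float* d_part = nullptr;
    cudaMalloc(&d_part, gpu_n * sizeof(float));
    cudaMemcpyAsync(d_part, data.data(), gpu_n * sizeof(float),
                    cudaMemcpyHostToDevice);
    scale_gpu<<<(gpu_n + 255) / 256, 256>>>(d_part, gpu_n, alpha);  // async launch

    // CPU works on its share while the GPU kernel runs.
    for (int i = gpu_n; i < n; ++i) data[i] *= alpha;

    cudaMemcpy(data.data(), d_part, gpu_n * sizeof(float),
               cudaMemcpyDeviceToHost);   // stream order makes this wait for the kernel
    cudaFree(d_part);
}
```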
Reducing the worst case running times of a family of RNA and CFG problems, using Valiant's approach
Directory of Open Access Journals (Sweden)
Ziv-Ukelson Michal
2011-08-01
Background: RNA secondary structure prediction is a mainstream bioinformatic domain, and is key to computational analysis of functional RNA. In more than 30 years, much research has been devoted to defining different variants of RNA structure prediction problems, and to developing techniques for improving prediction quality. Nevertheless, most of the algorithms in this field follow a similar dynamic programming approach as that presented by Nussinov and Jacobson in the late 1970s, which typically yields cubic worst case running time algorithms. Recently, some algorithmic approaches were applied to improve the complexity of these algorithms, motivated by new discoveries in the RNA domain and by the need to efficiently analyze the increasing amount of accumulated genome-wide data. Results: We study Valiant's classical algorithm for Context Free Grammar recognition in sub-cubic time, and extract features that are common to problems on which Valiant's approach can be applied. Based on this, we describe several problem templates, and formulate generic algorithms that use Valiant's technique and can be applied to all problems which abide by these templates, including many problems within the world of RNA Secondary Structures and Context Free Grammars. Conclusions: The algorithms presented in this paper improve the theoretical asymptotic worst case running time bounds for a large family of important problems. It is also possible that the suggested techniques could be applied to yield a practical speedup for these problems. For some of the problems (such as computing the RNA partition function and base-pair binding probabilities), the presented techniques are the only ones which are currently known for reducing the asymptotic running time bounds of the standard algorithms.
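For context on the kind of bound involved (a standard result, not taken from the record above): Valiant's technique reduces context-free recognition to Boolean matrix multiplication, replacing the cubic dynamic-programming bound.

```latex
% Standard complexity bounds for context-free recognition on input length n
% (classical CKY vs. Valiant's reduction to matrix multiplication).
\[
  T_{\mathrm{CKY}}(n) \;=\; \Theta\!\left(n^{3}\right),
  \qquad
  T_{\mathrm{Valiant}}(n) \;=\; O\!\left(n^{\omega}\right),
  \quad \omega < 2.38 ,
\]
% where \omega is the matrix-multiplication exponent; dynamic-programming
% problems that fit Valiant's template inherit the sub-cubic bound.
```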
Directory of Open Access Journals (Sweden)
Mayara V Damasceno
Previous studies report that static stretching (SS) impairs running economy. Assuming that pacing strategy relies on rate of energy use, this study aimed to determine whether SS would modify pacing strategy and performance in a 3-km running time-trial. Eleven recreational distance runners performed (a) a constant-speed running test without previous SS and a maximal incremental treadmill test; (b) an anthropometric assessment and a constant-speed running test with previous SS; (c) a 3-km time-trial familiarization on an outdoor 400-m track; (d) and (e) two 3-km time-trials, one with SS (experimental situation) and another without (control situation) previous static stretching. The order of sessions (d) and (e) was randomized in a counterbalanced fashion. Sit-and-reach and drop jump tests were performed before the 3-km running time-trial in the control situation and before and after stretching exercises in the SS situation. Running economy, stride parameters, and electromyographic activity (EMG) of vastus medialis (VM), biceps femoris (BF) and gastrocnemius medialis (GA) were measured during the constant-speed tests. The overall running time did not change with condition (SS 11:35±00:31 s; control 11:28±00:41 s, p = 0.304), but the first 100 m was completed at a significantly lower velocity after SS. Surprisingly, SS did not modify the running economy, but the iEMG for the BF (+22.6%, p = 0.031), stride duration (+2.1%, p = 0.053) and range of motion (+11.1%, p = 0.0001) were significantly modified. Drop jump height decreased following SS (-9.2%, p = 0.001). Static stretching impaired neuromuscular function, resulting in a slow start during a 3-km running time-trial, thus demonstrating the fundamental role of the neuromuscular system in the self-selected speed during the initial phase of the race.
Landau gauge fixing on the lattice using GPU's
Cardoso, Nuno; Oliveira, Orlando; Bicudo, Pedro
2013-01-01
In this work, we consider the GPU implementation of the steepest descent method with Fourier acceleration for Landau gauge fixing, using CUDA. The performance of the code on a Tesla C2070 GPU is compared with a parallel CPU implementation.
NMF-mGPU: non-negative matrix factorization on multi-GPU systems.
Mejía-Roa, Edgardo; Tabas-Madrid, Daniel; Setoain, Javier; García, Carlos; Tirado, Francisco; Pascual-Montano, Alberto
2015-02-13
In the last few years, the Non-negative Matrix Factorization (NMF) technique has gained great interest in the Bioinformatics community, since it is able to extract interpretable parts from high-dimensional datasets. However, the computing time required to process large data matrices may become impractical, even for a parallel application running on a multiprocessor cluster. In this paper, we present NMF-mGPU, an efficient and easy-to-use implementation of the NMF algorithm that takes advantage of the high computing performance delivered by Graphics Processing Units (GPUs). Driven by the ever-growing demands of the video-games industry, graphics cards usually provided in PCs and laptops have evolved from simple graphics-drawing platforms into high-performance programmable systems that can be used as coprocessors for linear-algebra operations. However, these devices may have a limited amount of on-board memory, which is not considered by other NMF implementations on GPU. NMF-mGPU is based on CUDA (Compute Unified Device Architecture), NVIDIA's framework for GPU computing. On devices with low memory available, large input matrices are blockwise transferred from the system's main memory to the GPU's memory, and processed accordingly. In addition, NMF-mGPU has been explicitly optimized for the different CUDA architectures. Finally, platforms with multiple GPUs can be synchronized through MPI (Message Passing Interface). In a four-GPU system, this implementation is about 120 times faster than a single conventional processor, and more than four times faster than a single GPU device (i.e., a super-linear speedup). Applications of GPUs in Bioinformatics are getting more and more attention due to their outstanding performance when compared to traditional processors. In addition, their relatively low price represents a highly cost-effective alternative to conventional clusters. In life sciences, this results in an excellent opportunity to facilitate the
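The record describes blockwise transfer of matrices that exceed GPU memory; the sketch below shows that streaming pattern in generic CUDA, with a placeholder kernel standing in for the actual NMF multiplicative-update step (which is not reproduced here). Buffer sizes and function names are illustrative, not NMF-mGPU's interface.

```cuda
#include <algorithm>
#include <vector>
#include <cuda_runtime.h>

// Placeholder per-block kernel; in NMF this would apply one update step to a
// slab of the input matrix. Here it simply clamps negative entries to zero.
__global__ void process_block(float* block, int rows, int cols) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < rows * cols) block[i] = fmaxf(block[i], 0.0f);
}

// Stream a (rows x cols) host matrix through a fixed-size device buffer,
// so matrices larger than GPU memory can still be processed.
void process_blockwise(std::vector<float>& h_matrix, int rows, int cols,
                       int rows_per_block) {
    float* d_block = nullptr;
    cudaMalloc(&d_block, (size_t)rows_per_block * cols * sizeof(float));

    for (int r0 = 0; r0 < rows; r0 += rows_per_block) {
        int r = std::min(rows_per_block, rows - r0);
        size_t bytes = (size_t)r * cols * sizeof(float);
        float* h_ptr = h_matrix.data() + (size_t)r0 * cols;

        cudaMemcpy(d_block, h_ptr, bytes, cudaMemcpyHostToDevice);
        process_block<<<(r * cols + 255) / 256, 256>>>(d_block, r, cols);
        cudaMemcpy(h_ptr, d_block, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(d_block);
}
```

In a production code the copies and kernels would typically be double-buffered on separate streams so transfer and compute overlap, but the single-buffer loop above already conveys the memory-limited design point the record describes.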
Accelerate micromagnetic simulations with GPU programming in MATLAB
Zhu, Ru
2015-01-01
A finite-difference micromagnetic simulation code written in MATLAB is presented with Graphics Processing Unit (GPU) acceleration. The high performance of the GPU is demonstrated compared to a typical Central Processing Unit (CPU) based code. The speed-up of the GPU over the CPU is shown to be greater than 30 for problems with larger sizes on a mid-end GPU in single precision. The code is less than 200 lines and suitable for developing new algorithms.
Design and Development of a Run-Time Monitor for Multi-Core Architectures in Cloud Computing
Directory of Open Access Journals (Sweden)
Junghoon Lee
2011-03-01
Cloud computing is a new information technology trend that moves computing and data away from desktops and portable PCs into large data centers. The basic principle of cloud computing is to deliver applications as services over the Internet as well as infrastructure. A cloud is a type of parallel and distributed system consisting of a collection of inter-connected and virtualized computers that are dynamically provisioned and presented as one or more unified computing resources. The large-scale distributed applications on a cloud require adaptive service-based software, which has the capability of monitoring system status changes, analyzing the monitored information, and adapting its service configuration while considering tradeoffs among multiple QoS features simultaneously. In this paper, we design and develop a Run-Time Monitor (RTM), which is system software to monitor the application behavior at run-time, analyze the collected information, and optimize cloud computing resources for multi-core architectures. RTM monitors application software through library instrumentation as well as underlying hardware through a performance counter, optimizing its computing configuration based on the analyzed data.
Design and development of a run-time monitor for multi-core architectures in cloud computing.
Kang, Mikyung; Kang, Dong-In; Crago, Stephen P; Park, Gyung-Leen; Lee, Junghoon
2011-01-01
Cloud computing is a new information technology trend that moves computing and data away from desktops and portable PCs into large data centers. The basic principle of cloud computing is to deliver applications as services over the Internet as well as infrastructure. A cloud is a type of parallel and distributed system consisting of a collection of inter-connected and virtualized computers that are dynamically provisioned and presented as one or more unified computing resources. The large-scale distributed applications on a cloud require adaptive service-based software, which has the capability of monitoring system status changes, analyzing the monitored information, and adapting its service configuration while considering tradeoffs among multiple QoS features simultaneously. In this paper, we design and develop a Run-Time Monitor (RTM) which is system software to monitor the application behavior at run-time, analyze the collected information, and optimize cloud computing resources for multi-core architectures. RTM monitors application software through library instrumentation as well as underlying hardware through a performance counter, optimizing its computing configuration based on the analyzed data.
MIGS-GPU: Microarray Image Gridding and Segmentation on the GPU.
Katsigiannis, Stamos; Zacharia, Eleni; Maroulis, Dimitris
2016-03-03
cDNA microarray is a powerful tool for simultaneously studying the expression level of thousands of genes. Nevertheless, the analysis of microarray images remains an arduous and challenging task due to the poor quality of the images, which often suffer from noise, artifacts, and uneven background. In this work, the MIGS-GPU (Microarray Image Gridding and Segmentation on GPU) software for gridding and segmenting microarray images is presented. MIGS-GPU's computations are performed on the graphics processing unit (GPU) by means of the CUDA architecture in order to achieve fast performance and increase the utilization of available system resources. Evaluation on both real and synthetic cDNA microarray images showed that MIGS-GPU provides better performance than state-of-the-art alternatives, while the proposed GPU implementation achieves significantly lower computational times compared to the respective CPU approaches. Consequently, MIGS-GPU can be an advantageous and useful tool for biomedical laboratories, offering a user-friendly interface that requires minimum input in order to run.
Edge-preserving image denoising via group coordinate descent on the GPU.
McGaffin, Madison Gray; Fessler, Jeffrey A
2015-04-01
Image denoising is a fundamental operation in image processing, and its applications range from the direct (photographic enhancement) to the technical (as a subproblem in image reconstruction algorithms). In many applications, the number of pixels has continued to grow, while the serial execution speed of computational hardware has begun to stall. New image processing algorithms must exploit the power offered by massively parallel architectures like graphics processing units (GPUs). This paper describes a family of image denoising algorithms well-suited to the GPU. The algorithms iteratively perform a set of independent, parallel 1D pixel-update subproblems. To match GPU memory limitations, they perform these pixel updates in-place and only store the noisy data, denoised image, and problem parameters. The algorithms can handle a wide range of edge-preserving roughness penalties, including differentiable convex penalties and anisotropic total variation. Both algorithms use the majorize-minimize framework to solve the 1D pixel update subproblem. Results from a large 2D image denoising problem and a 3D medical imaging denoising problem demonstrate that the proposed algorithms converge rapidly in terms of both iteration and run-time.
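A minimal CUDA sketch of the grouped, in-place pixel updates that such denoising algorithms rely on. It is not the paper's edge-preserving algorithm: for brevity it uses a plain quadratic roughness penalty, for which the 1D pixel subproblem has the closed-form minimizer shown in the comments, and it makes the parallel updates independent with a simple checkerboard grouping. All names, sizes, and the penalty choice are illustrative assumptions.

    // Simplified group coordinate descent for denoising (illustrative only):
    // cost(x) = 0.5*sum_j (x_j - y_j)^2 + (beta/2)*sum_{4-neighbours} (x_j - x_k)^2.
    // Pixels of one checkerboard colour are updated in place while the other
    // colour stays fixed, so every update in a launch is independent.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void quadCDStep(float* x, const float* y, int w, int h,
                               float beta, int colour)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // column
        int j = blockIdx.y * blockDim.y + threadIdx.y;   // row
        if (i >= w || j >= h || ((i + j) & 1) != colour) return;
        float s = 0.f; int cnt = 0;                      // sum and count of neighbours
        if (i > 0)     { s += x[j * w + i - 1]; ++cnt; }
        if (i < w - 1) { s += x[j * w + i + 1]; ++cnt; }
        if (j > 0)     { s += x[(j - 1) * w + i]; ++cnt; }
        if (j < h - 1) { s += x[(j + 1) * w + i]; ++cnt; }
        // Exact minimiser of the 1D subproblem in x_j with neighbours held fixed.
        x[j * w + i] = (y[j * w + i] + beta * s) / (1.f + beta * cnt);
    }

    int main()
    {
        const int w = 64, h = 64, n = w * h;
        float *x, *y;
        cudaMallocManaged(&x, n * sizeof(float));
        cudaMallocManaged(&y, n * sizeof(float));
        for (int k = 0; k < n; ++k) { y[k] = (k % 7) * 0.1f; x[k] = y[k]; }  // toy "noisy" image
        dim3 block(16, 16), grid((w + 15) / 16, (h + 15) / 16);
        for (int sweep = 0; sweep < 50; ++sweep) {       // alternate the two pixel groups
            quadCDStep<<<grid, block>>>(x, y, w, h, 1.0f, 0);
            quadCDStep<<<grid, block>>>(x, y, w, h, 1.0f, 1);
        }
        cudaDeviceSynchronize();
        printf("x[0]=%.4f x[n/2]=%.4f\n", x[0], x[n / 2]);
        cudaFree(x); cudaFree(y);
        return 0;
    }

Because the data stay in place and only the noisy image, the estimate, and the penalty weight are stored, the memory footprint matches the in-place pattern described in the abstract.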
Discrete shearlet transform on GPU with applications in anomaly detection and denoising
Gibert, Xavier; Patel, Vishal M.; Labate, Demetrio; Chellappa, Rama
2014-12-01
Shearlets have emerged in recent years as one of the most successful methods for the multiscale analysis of multidimensional signals. Unlike wavelets, shearlets form a pyramid of well-localized functions defined not only over a range of scales and locations, but also over a range of orientations and with highly anisotropic supports. As a result, shearlets are much more effective than traditional wavelets in handling the geometry of multidimensional data, and this was exploited in a wide range of applications from image and signal processing. However, despite their desirable properties, the wider applicability of shearlets is limited by the computational complexity of current software implementations. For example, denoising a single 512 × 512 image using a current implementation of the shearlet-based shrinkage algorithm can take between 10 s and 2 min, depending on the number of CPU cores, and much longer processing times are required for video denoising. On the other hand, due to the parallel nature of the shearlet transform, it is possible to use graphics processing units (GPU) to accelerate its implementation. In this paper, we present an open source stand-alone implementation of the 2D discrete shearlet transform using CUDA C++ as well as GPU-accelerated MATLAB implementations of the 2D and 3D shearlet transforms. We have instrumented the code so that we can analyze the running time of each kernel under different GPU hardware. In addition to denoising, we describe a novel application of shearlets for detecting anomalies in textured images. In this application, computation times can be reduced by a factor of 50 or more, compared to multicore CPU implementations.
GPU-accelerated raster map reprojection
Directory of Open Access Journals (Sweden)
Petr Sloup
2016-07-01
Full Text Available Reprojecting raster maps from one projection to another is an essential part of many cartographic processes (map comparison, overlays, data presentation, ...), and reducing the required computational time is desirable and often significantly decreases overall processing costs. The raster reprojection process operates per-pixel and is, therefore, a good candidate for GPU-based parallelization, where the large number of processors can lead to a very high degree of parallelism. We have created an experimental implementation of raster reprojection with GPU-based parallelization (using the OpenCL API). During the evaluation, we compared the performance of our implementation to the optimized GDAL and showed that there is a class of problems where GPU-based parallelization can lead to more than sevenfold speedup.
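The per-pixel nature of the reprojection maps naturally onto one GPU thread per destination pixel. The sketch below only illustrates that idea in CUDA (the paper's implementation uses OpenCL): it assumes a hypothetical source raster covering the whole globe in equirectangular (lon/lat) layout, a destination raster in spherical Web Mercator described by a simple geotransform, and nearest-neighbour sampling. All names and parameters are made up for the example.

    // Illustrative inverse-mapping reprojection: equirectangular source ->
    // spherical Web Mercator destination, one thread per destination pixel.
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cmath>

    __global__ void reproject(const unsigned char* src, int sw, int sh,
                              unsigned char* dst, int dw, int dh,
                              float originX, float originY, float pixSize)
    {
        const float R  = 6378137.0f;                 // spherical Mercator radius (metres)
        const float PI = 3.14159265f;
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i >= dw || j >= dh) return;
        float mx = originX + (i + 0.5f) * pixSize;   // destination pixel centre (metres)
        float my = originY - (j + 0.5f) * pixSize;
        float lon = mx / R;                          // inverse spherical Mercator
        float lat = 2.0f * atanf(expf(my / R)) - 0.5f * PI;
        int si = (int)((lon + PI) / (2.0f * PI) * sw);        // source column
        int sj = (int)((0.5f * PI - lat) / PI * sh);          // source row
        si = max(0, min(si, sw - 1));
        sj = max(0, min(sj, sh - 1));
        dst[j * dw + i] = src[sj * sw + si];         // nearest-neighbour sample
    }

    int main()
    {
        const int sw = 360, sh = 180, dw = 256, dh = 256;
        unsigned char *src, *dst;
        cudaMallocManaged(&src, sw * sh);
        cudaMallocManaged(&dst, dw * dh);
        for (int j = 0; j < sh; ++j)                 // checkerboard test raster
            for (int i = 0; i < sw; ++i)
                src[j * sw + i] = ((i / 10 + j / 10) & 1) ? 255 : 0;
        // Destination covers roughly +-20000 km in Web Mercator.
        float originX = -2.0e7f, originY = 2.0e7f, pixSize = 4.0e7f / dw;
        dim3 block(16, 16), grid((dw + 15) / 16, (dh + 15) / 16);
        reproject<<<grid, block>>>(src, sw, sh, dst, dw, dh, originX, originY, pixSize);
        cudaDeviceSynchronize();
        printf("dst centre value: %d\n", dst[(dh / 2) * dw + dw / 2]);
        cudaFree(src); cudaFree(dst);
        return 0;
    }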
GPU-accelerated computation of electron transfer.
Höfinger, Siegfried; Acocella, Angela; Pop, Sergiu C; Narumi, Tetsu; Yasuoka, Kenji; Beu, Titus; Zerbetto, Francesco
2012-11-05
Electron transfer is a fundamental process that can be studied with the help of computer simulation. The underlying quantum mechanical description renders the problem a computationally intensive application. In this study, we probe the graphics processing unit (GPU) for suitability to this type of problem. Time-critical components are identified via profiling of an existing implementation and several different variants are tested involving the GPU at increasing levels of abstraction. A publicly available library supporting basic linear algebra operations on the GPU turns out to accelerate the computation approximately 50-fold with minor dependence on actual problem size. The performance gain does not compromise numerical accuracy and is of significant value for practical purposes. Copyright © 2012 Wiley Periodicals, Inc.
ITS Cluster Finding Algorithm on GPU
Changaival, Boonyarit
2014-01-01
ITS cluster finding algorithm is one of the data reduction algorithms at ALICE. It needs to be processed fast due to the high amount of data read out from the detector. A variety of platforms were studied for the system design. My work is to design, implement and benchmark this algorithm on a GPU platform. The GPU is one of many platforms that promote parallel computing. A high-end GPU can contain over 2000 processing cores, compared to commodity CPUs which have only four cores. The program is written in C with the CUDA library. The throughput (number of events per second) is used as a metric to measure the performance. With the latest implementation, the throughput was increased by a factor of 5.
GPU based framework for geospatial analyses
Cosmin Sandric, Ionut; Ionita, Cristian; Dardala, Marian; Furtuna, Titus
2017-04-01
Parallel processing on multiple CPU cores is already used at large scale in geocomputing, but parallel processing on graphics cards is only just beginning. Being able to use a simple laptop with a dedicated graphics card for advanced and very fast geocomputation is an advantage that every scientist wants. The need for high-speed computation in geosciences has increased over the last 10 years, mostly due to the growth of the available datasets. These datasets are becoming more and more detailed and hence require more space to store and more time to process. Distributed computation on multi-core CPUs and GPUs plays an important role by processing small parts of these large datasets one by one. This way of computing speeds up the process because, instead of using just one process per dataset, the user can use all the cores of a CPU or up to hundreds of cores of a GPU. The framework provides the end user with standalone tools for morphometric analyses at multiple scales. An important part of the framework is dedicated to uncertainty propagation in geospatial analyses. The uncertainty may come from the data collection, may be induced by the model, or may have infinitely many sources. These uncertainties play an important role when a spatial delineation of the phenomenon is modelled. Uncertainty propagation is implemented inside the GPU framework using Monte Carlo simulations. The GPU framework with its standalone tools proved to be a reliable tool for modelling complex natural phenomena. The framework is based on NVIDIA CUDA technology and is written in the C++ programming language. The source code will be available on GitHub at https://github.com/sandricionut/GeoRsGPU. Acknowledgement: GPU framework for geospatial analysis, Young Researchers Grant (ICUB-University of Bucharest) 2016, director Ionut Sandric.
GPU-accelerated voxelwise hepatic perfusion quantification.
Wang, H; Cao, Y
2012-09-07
Voxelwise quantification of hepatic perfusion parameters from dynamic contrast enhanced (DCE) imaging greatly contributes to assessment of liver function in response to radiation therapy. However, the efficiency of the estimation of hepatic perfusion parameters voxel-by-voxel in the whole liver using a dual-input single-compartment model requires substantial improvement for routine clinical applications. In this paper, we utilize the parallel computation power of a graphics processing unit (GPU) to accelerate the computation, while maintaining the same accuracy as the conventional method. Using compute unified device architecture-GPU, the hepatic perfusion computations over multiple voxels are run across the GPU blocks concurrently but independently. At each voxel, nonlinear least-squares fitting of the time series of the liver DCE data to the compartmental model is distributed to multiple threads in a block, and the computations of different time points are performed simultaneously and synchronously. An efficient fast Fourier transform in a block is also developed for the convolution computation in the model. The GPU computations of the voxel-by-voxel hepatic perfusion images are compared with those computed by the CPU using the simulated DCE data and the experimental DCE MR images from patients. The computation speed is improved by 30 times using a NVIDIA Tesla C2050 GPU compared to a 2.67 GHz Intel Xeon CPU processor. To obtain liver perfusion maps with 626 400 voxels in a patient's liver, it takes 0.9 min with the GPU-accelerated voxelwise computation, compared to 110 min with the CPU, while both methods result in perfusion parameter differences of less than 10^-6. The method will be useful for generating liver perfusion images in clinical settings.
GPU Pro 4 advanced rendering techniques
Engel, Wolfgang
2013-01-01
GPU Pro4: Advanced Rendering Techniques presents ready-to-use ideas and procedures that can help solve many of your day-to-day graphics programming challenges. Focusing on interactive media and games, the book covers up-to-date methods producing real-time graphics. Section editors Wolfgang Engel, Christopher Oat, Carsten Dachsbacher, Michal Valient, Wessam Bahnassi, and Sebastien St-Laurent have once again assembled a high-quality collection of cutting-edge techniques for advanced graphics processing unit (GPU) programming. Divided into six sections, the book begins with discussions on the abi
GPU Pro 5 advanced rendering techniques
Engel, Wolfgang
2014-01-01
In GPU Pro5: Advanced Rendering Techniques, section editors Wolfgang Engel, Christopher Oat, Carsten Dachsbacher, Michal Valient, Wessam Bahnassi, and Marius Bjorge have once again assembled a high-quality collection of cutting-edge techniques for advanced graphics processing unit (GPU) programming. Divided into six sections, the book covers rendering, lighting, effects in image space, mobile devices, 3D engine design, and compute. It explores rasterization of liquids, ray tracing of art assets that would otherwise be used in a rasterized engine, physically based area lights, volumetric light
Triggering events with GPU at ATLAS
Kama, Sami; The ATLAS collaboration
2015-01-01
The growing complexity of events produced in LHC collisions demands more and more computing power, both for the online selection and for the offline reconstruction of events. In recent years, the explosive performance growth of massively parallel processors like graphics processing units, both in computing power and in low energy consumption, has made GPUs extremely attractive for use in a complex high-energy experiment like ATLAS. Together with the optimization of reconstruction algorithms exploiting this new massively parallel paradigm, a small-scale prototype of the full ATLAS High Level Trigger exploiting GPUs has been implemented. We discuss the integration procedure of this prototype, the achieved performance and the prospects for the future.
Bédorf, Jeroen; Zwart, Simon Portegies
2012-01-01
We present a gravitational hierarchical N-body code that is designed to run efficiently on Graphics Processing Units (GPUs). All parts of the algorithm are executed on the GPU which eliminates the need for data transfer between the Central Processing Unit (CPU) and the GPU. Our tests indicate that the gravitational tree-code outperforms tuned CPU code for all parts of the algorithm and show an overall performance improvement of more than a factor 20, resulting in a processing rate of more than 2.8 million particles per second.
Colloquium: Large scale simulations on GPU clusters
Bernaschi, Massimo; Bisson, Mauro; Fatica, Massimiliano
2015-06-01
Graphics processing units (GPU) are currently used as a cost-effective platform for computer simulations and big-data processing. Large scale applications require that multiple GPUs work together, but the efficiency obtained with clusters of GPUs is, at times, sub-optimal because the GPU features are not exploited at their best. We describe how it is possible to achieve an excellent efficiency for applications in statistical mechanics, particle dynamics and networks analysis by using suitable memory access patterns and mechanisms like CUDA streams, profiling tools, etc. Similar concepts and techniques may be applied also to other problems like the solution of Partial Differential Equations.
Howe, A E; Whitley, L D; 10.1613/jair.1576
2011-01-01
Tabu search is one of the most effective heuristics for locating high-quality solutions to a diverse array of NP-hard combinatorial optimization problems. Despite the widespread success of tabu search, researchers have a poor understanding of many key theoretical aspects of this algorithm, including models of the high-level run-time dynamics and identification of those search space features that influence problem difficulty. We consider these questions in the context of the job-shop scheduling problem (JSP), a domain where tabu search algorithms have been shown to be remarkably effective. Previously, we demonstrated that the mean distance between random local optima and the nearest optimal solution is highly correlated with problem difficulty for a well-known tabu search algorithm for the JSP introduced by Taillard. In this paper, we discuss various shortcomings of this measure and develop a new model of problem difficulty that corrects these deficiencies. We show that Taillard's algorithm can be modeled with ...
Phase-step calibration technique based on a two-run-times-two-frame phase-shift method.
Zhong, Xianghong
2006-12-10
A novel phase-step calibration technique is presented on the basis of a two-run-times-two-frame phase-shift method. First the symmetry factor M is defined to describe the distribution property of the distorted phase due to phase-shifter miscalibration; then the phase-step calibration technique, in which two sets of two interferograms with a straight fringe pattern are recorded and the phase step is obtained by calculating M of the wrapped phase map, is developed. With this technique, a good mirror is required, but no uniform illumination is needed and no complex mathematical operation is involved. This technique can be carried out in situ and is applicable to any phase shifter, whether linear or nonlinear.
SW#db: GPU-Accelerated Exact Sequence Similarity Database Search.
Directory of Open Access Journals (Sweden)
Matija Korpar
Full Text Available In recent years we have witnessed a growth in sequencing yield, the number of samples sequenced, and, as a result, the growth of publicly maintained sequence databases. This increase in available data has placed high requirements on protein similarity search algorithms, with two ever-opposing goals: keeping the running times acceptable while maintaining a high-enough level of sensitivity. The most time-consuming step of similarity search is the local alignment between query and database sequences. This step is usually performed using exact local alignment algorithms such as Smith-Waterman. Due to its quadratic time complexity, alignments of a query to the whole database are usually too slow. Therefore, the majority of protein similarity search methods apply heuristics prior to doing the exact local alignment in order to reduce the number of possible candidate sequences in the database. However, there is still a need for the alignment of a query sequence to a reduced database. In this paper we present the SW#db tool and a library for fast exact similarity search. Although its running times, as a standalone tool, are comparable to the running times of BLAST, it is primarily intended to be used for the exact local alignment phase in which the database of sequences has already been reduced. It uses both GPU and CPU parallelization and was 4-5 times faster than SSEARCH, 6-25 times faster than CUDASW++ and more than 20 times faster than SSW at the time of writing, using multiple queries on Swiss-prot and Uniref90 databases.
Peabody, Hume; Guerrero, Sergio; Hawk, John; Rodriguez, Juan; McDonald, Carson; Jackson, Cliff
2016-01-01
The Wide Field Infrared Survey Telescope using Astrophysics Focused Telescope Assets (WFIRST-AFTA) utilizes an existing 2.4 m diameter Hubble-sized telescope donated from elsewhere in the federal government for near-infrared sky surveys and exoplanet searches to answer crucial questions about the universe and dark energy. The WFIRST design continues to increase in maturity, detail, and complexity with each design cycle leading to a Mission Concept Review and entrance to the Mission Formulation Phase. Each cycle has required a Structural-Thermal-Optical-Performance (STOP) analysis to ensure the design can meet the stringent pointing and stability requirements. As such, the models have also grown in size and complexity leading to increased model run time. This paper addresses efforts to reduce the run time while still maintaining sufficient accuracy for STOP analyses. A technique was developed to identify slews between observing orientations that were sufficiently different to warrant recalculation of the environmental fluxes to reduce the total number of radiation calculation points. The inclusion of a cryocooler fluid loop in the model also forced smaller time-steps than desired, which greatly increased the overall run time. The analysis of this fluid model required mitigation to drive the run time down by solving portions of the model at different time scales. Lastly, investigations were made into the impact of the removal of small radiation couplings on run time and accuracy. Use of these techniques allowed the models to produce meaningful results within reasonable run times to meet project schedule deadlines.
Parallel generation of architecture on the GPU
Steinberger, Markus
2014-05-01
In this paper, we present a novel approach for the parallel evaluation of procedural shape grammars on the graphics processing unit (GPU). Unlike previous approaches that are either limited in the kind of shapes they allow, the amount of parallelism they can take advantage of, or both, our method supports state of the art procedural modeling including stochasticity and context-sensitivity. To increase parallelism, we explicitly express independence in the grammar, reduce inter-rule dependencies required for context-sensitive evaluation, and introduce intra-rule parallelism. Our rule scheduling scheme avoids unnecessary back and forth between CPU and GPU and reduces round trips to slow global memory by dynamically grouping rules in on-chip shared memory. Our GPU shape grammar implementation is multiple orders of magnitude faster than the standard in CPU-based rule evaluation, while offering equal expressive power. In comparison to the state of the art in GPU shape grammar derivation, our approach is nearly 50 times faster, while adding support for geometric context-sensitivity. © 2014 The Author(s) Computer Graphics Forum © 2014 The Eurographics Association and John Wiley & Sons Ltd. Published by John Wiley & Sons Ltd.
Graph coarsening and clustering on the GPU
Fagginger Auer, B.O.; Bisseling, R.H.
2013-01-01
Agglomerative clustering is an effective greedy way to quickly generate graph clusterings of high modularity in a small amount of time. In an effort to use the power offered by multi-core CPU and GPU hardware to solve the clustering problem, we introduce a fine-grained shared-memory parallel graph co
Fully 3D GPU PET reconstruction
Energy Technology Data Exchange (ETDEWEB)
Herraiz, J.L., E-mail: joaquin@nuclear.fis.ucm.es [Grupo de Fisica Nuclear, Departmento Fisica Atomica, Molecular y Nuclear, Universidad Complutense de Madrid (Spain); Espana, S. [Department of Radiation Oncology, Massachusetts General Hospital and Harvard Medical School, Boston, MA (United States); Cal-Gonzalez, J. [Grupo de Fisica Nuclear, Departmento Fisica Atomica, Molecular y Nuclear, Universidad Complutense de Madrid (Spain); Vaquero, J.J. [Departmento de Bioingenieria e Ingenieria Espacial, Universidad Carlos III, Madrid (Spain); Desco, M. [Departmento de Bioingenieria e Ingenieria Espacial, Universidad Carlos III, Madrid (Spain); Unidad de Medicina y Cirugia Experimental, Hospital General Universitario Gregorio Maranon, Madrid (Spain); Udias, J.M. [Grupo de Fisica Nuclear, Departmento Fisica Atomica, Molecular y Nuclear, Universidad Complutense de Madrid (Spain)
2011-08-21
Fully 3D iterative tomographic image reconstruction is computationally very demanding. The graphics processing unit (GPU) has been proposed for many years as a potential accelerator for complex scientific problems, but it was not until recent advances in GPU programmability that the best available reconstruction codes started to be implemented to run on GPUs. This work presents a GPU-based fully 3D PET iterative reconstruction software. This new code may reconstruct sinogram data from several commercially available PET scanners. The most important and time-consuming parts of the code, the forward and backward projection operations, are based on an accurate model of the scanner obtained with the Monte Carlo code PeneloPET and they have been massively parallelized on the GPU. For the PET scanners considered, the GPU-based code is more than 70 times faster than a similar code running on a single core of a fast CPU, obtaining in both cases the same images. The code has been designed to be easily adapted to reconstruct sinograms from any other PET scanner, including scanner prototypes.
GPU Accelerated Surgical Simulators for Complex Morhpology
DEFF Research Database (Denmark)
Mosegaard, Jesper; Sørensen, Thomas Sangild
2005-01-01
a spring-mass system in order to simulate a complex organ such as the heart. Computations are accelerated by taking advantage of modern graphics processing units (GPUs). Two GPU implementations are presented. They vary in their generality of spring connections and in the speedup factor they achieve...
Synthetic Aperture Beamformation using the GPU
DEFF Research Database (Denmark)
Hansen, Jens Munk; Schaa, Dana; Jensen, Jørgen Arendt
2011-01-01
A synthetic aperture ultrasound beamformer is implemented for a GPU using the OpenCL framework. The implementation supports beamformation of either RF signals or complex baseband signals. Transmit and receive apodization can be either parametric or dynamic using a fixed F-number, a reference, and ... workstation with 2 quad-core Xeon processors.
Using GPU shaders for visualization, part 3.
Bailey, Mike
2013-01-01
GPU shaders aren't just for glossy special effects. Parts 1 and 2 of this discussion looked at using them for point clouds, cutting planes, line integral convolution, and terrain bump-mapping. Part 3 covers compute shaders and shader storage buffer objects, two features announced as part of OpenGL 4.3.
Accelerated GPU based SPECT Monte Carlo simulations
Garcia, Marie-Paule; Bert, Julien; Benoit, Didier; Bardiès, Manuel; Visvikis, Dimitris
2016-06-01
Monte Carlo (MC) modelling is widely used in the field of single photon emission computed tomography (SPECT) as it is a reliable technique to simulate very high quality scans. This technique provides very accurate modelling of the radiation transport and particle interactions in a heterogeneous medium. Various MC codes exist for nuclear medicine imaging simulations. Recently, new strategies exploiting the computing capabilities of graphical processing units (GPU) have been proposed. This work aims at evaluating the accuracy of such GPU implementation strategies in comparison to standard MC codes in the context of SPECT imaging. GATE was considered the reference MC toolkit and used to evaluate the performance of newly developed GPU Geant4-based Monte Carlo simulation (GGEMS) modules for SPECT imaging. Radioisotopes with different photon energies were used with these various CPU and GPU Geant4-based MC codes in order to assess the best strategy for each configuration. Three different isotopes were considered: 99mTc, 111In and 131I, using a low energy high resolution (LEHR) collimator, a medium energy general purpose (MEGP) collimator and a high energy general purpose (HEGP) collimator respectively. Point source, uniform source, cylindrical phantom and anthropomorphic phantom acquisitions were simulated using a model of the GE Infinia II 3/8" gamma camera. Both simulation platforms yielded a similar system sensitivity and image statistical quality for the various combinations. The overall acceleration factor between the GATE and GGEMS platforms derived from the same cylindrical phantom acquisition was between 18 and 27 for the different radioisotopes. Besides, a full MC simulation using an anthropomorphic phantom showed the full potential of the GGEMS platform, with a resulting acceleration factor up to 71. The good agreement with reference codes and the acceleration factors obtained support the use of GPU implementation strategies for improving computational efficiency.
Accelerated GPU based SPECT Monte Carlo simulations.
Garcia, Marie-Paule; Bert, Julien; Benoit, Didier; Bardiès, Manuel; Visvikis, Dimitris
2016-06-07
Monte Carlo (MC) modelling is widely used in the field of single photon emission computed tomography (SPECT) as it is a reliable technique to simulate very high quality scans. This technique provides very accurate modelling of the radiation transport and particle interactions in a heterogeneous medium. Various MC codes exist for nuclear medicine imaging simulations. Recently, new strategies exploiting the computing capabilities of graphical processing units (GPU) have been proposed. This work aims at evaluating the accuracy of such GPU implementation strategies in comparison to standard MC codes in the context of SPECT imaging. GATE was considered the reference MC toolkit and used to evaluate the performance of newly developed GPU Geant4-based Monte Carlo simulation (GGEMS) modules for SPECT imaging. Radioisotopes with different photon energies were used with these various CPU and GPU Geant4-based MC codes in order to assess the best strategy for each configuration. Three different isotopes were considered: (99m) Tc, (111)In and (131)I, using a low energy high resolution (LEHR) collimator, a medium energy general purpose (MEGP) collimator and a high energy general purpose (HEGP) collimator respectively. Point source, uniform source, cylindrical phantom and anthropomorphic phantom acquisitions were simulated using a model of the GE infinia II 3/8" gamma camera. Both simulation platforms yielded a similar system sensitivity and image statistical quality for the various combinations. The overall acceleration factor between GATE and GGEMS platform derived from the same cylindrical phantom acquisition was between 18 and 27 for the different radioisotopes. Besides, a full MC simulation using an anthropomorphic phantom showed the full potential of the GGEMS platform, with a resulting acceleration factor up to 71. The good agreement with reference codes and the acceleration factors obtained support the use of GPU implementation strategies for improving computational
CPU and GPU (Cuda Template Matching Comparison
Directory of Open Access Journals (Sweden)
Evaldas Borcovas
2014-05-01
Full Text Available Image processing, computer vision and other complicated optical information processing algorithms require large resources. It is often desired to execute algorithms in real time. It is hard to fulfill such requirements with a single CPU processor. The CUDA technology proposed by NVIDIA enables the programmer to use the GPU resources in the computer. The current research was made with an Intel Pentium Dual-Core T4500 2.3 GHz processor with 4 GB RAM DDR3 (CPU I) and an NVIDIA GeForce GT320M CUDA-capable graphics card (GPU I), and an Intel Core i5-2500K 3.3 GHz processor with 4 GB RAM DDR3 (CPU II) and an NVIDIA GeForce GTX 560 CUDA-compatible graphics card (GPU II). Additional libraries, OpenCV 2.1 and the CUDA-capable OpenCV 2.4.0, were used for the testing. The main tests were made with the standard MatchTemplate function from the OpenCV libraries. The algorithm uses a main image and a template. The influence of these factors was tested: the main image and template were resized, and the algorithm computing time and performance in Gtpix/s were measured. According to the information obtained from the research, GPU computing using the hardware mentioned earlier is up to 24 times faster when processing a large amount of information. When the images are small, the performance of the CPU and GPU is not significantly different. The choice of template size influences the CPU computation. The difference in computing time between the two GPUs can be explained by the number of cores they have.
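For illustration only, the sketch below is a bare-bones CUDA analogue of the squared-difference mode of template matching: one thread computes the sum of squared differences for one candidate position, and the best match is the position of the minimum. It is not the OpenCV implementation used above, and all names and data are made up.

    // Illustrative SSD template matching: one thread per candidate position.
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cfloat>

    __global__ void matchSqDiff(const float* img, int W, int H,
                                const float* tpl, int w, int h, float* out)
    {
        int ow = W - w + 1, oh = H - h + 1;          // size of the result map
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= ow || y >= oh) return;
        float ssd = 0.f;
        for (int j = 0; j < h; ++j)                  // compare the template patch
            for (int i = 0; i < w; ++i) {
                float d = img[(y + j) * W + (x + i)] - tpl[j * w + i];
                ssd += d * d;
            }
        out[y * ow + x] = ssd;                       // best match = smallest value
    }

    int main()
    {
        const int W = 128, H = 128, w = 16, h = 16;
        float *img, *tpl, *out;
        cudaMallocManaged(&img, W * H * sizeof(float));
        cudaMallocManaged(&tpl, w * h * sizeof(float));
        cudaMallocManaged(&out, (W - w + 1) * (H - h + 1) * sizeof(float));
        for (int k = 0; k < W * H; ++k) img[k] = (float)((k * 37) % 255);
        for (int j = 0; j < h; ++j)                  // template = patch at (40, 30)
            for (int i = 0; i < w; ++i) tpl[j * w + i] = img[(30 + j) * W + (40 + i)];
        dim3 block(16, 16), grid((W - w + 16) / 16, (H - h + 16) / 16);
        matchSqDiff<<<grid, block>>>(img, W, H, tpl, w, h, out);
        cudaDeviceSynchronize();
        int best = 0; float bestV = FLT_MAX;         // find the minimum on the host
        for (int k = 0; k < (W - w + 1) * (H - h + 1); ++k)
            if (out[k] < bestV) { bestV = out[k]; best = k; }
        printf("best match at (%d, %d), ssd=%.1f\n",
               best % (W - w + 1), best / (W - w + 1), bestV);
        return 0;
    }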
GPU-based four-dimensional general-relativistic ray tracing
Kuchelmeister, Daniel; Müller, Thomas; Ament, Marco; Wunner, Günter; Weiskopf, Daniel
2012-10-01
-Rendering via OpenGL. Running time: Problem dependent, several seconds up to hours.
CSIR Research Space (South Africa)
Govender, Nicolin
2015-09-01
Full Text Available consideration due to the architectural differences between CPU and GPU platforms. This paper describes the DEM algorithms and heuristics that are optimized for the parallel NVIDIA Kepler GPU architecture in detail. This includes a GPU optimized collision...
GPU accelerated chemical similarity calculation for compound library comparison.
Ma, Chao; Wang, Lirong; Xie, Xiang-Qun
2011-07-25
Chemical similarity calculation plays an important role in compound library design, virtual screening, and "lead" optimization. In this manuscript, we present a novel GPU-accelerated algorithm for all-vs-all Tanimoto matrix calculation and nearest neighbor search. By taking advantage of multicore GPU architecture and CUDA parallel programming technology, the algorithm is up to 39 times faster than existing commercial software that runs on CPUs. Because of the utilization of intrinsic GPU instructions, this approach is nearly 10 times faster than the existing GPU-accelerated sparse vector algorithm when Unity fingerprints are used for Tanimoto calculation. The GPU program that implements this new method takes about 20 min to complete the calculation of Tanimoto coefficients between 32 M PubChem compounds and 10K Active Probes compounds, i.e., 324G Tanimoto coefficients, on a 128-CUDA-core GPU.
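The Tanimoto coefficient of two binary fingerprints A and B is |A AND B| / (|A| + |B| - |A AND B|), which maps well onto the GPU population-count intrinsic. The sketch below is a generic one-query-versus-database kernel for word-packed fingerprints, assuming a hypothetical 1024-bit fingerprint layout; it is not the paper's code and all names are illustrative.

    // Tanimoto similarity of one query fingerprint against a packed database,
    // one thread per database compound, using the __popc bit-count intrinsic.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void tanimoto(const unsigned int* db, const unsigned int* query,
                             int words, int nCompounds, float* sim)
    {
        int c = blockIdx.x * blockDim.x + threadIdx.x;
        if (c >= nCompounds) return;
        const unsigned int* fp = db + (size_t)c * words;
        int a = 0, b = 0, ab = 0;
        for (int w = 0; w < words; ++w) {
            a  += __popc(query[w]);            // bits set in the query
            b  += __popc(fp[w]);               // bits set in the database compound
            ab += __popc(query[w] & fp[w]);    // bits set in both
        }
        int denom = a + b - ab;
        sim[c] = denom > 0 ? (float)ab / (float)denom : 0.f;
    }

    int main()
    {
        const int words = 32, n = 1000;        // 1024-bit fingerprints, 1000 compounds
        unsigned int *db, *q; float *sim;
        cudaMallocManaged(&db, (size_t)n * words * sizeof(unsigned int));
        cudaMallocManaged(&q, words * sizeof(unsigned int));
        cudaMallocManaged(&sim, n * sizeof(float));
        for (int k = 0; k < n * words; ++k) db[k] = 0x9E3779B9u * (k + 1);  // pseudo-random bits
        for (int w = 0; w < words; ++w) q[w] = db[w];                        // query = compound 0
        tanimoto<<<(n + 255) / 256, 256>>>(db, q, words, n, sim);
        cudaDeviceSynchronize();
        printf("sim[0]=%.3f sim[1]=%.3f\n", sim[0], sim[1]);   // sim[0] should be 1.0
        return 0;
    }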
Simulating and Visualizing Real-Time Crowds on GPU Clusters
Benjamín Hernández; Hugo Pérez; Isaac Rudomin; Sergio Ruiz; Oriam de Gyves; Leonel Toledo
2014-01-01
We present a set of algorithms for simulating and visualizing real-time crowds in GPU (Graphics Processing Units) clusters. First we present crowd simulation and rendering techniques that take advantage of single GPU machines. Then, using a wandering crowd behavior simulation algorithm as an example, we explain how this kind of algorithm can be extended for use in GPU cluster environments. We also present a visualization architecture that renders the simulation results using detailed 3...
Haptic Feedback for the GPU-based Surgical Simulator
DEFF Research Database (Denmark)
Sørensen, Thomas Sangild; Mosegaard, Jesper
2006-01-01
The GPU has proven to be a powerful processor to compute spring-mass based surgical simulations. It has not previously been shown however, how to effectively implement haptic interaction with a simulation running entirely on the GPU. This paper describes a method to calculate haptic feedback...... with limited performance cost. It allows easy balancing of the GPU workload between calculations of simulation, visualisation, and the haptic feedback....
Evaluating the Power of GPU Acceleration for IDW Interpolation Algorithm
Gang Mei
2014-01-01
We first present two GPU implementations of the standard Inverse Distance Weighting (IDW) interpolation algorithm, the tiled version that takes advantage of shared memory and the CDP version that is implemented using CUDA Dynamic Parallelism (CDP). Then we evaluate the power of GPU acceleration for IDW interpolation algorithm by comparing the performance of CPU implementation with three GPU implementations, that is, the naive version, the tiled version, and the CDP version. Experimental resul...
Implementing Ultrasound Beamforming on the GPU using CUDA
Grønvold, Lars
2008-01-01
This thesis discusses the implementation of ultrasound beamforming on the GPU using CUDA. Fractional delay filters and the need for them when implementing beamforming are discussed. An introduction to CUDA programming is given as well as a study of the workings of the NVIDIA Tesla GPU (or G80). A number of suggestions for implementing beamforming on a GPU are presented as well as an actual implementation and an evaluation of its performance.
Large Scale Simulations of the Euler Equations on GPU Clusters
Liebmann, Manfred
2010-08-01
The paper investigates the scalability of a parallel Euler solver, using the Vijayasundaram method, on a GPU cluster with 32 Nvidia Geforce GTX 295 boards. The aim of this research is to enable large scale fluid dynamics simulations with up to one billion elements. We investigate communication protocols for the GPU cluster to compensate for the slow Gigabit Ethernet network between the GPU compute nodes and to maintain overall efficiency. A diesel engine intake-port and a nozzle, meshed in different resolutions, give good real world examples for the scalability tests on the GPU cluster. © 2010 IEEE.
Design and Implementation of GPU-Based Prim's Algorithm
Directory of Open Access Journals (Sweden)
Wei Wang
2011-07-01
Full Text Available Minimum spanning tree is a classical problem in graph theory that plays a key role in a broad domain of applications. This paper proposes a minimum spanning tree algorithm using Prim's approach on an Nvidia GPU under the CUDA architecture. By using a newly developed GPU-based min-reduction data-parallel primitive in the key step of the algorithm, higher efficiency is achieved. Experimental results show that we obtain about a 2 times speedup on an Nvidia GTX260 GPU over the CPU implementation and a 3 times speedup over a non-primitive GPU implementation.
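Prim's algorithm repeatedly picks the lightest edge crossing the cut, which is where a min-reduction primitive of the kind mentioned above comes in. The sketch below is a generic shared-memory block reduction that returns both the minimum value and its index; it is only an illustration of such a primitive, not the paper's implementation, and in practice the per-block results are reduced again (or the kernel re-run) until a single pair remains.

    // Block-level (min value, argmin) reduction in shared memory.
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cfloat>

    __global__ void minReduce(const float* val, int n, float* blockMin, int* blockArg)
    {
        extern __shared__ unsigned char smem[];
        float* sv = (float*)smem;                     // per-thread candidate values
        int*   si = (int*)(sv + blockDim.x);          // per-thread candidate indices
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;
        sv[tid] = (i < n) ? val[i] : FLT_MAX;
        si[tid] = (i < n) ? i : -1;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction
            if (tid < s && sv[tid + s] < sv[tid]) {
                sv[tid] = sv[tid + s];
                si[tid] = si[tid + s];
            }
            __syncthreads();
        }
        if (tid == 0) { blockMin[blockIdx.x] = sv[0]; blockArg[blockIdx.x] = si[0]; }
    }

    int main()
    {
        const int n = 1 << 16, threads = 256, blocks = (n + threads - 1) / threads;
        float *val, *bmin; int *barg;
        cudaMallocManaged(&val, n * sizeof(float));
        cudaMallocManaged(&bmin, blocks * sizeof(float));
        cudaMallocManaged(&barg, blocks * sizeof(int));
        for (int k = 0; k < n; ++k) val[k] = 1000.f + (k % 997);
        val[12345] = -1.f;                             // the edge weight we expect to win
        size_t shmem = threads * (sizeof(float) + sizeof(int));
        minReduce<<<blocks, threads, shmem>>>(val, n, bmin, barg);
        cudaDeviceSynchronize();
        float best = FLT_MAX; int arg = -1;            // final pass over per-block results
        for (int b = 0; b < blocks; ++b)
            if (bmin[b] < best) { best = bmin[b]; arg = barg[b]; }
        printf("min=%.1f at index %d\n", best, arg);
        return 0;
    }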
Adaptive Remote Sensing Texture Compression on GPU
Directory of Open Access Journals (Sweden)
Xiao-Xia Lu
2010-11-01
Full Text Available Considering the properties of remote sensing texture, such as strong randomness and weak local correlation, a novel adaptive compression method based on vector quantization is presented and implemented on the GPU. Utilizing properties of the Human Visual System (HVS), a new similarity measurement function is designed instead of using the Euclidean distance. The correlation threshold between blocks can be obtained adaptively according to the properties of different images, without manual tuning. Furthermore, a self-adaptive threshold adjustment during the compression is designed to improve the reconstruction quality. Experiments show that the method can handle images of various resolutions adaptively and achieve a satisfactory compression rate and reconstruction quality at the same time. The index is coded to further increase the compression rate, and the coding scheme is designed to guarantee random access to the index as well. Finally, the compression and decompression process is sped up on the GPU, on account of its parallelism.
Implementation of Membrane Algorithms on GPU
Directory of Open Access Journals (Sweden)
Xingyi Zhang
2014-01-01
Full Text Available Membrane algorithms are a new class of parallel algorithms, which attempt to incorporate some components of membrane computing models for designing efficient optimization algorithms, such as the structure of the models and the way of communication between cells. Although the importance of the parallelism of such algorithms has been well recognized, membrane algorithms were usually implemented on the serial computing device, the central processing unit (CPU), which makes the algorithms unable to work in an efficient way. In this work, we consider the implementation of membrane algorithms on the parallel computing device, the graphics processing unit (GPU). In such an implementation, all cells of membrane algorithms can work simultaneously. Experimental results on two classical intractable problems, the point set matching problem and TSP, show that the GPU implementation of membrane algorithms is much more efficient than the CPU implementation in terms of runtime, especially for solving problems with a high complexity.
Accelerating Dense Linear Algebra on the GPU
DEFF Research Database (Denmark)
Sørensen, Hans Henrik Brandenborg
and matrix-vector operations on GPUs. Such operations form the backbone of level 1 and level 2 routines in the Basic Linear Algebra Subroutines (BLAS) library and are therefore of great importance in many scientific applications. The target hardware is the most recent NVIDIA Tesla 20-series (Fermi...... architecture). Most of the techniques I discuss for accelerating dense linear algebra are applicable to memory-bound GPU algorithms in general....
Solving global optimization problems on GPU cluster
Barkalov, Konstantin; Gergel, Victor; Lebedev, Ilya
2016-06-01
The paper contains the results of an investigation of a parallel global optimization algorithm combined with a dimension reduction scheme. This allows solving multidimensional problems by reducing them to data-independent subproblems of smaller dimension solved in parallel. The new element implemented in the research consists of using several graphics accelerators at different computing nodes. The paper also includes results of solving problems of the well-known multiextremal test class GKLS on the Lobachevsky supercomputer using tens of thousands of GPU cores.
ALICE HLT high speed tracking on GPU
Gorbunov, Sergey; Aamodt, Kenneth; Alt, Torsten; Appelshauser, Harald; Arend, Andreas; Bach, Matthias; Becker, Bruce; Bottger, Stefan; Breitner, Timo; Busching, Henner; Chattopadhyay, Sukalyan; Cleymans, Jean; Cicalo, Corrado; Das, Indranil; Djuvsland, Oystein; Engel, Heiko; Erdal, Hege Austrheim; Fearick, Roger; Haaland, Oystein Senneset; Hille, Per Thomas; Kalcher, Sebastian; Kanaki, Kalliopi; Kebschull, Udo Wolfgang; Kisel, Ivan; Kretz, Matthias; Lara, Camillo; Lindal, Sven; Lindenstruth, Volker; Masoodi, Arshad Ahmad; Ovrebekk, Gaute; Panse, Ralf; Peschek, Jorg; Ploskon, Mateusz; Pocheptsov, Timur; Ram, Dinesh; Rascanu, Theodor; Richter, Matthias; Rohrich, Dieter; Ronchetti, Federico; Skaali, Bernhard; Smorholm, Olav; Stokkevag, Camilla; Steinbeck, Timm Morten; Szostak, Artur; Thader, Jochen; Tveter, Trine; Ullaland, Kjetil; Vilakazi, Zeblon; Weis, Robert; Yin, Zhong-Bao; Zelnicek, Pierre
2011-01-01
The on-line event reconstruction in ALICE is performed by the High Level Trigger, which should process up to 2000 events per second in proton-proton collisions and up to 300 central events per second in heavy-ion collisions, corresponding to an input data stream of 30 GB/s. In order to fulfill the time requirements, a fast on-line tracker has been developed. The algorithm combines a Cellular Automaton method being used for a fast pattern recognition and the Kalman Filter method for fitting of found trajectories and for the final track selection. The tracker was adapted to run on Graphics Processing Units (GPU) using the NVIDIA Compute Unified Device Architecture (CUDA) framework. The implementation of the algorithm had to be adjusted at many points to allow for an efficient usage of the graphics cards. In particular, achieving a good overall workload for many processor cores, efficient transfer to and from the GPU, as well as optimized utilization of the different memories the GPU offers turned out to be cri...
GPU-based video motion magnification
DomŻał, Mariusz; Jedrasiak, Karol; Sobel, Dawid; Ryt, Artur; Nawrat, Aleksander
2016-06-01
Video motion magnification (VMM) allows people to see otherwise invisible subtle changes in the surrounding world. VMM is also capable of hiding them with a modified version of the algorithm. It is possible to magnify motion related to the breathing of patients in hospital in order to observe it, or to suppress it and extract other information, such as blood flow, from the stabilized image sequence. In both cases we would like to perform the calculations in real time. Unfortunately, the VMM algorithm requires a great amount of computing power. In this article we show that the VMM algorithm can be parallelized (each thread processes one pixel), and to prove this we implemented the algorithm on the GPU using CUDA technology. The CPU is used only to grab, write, and display frames and to schedule work for the GPU. Each GPU kernel performs spatial decomposition, reconstruction and motion amplification. In this work we presented an approach that achieves a significant speedup over existing methods and allows VMM to process video in real time. This solution can be used as preprocessing for other algorithms in more complex systems, or can find application wherever real-time motion magnification would be useful. It is worth mentioning that the implementation runs on most modern desktops and laptops compatible with CUDA technology.
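Published motion magnification methods work on a spatial (Laplacian or complex steerable) pyramid; the sketch below keeps only the per-pixel temporal band-pass and amplification stage, purely to illustrate the one-thread-per-pixel mapping mentioned above. The state arrays, filter coefficients, and synthetic input are assumptions for the example, not the paper's pipeline.

    // Per-pixel temporal band-pass and amplification (one thread per pixel).
    // lo1/lo2 are per-pixel exponential moving averages kept across frames.
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cmath>

    __global__ void magnifyStep(const float* frame, float* lo1, float* lo2,
                                float* out, int n, float r1, float r2, float alpha)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        lo1[i] += r1 * (frame[i] - lo1[i]);      // faster moving average
        lo2[i] += r2 * (frame[i] - lo2[i]);      // slower moving average
        float band = lo1[i] - lo2[i];            // temporal band-pass component
        out[i] = frame[i] + alpha * band;        // amplify the subtle change
    }

    int main()
    {
        const int n = 320 * 240, frames = 120;
        float *f, *lo1, *lo2, *out;
        cudaMallocManaged(&f, n * sizeof(float));   cudaMallocManaged(&out, n * sizeof(float));
        cudaMallocManaged(&lo1, n * sizeof(float)); cudaMallocManaged(&lo2, n * sizeof(float));
        for (int k = 0; k < n; ++k) { lo1[k] = lo2[k] = 0.5f; }
        for (int t = 0; t < frames; ++t) {
            float wobble = 0.002f * sinf(0.3f * t);          // barely visible oscillation
            for (int k = 0; k < n; ++k) f[k] = 0.5f + wobble;
            magnifyStep<<<(n + 255) / 256, 256>>>(f, lo1, lo2, out, n, 0.4f, 0.05f, 20.f);
            cudaDeviceSynchronize();
            if (t % 30 == 0) printf("t=%3d  in=%.4f  out=%.4f\n", t, f[0], out[0]);
        }
        return 0;
    }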
Fast box-counting algorithm on GPU.
Jiménez, J; Ruiz de Miras, J
2012-12-01
The box-counting algorithm is one of the most widely used methods for calculating the fractal dimension (FD). The FD has many image analysis applications in the biomedical field, where it has been used extensively to characterize a wide range of medical signals. However, computing the FD for large images, especially in 3D, is a time consuming process. In this paper we present a fast parallel version of the box-counting algorithm, which has been coded in CUDA for execution on the Graphic Processing Unit (GPU). The optimized GPU implementation achieved an average speedup of 28 times (28×) compared to a mono-threaded CPU implementation, and an average speedup of 7 times (7×) compared to a multi-threaded CPU implementation. The performance of our improved box-counting algorithm has been tested with 3D models with different complexity, features and sizes. The validity and accuracy of the algorithm has been confirmed using models with well-known FD values. As a case study, a 3D FD analysis of several brain tissues has been performed using our GPU box-counting algorithm.
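As a concrete illustration of the parallel box-counting idea (not the paper's 3D implementation), the 2D sketch below assigns one thread per box: a box of side s is counted as occupied as soon as one foreground pixel is found, and the fractal dimension is then estimated on the host as the least-squares slope of log N(s) versus log(1/s). The test image, grid sizes, and all names are made up; a diagonal line should yield a dimension close to 1.

    // Parallel 2D box counting: one thread per box, atomic count of occupied boxes.
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cmath>

    __global__ void countBoxes(const unsigned char* img, int w, int h, int s, unsigned int* count)
    {
        int bx = blockIdx.x * blockDim.x + threadIdx.x;   // box column
        int by = blockIdx.y * blockDim.y + threadIdx.y;   // box row
        int nbx = (w + s - 1) / s, nby = (h + s - 1) / s;
        if (bx >= nbx || by >= nby) return;
        for (int j = by * s; j < min(by * s + s, h); ++j)
            for (int i = bx * s; i < min(bx * s + s, w); ++i)
                if (img[j * w + i]) { atomicAdd(count, 1u); return; }  // box occupied, stop early
    }

    int main()
    {
        const int w = 512, h = 512;
        unsigned char* img; unsigned int* count;
        cudaMallocManaged(&img, w * h);
        cudaMallocManaged(&count, sizeof(unsigned int));
        for (int k = 0; k < w * h; ++k) img[k] = 0;
        for (int i = 0; i < w; ++i) img[i * w + i] = 1;   // a diagonal line, FD should be ~1
        double sx = 0, sy = 0, sxx = 0, sxy = 0; int m = 0;
        for (int s = 1; s <= 64; s *= 2) {
            *count = 0;
            dim3 block(16, 16), grid(((w + s - 1) / s + 15) / 16, ((h + s - 1) / s + 15) / 16);
            countBoxes<<<grid, block>>>(img, w, h, s, count);
            cudaDeviceSynchronize();
            double X = log(1.0 / s), Y = log((double)*count);   // regression samples
            sx += X; sy += Y; sxx += X * X; sxy += X * Y; ++m;
            printf("s=%2d  N(s)=%u\n", s, *count);
        }
        double fd = (m * sxy - sx * sy) / (m * sxx - sx * sx);  // least-squares slope
        printf("estimated fractal dimension: %.3f\n", fd);
        return 0;
    }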
A GPU Based Run-Time Quad-Tree Construction Method for Fast Terrain Rendering
Institute of Scientific and Technical Information of China (English)
李白云; 赵春霞
2010-01-01
To address the high CPU utilization and large bandwidth overhead of traditional quad-tree scene rendering, a quad-tree scene partitioning and rendering algorithm suited to GPU implementation is proposed. The quad-tree is constructed at run time using textures and pixel shaders, and the geometry shader is used to traverse the quad-tree and partition the scene on the GPU. To overcome the difficulty of implementing the crack-elimination step of existing dynamic construction algorithms on the GPU, the concept of a "transition set" is introduced into the quad-tree construction, which effectively eliminates the cracks that may appear between different resolution levels. Experimental results show that, compared with traditional dynamic construction algorithms, the proposed algorithm is easy to implement on the GPU, requires no CPU intervention, reduces bandwidth overhead, and achieves higher frame rates.
GPU-Accelerated Parallel FDTD on Distributed Heterogeneous Platform
Directory of Open Access Journals (Sweden)
Ronglin Jiang
2014-01-01
Full Text Available This paper introduces a finite difference time domain (FDTD) code written in Fortran and CUDA for realistic electromagnetic calculations with parallelization methods of the Message Passing Interface (MPI) and Open Multiprocessing (OpenMP). Since both Central Processing Unit (CPU) and Graphics Processing Unit (GPU) resources are utilized, a faster execution speed can be reached compared to a traditional pure GPU code. In our experiments, 64 NVIDIA TESLA K20m GPUs and 64 INTEL XEON E5-2670 CPUs are used to carry out the pure CPU, pure GPU, and CPU + GPU tests. Relative to the pure CPU calculations for the same problems, the speedup ratio achieved by CPU + GPU calculations is around 14. Compared to the pure GPU calculations for the same problems, the CPU + GPU calculations have 7.6%–13.2% performance improvement. Because of the small memory size of GPUs, the FDTD problem size is usually very small. However, this code can enlarge the maximum problem size by 25% without reducing the performance of traditional pure GPU code. Finally, using this code, a microstrip antenna array with 16×18 elements is calculated and the radiation patterns are compared with those of MoM. Results show that there is good agreement between them.
Caplan, R. M.
2013-04-01
and both second- and fourth-order differencing in space. The integrators are written to run on NVIDIA GPUs and are interfaced with MATLAB including built-in visualization and analysis tools. Restrictions: The main restriction for the GPU integrators is the amount of RAM on the GPU as the code is currently only designed for running on a single GPU. Unusual features: Ability to visualize real-time simulations through the interaction of MATLAB and the compiled GPU integrators. Additional comments: Setup guide and Installation guide provided. Program has a dedicated web site at www.nlsemagic.com. Running time: A three-dimensional run with a grid dimension of 87×87×203 for 3360 time steps (100 non-dimensional time units) takes about one and a half minutes on a GeForce GTX 580 GPU card.
Suh, Joohyung; Ma, Kevin; Le, Anh
2011-03-01
Multiple Sclerosis (MS) is a disease which is caused by damaged myelin around axons of the brain and spinal cord. Currently, MR imaging is used for diagnosis, but it is highly variable and time-consuming since the lesion detection and estimation of lesion volume are performed manually. For this reason, we developed a CAD (Computer Aided Diagnosis) system which would assist segmentation of MS lesions to facilitate the physician's diagnosis. The MS CAD system utilizes the k-NN (k-nearest neighbor) algorithm to detect and segment the lesion volume in an area based on the voxel. The prototype MS CAD system was developed under the MATLAB environment. Currently, the MS CAD system consumes a huge amount of time to process data. In this paper we will present the development of a second version of the MS CAD system which has been converted into C/C++ in order to take advantage of the GPU (Graphical Processing Unit), which will provide parallel computation. With the realization of C/C++ and utilization of the GPU, we expect to cut running time drastically. The paper investigates the conversion from MATLAB to C/C++ and the utilization of a high-end GPU for parallel computation of data to improve the algorithm performance of the MS CAD.
GPU-based high-performance computing for radiation therapy.
Jia, Xun; Ziegenhein, Peter; Jiang, Steve B
2014-02-21
Recent developments in radiation therapy demand high computational power to solve challenging problems in a timely fashion in a clinical environment. The graphics processing unit (GPU), as an emerging high-performance computing platform, has been introduced to radiotherapy. It is particularly attractive due to its high computational power, small size, and low cost for facility deployment and maintenance. Over the past few years, GPU-based high-performance computing in radiotherapy has experienced rapid developments. A tremendous amount of work has been conducted, in which large acceleration factors compared with the conventional CPU platform have been observed. In this paper, we will first give a brief introduction to the GPU hardware structure and programming model. We will then review the current applications of GPU in major imaging-related and therapy-related problems encountered in radiotherapy. A comparison of GPU with other platforms will also be presented.
Evaluating the Power of GPU Acceleration for IDW Interpolation Algorithm
Directory of Open Access Journals (Sweden)
Gang Mei
2014-01-01
Full Text Available We first present two GPU implementations of the standard Inverse Distance Weighting (IDW) interpolation algorithm, the tiled version that takes advantage of shared memory and the CDP version that is implemented using CUDA Dynamic Parallelism (CDP). Then we evaluate the power of GPU acceleration for the IDW interpolation algorithm by comparing the performance of the CPU implementation with three GPU implementations, that is, the naive version, the tiled version, and the CDP version. Experimental results show that the tiled version has speedups of 120x and 670x over the CPU version when the power parameter p is set to 2 and 3.0, respectively. In addition, compared to the naive GPU implementation, the tiled version is about two times faster. However, the CDP version is 4.8x∼6.0x slower than the naive GPU version, and therefore does not have any potential advantages in practical applications.
GPU in Physics Computation: Case Geant4 Navigation
Seiskari, Otto; Niemi, Tapio
2012-01-01
General purpose computing on graphics processing units (GPU) is a potential method of speeding up scientific computation with low cost and high energy efficiency. We experimented with the particle physics simulation toolkit Geant4 used at CERN to benchmark its geometry navigation functionality on a GPU. The goal was to find out whether Geant4 physics simulations could benefit from GPU acceleration and how difficult it is to modify Geant4 code to run on a GPU. We ported selected parts of Geant4 code to C99 & CUDA and implemented a simple gamma physics simulation utilizing this code to measure efficiency. The performance of the program was tested by running it on two different platforms: an NVIDIA GeForce 470 GTX GPU and a 12-core AMD CPU system. Our conclusion was that GPUs can be a competitive alternative to multi-core computers, but porting existing software in an efficient way is challenging.
Evaluating the power of GPU acceleration for IDW interpolation algorithm.
Mei, Gang
2014-01-01
We first present two GPU implementations of the standard Inverse Distance Weighting (IDW) interpolation algorithm, the tiled version that takes advantage of shared memory and the CDP version that is implemented using CUDA Dynamic Parallelism (CDP). Then we evaluate the power of GPU acceleration for the IDW interpolation algorithm by comparing the performance of the CPU implementation with three GPU implementations, that is, the naive version, the tiled version, and the CDP version. Experimental results show that the tiled version has speedups of 120x and 670x over the CPU version when the power parameter p is set to 2 and 3.0, respectively. In addition, compared to the naive GPU implementation, the tiled version is about two times faster. However, the CDP version is 4.8x ∼ 6.0x slower than the naive GPU version, and therefore does not have any potential advantages in practical applications.
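A minimal CUDA sketch of the naive (non-tiled, non-CDP) IDW variant discussed above: one thread per query point, with weight w_i = 1/d_i^p and interpolated value sum(w_i * z_i) / sum(w_i). Sizes, sample data, and names are illustrative assumptions, not the paper's code.

    // Naive IDW interpolation: one thread per query point.
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cmath>

    __global__ void idw(const float* xs, const float* ys, const float* zs, int n,
                        const float* qx, const float* qy, float* out, int m, float p)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= m) return;
        float num = 0.f, den = 0.f;
        for (int j = 0; j < n; ++j) {                 // visit every sample point
            float dx = qx[i] - xs[j], dy = qy[i] - ys[j];
            float d  = sqrtf(dx * dx + dy * dy);
            if (d < 1e-6f) { num = zs[j]; den = 1.f; break; }   // query hits a sample
            float w = 1.f / powf(d, p);               // inverse-distance weight
            num += w * zs[j];
            den += w;
        }
        out[i] = num / den;
    }

    int main()
    {
        const int n = 4, m = 2; const float p = 2.f;
        float *xs, *ys, *zs, *qx, *qy, *out;
        cudaMallocManaged(&xs, n * sizeof(float)); cudaMallocManaged(&ys, n * sizeof(float));
        cudaMallocManaged(&zs, n * sizeof(float)); cudaMallocManaged(&qx, m * sizeof(float));
        cudaMallocManaged(&qy, m * sizeof(float)); cudaMallocManaged(&out, m * sizeof(float));
        float sx[] = {0, 1, 0, 1}, sy[] = {0, 0, 1, 1}, sz[] = {1, 2, 3, 4};
        for (int j = 0; j < n; ++j) { xs[j] = sx[j]; ys[j] = sy[j]; zs[j] = sz[j]; }
        qx[0] = qy[0] = 0.5f; qx[1] = qy[1] = 0.25f;
        idw<<<(m + 255) / 256, 256>>>(xs, ys, zs, n, qx, qy, out, m, p);
        cudaDeviceSynchronize();
        printf("z(0.50,0.50)=%.3f  z(0.25,0.25)=%.3f\n", out[0], out[1]);
        return 0;
    }

The tiled variant described in the abstract would stage chunks of the sample points in shared memory before the inner loop; the structure of the per-query accumulation stays the same.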
GPU-based ultrafast IMRT plan optimization
Men, Chunhua; Gu, Xuejun; Choi, Dongju; Majumdar, Amitava; Zheng, Ziyi; Mueller, Klaus; Jiang, Steve B.
2009-11-01
The widespread adoption of on-board volumetric imaging in cancer radiotherapy has stimulated research efforts to develop online adaptive radiotherapy techniques to handle the inter-fraction variation of the patient's geometry. Such efforts face major technical challenges to perform treatment planning in real time. To overcome this challenge, we are developing a supercomputing online re-planning environment (SCORE) at the University of California, San Diego (UCSD). As part of the SCORE project, this paper presents our work on the implementation of an intensity-modulated radiation therapy (IMRT) optimization algorithm on graphics processing units (GPUs). We adopt a penalty-based quadratic optimization model, which is solved by using a gradient projection method with Armijo's line search rule. Our optimization algorithm has been implemented in CUDA for parallel GPU computing as well as in C for serial CPU computing for comparison purpose. A prostate IMRT case with various beamlet and voxel sizes was used to evaluate our implementation. On an NVIDIA Tesla C1060 GPU card, we have achieved speedup factors of 20-40 without losing accuracy, compared to the results from an Intel Xeon 2.27 GHz CPU. For a specific nine-field prostate IMRT case with 5 × 5 mm2 beamlet size and 2.5 × 2.5 × 2.5 mm3 voxel size, our GPU implementation takes only 2.8 s to generate an optimal IMRT plan. Our work has therefore solved a major problem in developing online re-planning technologies for adaptive radiotherapy.
GPU-accelerated adjoint algorithmic differentiation
Gremse, Felix; Höfter, Andreas; Razik, Lukas; Kiessling, Fabian; Naumann, Uwe
2016-03-01
Many scientific problems such as classifier training or medical image reconstruction can be expressed as minimization of differentiable real-valued cost functions and solved with iterative gradient-based methods. Adjoint algorithmic differentiation (AAD) enables automated computation of gradients of such cost functions implemented as computer programs. To backpropagate adjoint derivatives, excessive memory is potentially required to store the intermediate partial derivatives on a dedicated data structure, referred to as the "tape". Parallelization is difficult because threads need to synchronize their accesses during taping and backpropagation. This situation is aggravated for many-core architectures, such as Graphics Processing Units (GPUs), because of the large number of light-weight threads and the limited memory size in general as well as per thread. We show how these limitations can be mediated if the cost function is expressed using GPU-accelerated vector and matrix operations which are recognized as intrinsic functions by our AAD software. We compare this approach with naive and vectorized implementations for CPUs. We use four increasingly complex cost functions to evaluate the performance with respect to memory consumption and gradient computation times. Using vectorization, CPU and GPU memory consumption could be substantially reduced compared to the naive reference implementation, in some cases even by an order of complexity. The vectorization allowed usage of optimized parallel libraries during forward and reverse passes which resulted in high speedups for the vectorized CPU version compared to the naive reference implementation. The GPU version achieved an additional speedup of 7.5 ± 4.4, showing that the processing power of GPUs can be utilized for AAD using this concept. Furthermore, we show how this software can be systematically extended for more complex problems such as nonlinear absorption reconstruction for fluorescence-mediated tomography.
Better Faster Noise with the GPU
DEFF Research Database (Denmark)
Wyvill, Geoff; Frisvad, Jeppe Revall
Filtered noise [Perlin 1985] has, for twenty years, been a fundamental tool for creating functional texture and it has many other applications; for example, animating water waves or the motion of grass waving in the wind. Perlin noise suffers from a number of defects and there have been many attempts to create better or faster noise, but Perlin's 'Gradient Noise' has consistently proved to be the best compromise between speed and quality. Our objective was to create a better noise cheaply by use of the GPU.
CFD Computations on Multi-GPU Configurations.
Menon, Sandeep; Perot, Blair
2007-11-01
Programmable graphics processors have shown favorable potential for use in practical CFD simulations -- often delivering a speed-up factor between 3 to 5 times over conventional CPUs. In recent times, most PCs are supplied with the option of installing multiple GPUs on a single motherboard, thereby providing the option of a parallel GPU configuration in a shared-memory paradigm. We demonstrate our implementation of an unstructured CFD solver using a set up which is configured to run two GPUs in parallel, and discuss its performance details.
Using GPU shaders for visualization, part 2.
Bailey, M
2011-01-01
GPU shaders aren't just for special effects. Previously, I looked at some uses for them in visualization. Here, the idea continues. Because visualization relies so much on high speed interaction, we use shaders for the same reason we use them in effects programming: appearance and performance. In the drive to understand large, complex data sets, no method should be overlooked. This article describes two additional visualization applications: line integral convolution (LIC) and terrain bump-mapping. I also comment on the recent (and rapid) changes to OpenGL and what these mean to educators.
Architecting the Finite Element Method Pipeline for the GPU.
Fu, Zhisong; Lewis, T James; Kirby, Robert M; Whitaker, Ross T
2014-02-01
The finite element method (FEM) is a widely employed numerical technique for approximating the solution of partial differential equations (PDEs) in various science and engineering applications. Many of these applications benefit from fast execution of the FEM pipeline. One way to accelerate the FEM pipeline is by exploiting advances in modern computational hardware, such as the many-core streaming processors like the graphical processing unit (GPU). In this paper, we present the algorithms and data-structures necessary to move the entire FEM pipeline to the GPU. First we propose an efficient GPU-based algorithm to generate local element information and to assemble the global linear system associated with the FEM discretization of an elliptic PDE. To solve the corresponding linear system efficiently on the GPU, we implement a conjugate gradient method preconditioned with a geometry-informed algebraic multi-grid (AMG) method preconditioner. We propose a new fine-grained parallelism strategy, a corresponding multigrid cycling stage and efficient data mapping to the many-core architecture of GPU. Comparison of our on-GPU assembly versus a traditional serial implementation on the CPU achieves up to an 87 × speedup. Focusing on the linear system solver alone, we achieve a speedup of up to 51 × versus use of a comparable state-of-the-art serial CPU linear system solver. Furthermore, the method compares favorably with other GPU-based, sparse, linear solvers.
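A minimal sketch of the scatter-style GPU assembly idea (not the paper's pipeline, which targets elliptic PDEs with an AMG-preconditioned conjugate gradient solver): one thread per element adds its 2×2 local stiffness for 1D linear elements into a dense global matrix with atomicAdd. The dense storage, uniform element length, and tiny problem size are simplifying assumptions.

```cuda
// Toy sketch of GPU assembly (not the paper's pipeline): one thread per element scatters its
// 2x2 local stiffness for 1D linear elements into a dense global matrix with atomicAdd.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void assemble_1d(float* K, int n_elem, float h) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= n_elem) return;
    int n_nodes = n_elem + 1;
    float k = 1.0f / h;                          // local stiffness is [[k,-k],[-k,k]]
    int a = e, b = e + 1;                        // global node numbers of this element
    atomicAdd(&K[a * n_nodes + a],  k);
    atomicAdd(&K[a * n_nodes + b], -k);
    atomicAdd(&K[b * n_nodes + a], -k);
    atomicAdd(&K[b * n_nodes + b],  k);
}

int main() {
    const int n_elem = 4, n_nodes = n_elem + 1;
    float* dK;
    cudaMalloc(&dK, n_nodes * n_nodes * sizeof(float));
    cudaMemset(dK, 0, n_nodes * n_nodes * sizeof(float));
    assemble_1d<<<1, 32>>>(dK, n_elem, 0.25f);   // 4 elements of length 0.25

    float hK[n_nodes * n_nodes];
    cudaMemcpy(hK, dK, sizeof(hK), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n_nodes; ++i) {          // print the assembled tridiagonal matrix
        for (int j = 0; j < n_nodes; ++j) printf("%6.1f ", hK[i * n_nodes + j]);
        printf("\n");
    }
    cudaFree(dK);
    return 0;
}
```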
Efficient implementation of MrBayes on multi-GPU.
Bao, Jie; Xia, Hongju; Zhou, Jianfu; Liu, Xiaoguang; Wang, Gang
2013-06-01
MrBayes, using Metropolis-coupled Markov chain Monte Carlo (MCMCMC or (MC)^3), is a popular program for Bayesian inference. As a leading method of using DNA data to infer phylogeny, the (MC)^3 Bayesian algorithm and its improved and parallel versions are still not fast enough for biologists to analyze massive real-world DNA data. Recently, the graphics processing unit (GPU) has shown its power as a coprocessor (or rather, an accelerator) in many fields. This article describes an efficient implementation, a(MC)^3 (aMCMCMC), for MrBayes (MC)^3 on the compute unified device architecture (CUDA). By dynamically adjusting the task granularity to adapt to input data size and hardware configuration, it makes full use of GPU cores with different data sets. An adaptive method is also developed to split and combine DNA sequences to make full use of a large number of GPU cards. Furthermore, a new "node-by-node" task scheduling strategy is developed to improve concurrency, and several optimizing methods are used to reduce extra overhead. Experimental results show that a(MC)^3 achieves up to 63× speedup over serial MrBayes on a single machine with one GPU card, up to 170× speedup with four GPU cards, and up to 478× speedup with a 32-node GPU cluster. a(MC)^3 is dramatically faster than all the previous (MC)^3 algorithms and scales well to large GPU clusters.
Performance evaluation of image processing algorithms on the GPU.
Castaño-Díez, Daniel; Moser, Dominik; Schoenegger, Andreas; Pruggnaller, Sabine; Frangakis, Achilleas S
2008-10-01
The graphics processing unit (GPU), which originally was used exclusively for visualization purposes, has evolved into an extremely powerful co-processor. In the meantime, through the development of elaborate interfaces, the GPU can be used to process data and deal with computationally intensive applications. The speed-up factors attained compared to the central processing unit (CPU) are dependent on the particular application, as the GPU architecture gives the best performance for algorithms that exhibit high data parallelism and high arithmetic intensity. Here, we evaluate the performance of the GPU on a number of common algorithms used for three-dimensional image processing. The algorithms were developed on a new software platform called "CUDA", which allows a direct translation from C code to the GPU. The implemented algorithms include spatial transformations, real-space and Fourier operations, as well as pattern recognition procedures, reconstruction algorithms and classification procedures. In our implementation, the direct porting of C code to the GPU achieves typical acceleration values on the order of 10-20 times compared to a state-of-the-art conventional processor, but they vary depending on the type of the algorithm. The gained speed-up comes with no additional costs, since the software runs on the GPU of the graphics card of common workstations.
Accelerating the XGBoost algorithm using GPU computing
Directory of Open Access Journals (Sweden)
Rory Mitchell
2017-07-01
We present a CUDA-based implementation of a decision tree construction algorithm within the gradient boosting library XGBoost. The tree construction algorithm is executed entirely on the graphics processing unit (GPU) and shows high performance with a variety of datasets and settings, including sparse input matrices. Individual boosting iterations are parallelised, combining two approaches. An interleaved approach is used for shallow trees, switching to a more conventional radix sort-based approach for larger depths. We show speedups of between 3× and 6× using a Titan X compared to a 4-core i7 CPU, and 1.2× using a Titan X compared to 2× Xeon CPUs (24 cores). We show that it is possible to process the Higgs dataset (10 million instances, 28 features) entirely within GPU memory. The algorithm is made available as a plug-in within the XGBoost library and fully supports all XGBoost features including classification, regression and ranking tasks.
Image stylization with enhanced structure on GPU
Institute of Scientific and Technical Information of China (English)
LI Ping; SUN HanQiu; SHENG Bin; SHEN JianBing
2012-01-01
This paper presents a graphics processing unit (GPU) based stylization approach that preserves the fine structure between the original and the stylized images using gradient optimization. Existing abstraction and painterly stylization methods focused on contrast manipulation only, and thus the detailed salient structures of the input images are always destroyed when performing the current stylization techniques because of limitations like unavoidable salience information loss caused by contrast abstraction. We propose an image structure map to naturally model the fine structure existing in the original images. Gradient-based structure tangent generation and tangent-guided image morphology are used to construct the structure map. The image structure map, unlike an edge map, not only systematically models the boundary information within the imagery but also accentuates the underlying inner structure detail for further stylization. We facilitate the final stylization via parallel bilateral grid and structure-aware stylizing optimization on a GPU-CUDA platform in real time. In multiple experiments, the proposed method consistently demonstrates efficient and high quality image stylization performance.
Parallel hyperspectral compressive sensing method on GPU
Bernabé, Sergio; Martín, Gabriel; Nascimento, José M. P.
2015-10-01
Remote hyperspectral sensors collect large amounts of data per flight, usually with low spatial resolution. Since the bandwidth of the connection between the satellite/airborne platform and the ground station is limited, an onboard compression method is desirable to reduce the amount of data to be transmitted. This paper presents a parallel implementation of a compressive sensing method, called parallel hyperspectral coded aperture (P-HYCA), for graphics processing units (GPUs) using the compute unified device architecture (CUDA). This method takes into account two main properties of hyperspectral datasets, namely the high correlation existing among the spectral bands and the generally low number of endmembers needed to explain the data, which largely reduces the number of measurements necessary to correctly reconstruct the original data. Experimental results conducted using synthetic and real hyperspectral datasets on two different GPU architectures by NVIDIA, GeForce GTX 590 and GeForce GTX TITAN, reveal that the use of GPUs can provide real-time compressive sensing performance. The achieved speedup is up to 20 times when compared with the processing time of HYCA running on one core of an Intel i7-2600 CPU (3.4 GHz) with 16 Gbyte of memory.
GPU Computing to Improve Game Engine Performance
Directory of Open Access Journals (Sweden)
Abu Asaduzzaman
2014-07-01
Although the graphics processing unit (GPU) was originally designed to accelerate the image creation for output to display, today's general purpose GPU (GPGPU) computing offers unprecedented performance by offloading computing-intensive portions of the application to the GPGPU, while running the remainder of the code on the central processing unit (CPU). The highly parallel structure of a many-core GPGPU can process large blocks of data faster using multithreaded concurrent processing. A game engine has many "components" and multithreading can be used to implement their parallelism. However, effective implementation of multithreading in a multicore processor has challenges, such as data and task parallelism. In this paper, we investigate the impact of using a GPGPU with a CPU to design high-performance game engines. First, we implement a separable convolution filter (heavily used in image processing) with the GPGPU. Then, we implement a multiobject interactive game console in an eight-core workstation using a multithreaded asynchronous model (MAM), a multithreaded synchronous model (MSM), and an MSM with data parallelism (MSMDP). According to the experimental results, speedup of about 61x and 5x is achieved due to the GPGPU and MSMDP implementation, respectively. Therefore, GPGPU-assisted parallel computing has the potential to improve multithreaded game engine performance.
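To make the separable-convolution component concrete (a hedged sketch, not the paper's game-engine code), the CUDA example below implements only the horizontal pass of a separable Gaussian blur; the image size, kernel radius, and weights are illustrative assumptions, and a vertical pass over the intermediate result would complete the filter.

```cuda
// Sketch of a separable convolution (horizontal pass only); image size and weights assumed.
#include <cstdio>
#include <cuda_runtime.h>

#define RADIUS 2
__constant__ float c_weights[2 * RADIUS + 1] = {0.0625f, 0.25f, 0.375f, 0.25f, 0.0625f};

__global__ void convolve_rows(const float* in, float* out, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    float sum = 0.0f;
    for (int k = -RADIUS; k <= RADIUS; ++k) {
        int xs = min(max(x + k, 0), width - 1);      // clamp at the image border
        sum += c_weights[k + RADIUS] * in[y * width + xs];
    }
    out[y * width + x] = sum;                        // a second, vertical pass would follow
}

int main() {
    const int W = 64, H = 64;
    float h_in[W * H];
    for (int i = 0; i < W * H; ++i) h_in[i] = (i % W < W / 2) ? 1.0f : 0.0f;  // step edge

    float *d_in, *d_out;
    cudaMalloc(&d_in, sizeof(h_in)); cudaMalloc(&d_out, sizeof(h_in));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    dim3 block(16, 16), grid((W + 15) / 16, (H + 15) / 16);
    convolve_rows<<<grid, block>>>(d_in, d_out, W, H);

    float h_out[W * H];
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("blurred row 0 around the edge: %g %g %g\n",
           h_out[W / 2 - 1], h_out[W / 2], h_out[W / 2 + 1]);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```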
Directory of Open Access Journals (Sweden)
Ma Ning
2013-09-01
Purpose: Nowadays, governments around the world are actively constructing high-speed railways; it is therefore significant to do research on this increasingly prevalent mode of transport. Design/methodology/approach: In this paper, we simulate the process of the passenger's travel mode choice by adjusting the ticket fare and the run-time, based on a multi-agent system (MAS). Findings: From the research we conclude that increasing the run-time appropriately and reducing the ticket fare to some extent are effective ways to enhance the passenger share of the high-speed railway. Originality/value: We hope it can provide policy recommendations for the railway sectors in developing long-term plans for high-speed railway in the future.
Directory of Open Access Journals (Sweden)
Kearney David
2007-01-01
We show how the limited electrical power and FPGA compute resources available in a swarm of small UAVs can be shared by moving FPGA tasks from one UAV to another. A software and hardware infrastructure that supports the mobility of embedded FPGA applications on a single FPGA chip and across a group of networked FPGA chips is an integral part of the work described here. It is shown how to allocate a single FPGA's resources at run time and to share a single device through the use of application checkpointing, a memory controller, and an on-chip run-time reconfigurable network. A prototype distributed operating system is described for managing mobile applications across the swarm based on the contents of a fuzzy rule base. It can move applications between UAVs in order to equalize power use or to enable the continuous replenishment of fully fueled planes into the swarm.
Bayesian Lasso and multinomial logistic regression on GPU.
Češnovar, Rok; Štrumbelj, Erik
2017-01-01
We describe an efficient Bayesian parallel GPU implementation of two classic statistical models: the Lasso and multinomial logistic regression. We focus on parallelizing the key components: matrix multiplication, matrix inversion, and sampling from the full conditionals. Our GPU implementations of Bayesian Lasso and multinomial logistic regression achieve 100-fold speedups on mid-level and high-end GPUs. Substantial speedups of 25-fold can also be achieved on older and lower-end GPUs. Samplers are implemented in OpenCL and can be used on any type of GPU and other types of computational units, thereby being convenient and advantageous in practice compared to related work.
Work-Efficient Parallel Skyline Computation for the GPU
DEFF Research Database (Denmark)
Bøgh, Kenneth Sejdenfaden; Chester, Sean; Assent, Ira
2015-01-01
The graphics processing unit (GPU) offers the potential for parallelizing skyline computation across thousands of cores. However, attempts to port skyline algorithms to the GPU have prioritized throughput and failed to outperform sequential algorithms. In this paper, we introduce a new skyline algorithm, designed for the GPU, that uses a global, static partitioning scheme. With the partitioning, we can permit controlled branching to exploit transitive relationships and avoid most point-to-point comparisons. The result is a non-traditional GPU algorithm, SkyAlign, that prioritizes work-efficiency and respectable throughput, rather than...
GPU real-time processing in NA62 trigger system
Ammendola, R.; Biagioni, A.; Chiozzi, S.; Cretaro, P.; Di Lorenzo, S.; Fantechi, R.; Fiorini, M.; Frezza, O.; Lamanna, G.; Lo Cicero, F.; Lonardo, A.; Martinelli, M.; Neri, I.; Paolucci, P. S.; Pastorelli, E.; Piandani, R.; Piccini, M.; Pontisso, L.; Rossetti, D.; Simula, F.; Sozzi, M.; Vicini, P.
2017-01-01
A commercial Graphics Processing Unit (GPU) is used to build a fast Level 0 (L0) trigger system tested parasitically with the TDAQ (Trigger and Data Acquisition) system of the NA62 experiment at CERN. In particular, the parallel computing power of the GPU is exploited to perform real-time fitting in the Ring Imaging CHerenkov (RICH) detector. Direct GPU communication using an FPGA-based board has been used to reduce the data transmission latency. The performance of the system for multi-ring reconstruction obtained during the NA62 physics run will be presented.
GPU Linear algebra extensions for GNU/Octave
Bosi, L. B.; Mariotti, M.; Santocchia, A.
2012-06-01
Octave is one of the most widely used open source tools for numerical analysis and linear algebra. Our project aims to improve Octave by introducing support for GPU computing in order to speed up some linear algebra operations. The core of our work is a C library that executes some BLAS operations concerning vector-vector, vector-matrix and matrix-matrix functions on the GPU. OpenCL functions are used to program GPU kernels, which are bound within the GNU/Octave framework. We report the project implementation design and some preliminary results about performance.
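The project binds OpenCL BLAS kernels into Octave; purely as an analogous illustration in CUDA (not the project's code), the sketch below offloads a single BLAS operation, SAXPY, to the GPU through NVIDIA's cuBLAS. The vector length is an arbitrary assumption, and the program is compiled with nvcc and linked against cuBLAS.

```cuda
// Analogy only: one BLAS call (y = a*x + y) offloaded to the GPU via cuBLAS.
// build (assumed): nvcc saxpy_demo.cu -lcublas
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int N = 1 << 20;
    const float alpha = 2.0f;
    std::vector<float> hx(N, 1.0f), hy(N, 3.0f);

    float *dx, *dy;
    cudaMalloc(&dx, N * sizeof(float));
    cudaMalloc(&dy, N * sizeof(float));
    cudaMemcpy(dx, hx.data(), N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), N * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSaxpy(handle, N, &alpha, dx, 1, dy, 1);   // y = 2*x + y, executed on the GPU

    cudaMemcpy(hy.data(), dy, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %g (expected 5)\n", hy[0]);

    cublasDestroy(handle);
    cudaFree(dx); cudaFree(dy);
    return 0;
}
```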
Parallelization and checkpointing of GPU applications through program transformation
Energy Technology Data Exchange (ETDEWEB)
Solano-Quinde, Lizandro Damian [Iowa State Univ., Ames, IA (United States)
2012-01-01
GPUs have emerged as a powerful tool for accelerating general-purpose applications. The availability of programming languages that make writing general-purpose applications for GPUs tractable has consolidated GPUs as an alternative for accelerating general-purpose applications. Among the areas that have benefited from GPU acceleration are: signal and image processing, computational fluid dynamics, quantum chemistry, and, in general, the High Performance Computing (HPC) industry. In order to continue to exploit higher levels of parallelism with GPUs, multi-GPU systems are gaining popularity. In this context, single-GPU applications are parallelized for running on multi-GPU systems. Furthermore, multi-GPU systems help to overcome the GPU memory limitation for applications with a large memory footprint. Parallelizing single-GPU applications has been approached with libraries that distribute the workload at runtime; however, they impose execution overhead and are not portable. On the other hand, on traditional CPU systems, parallelization has been approached through application transformation at pre-compile time, which enhances the application to distribute the workload at the application level and does not have the issues of library-based approaches. Hence, a parallelization scheme for GPU systems based on application transformation is needed. Like any computing engine of today, reliability is also a concern in GPUs. GPUs are vulnerable to transient and permanent failures. Current checkpoint/restart techniques are not suitable for systems with GPUs. Checkpointing for GPU systems presents new and interesting challenges, primarily due to the natural differences imposed by the hardware design, the memory subsystem architecture, the massive number of threads, and the limited amount of synchronization among threads. Therefore, a checkpoint/restart technique suitable for GPU systems is needed. The goal of this work is to exploit higher levels of parallelism and
GPU-based Ray Tracing of Dynamic Scenes
Directory of Open Access Journals (Sweden)
Christopher Lux
2010-08-01
Interactive ray tracing of non-trivial scenes is just becoming feasible on single graphics processing units (GPUs). Recent work in this area focuses on building effective acceleration structures, which work well under the constraints of current GPUs. Most approaches are targeted at static scenes and only allow navigation in the virtual scene. So far support for dynamic scenes has not been considered for GPU implementations. We have developed a GPU-based ray tracing system for dynamic scenes consisting of a set of individual objects. Each object may independently move around, but its geometry and topology are static.
Wong, Un-Hong; Aoki, Takayuki; Wong, Hon-Cheng
2014-07-01
Modern graphics processing units (GPUs) have been widely utilized in magnetohydrodynamic (MHD) simulations in recent years. Due to the limited memory of a single GPU, distributed multi-GPU systems need to be explored for large-scale MHD simulations. However, the data transfer between GPUs bottlenecks the efficiency of the simulations on such systems. In this paper we propose a novel GPU Direct-MPI hybrid approach to address this problem for overall performance enhancement. Our approach consists of two strategies: (1) We exploit GPU Direct 2.0 to speed up the data transfers between multiple GPUs in a single node and reduce the total number of message passing interface (MPI) communications; (2) We design Compute Unified Device Architecture (CUDA) kernels instead of using memory copy to speed up the fragmented data exchange in the three-dimensional (3D) decomposition. 3D decomposition is usually not preferable for distributed multi-GPU systems due to the low efficiency of the fragmented data exchange. Our approach has made a breakthrough to make 3D decomposition available on distributed multi-GPU systems. As a result, it can reduce the memory usage and computation time of each partition of the computational domain. Experimental results show twice the FLOPS compared to the common 2D-decomposition, MPI-only implementation. The proposed approach has been developed into an efficient implementation for MHD simulations on distributed multi-GPU systems, called the MGPU-MHD code. The code realizes the GPU parallelization of a total variation diminishing (TVD) algorithm for solving the multidimensional ideal MHD equations, extending our work from single-GPU computation (Wong et al., 2011) to multiple GPUs. Numerical tests and performance measurements are conducted on the TSUBAME 2.0 supercomputer at the Tokyo Institute of Technology. Our code achieves 2 TFLOPS in double precision for the problem with 1200^3 grid points using 216 GPUs.
GPU-S2S: a source-to-source compiler for GPU
Institute of Scientific and Technical Information of China (English)
李丹; 曹海军; 董小社; 张保
2012-01-01
To address the poor software portability and programmability of the graphics processing unit (GPU), and to facilitate the development of parallel programs on the GPU, this study proposed a novel directive-based, compiler-guided approach. GPU-S2S, a prototype tool for automatic source-to-source translation, was then implemented by combining automatic mapping with static compilation configuration; it is capable of translating a sequential C program with directives into a compute unified device architecture (CUDA) program. The experimental results show that CUDA code generated by GPU-S2S achieves performance comparable to that of the CUDA benchmarks provided in the NVIDIA CUDA SDK, and gives significant performance improvements over the original sequential C code running on the CPU.
GPU-BSM: a GPU-based tool to map bisulfite-treated reads.
Manconi, Andrea; Orro, Alessandro; Manca, Emanuele; Armano, Giuliano; Milanesi, Luciano
2014-01-01
Cytosine DNA methylation is an epigenetic mark implicated in several biological processes. Bisulfite treatment of DNA is acknowledged as the gold standard technique to study methylation. This technique introduces changes in the genomic DNA by converting cytosines to uracils while 5-methylcytosines remain nonreactive. During PCR amplification 5-methylcytosines are amplified as cytosine, whereas uracils and thymines as thymine. To detect the methylation levels, reads treated with the bisulfite must be aligned against a reference genome. Mapping these reads to a reference genome represents a significant computational challenge mainly due to the increased search space and the loss of information introduced by the treatment. To deal with this computational challenge we devised GPU-BSM, a tool based on modern Graphics Processing Units. Graphics Processing Units are hardware accelerators that are increasingly being used successfully to accelerate general-purpose scientific applications. GPU-BSM is a tool able to map bisulfite-treated reads from whole genome bisulfite sequencing and reduced representation bisulfite sequencing, and to estimate methylation levels, with the goal of detecting methylation. Due to the massive parallelization obtained by exploiting graphics cards, GPU-BSM aligns bisulfite-treated reads faster than other cutting-edge solutions, while outperforming most of them in terms of unique mapped reads.
Fast calculation of HELAS amplitudes using graphics processing unit (GPU)
Hagiwara, K; Okamura, N; Rainwater, D L; Stelzer, T
2009-01-01
We use the graphics processing unit (GPU) for fast calculations of helicity amplitudes of physics processes. As our first attempt, we compute $u\\overline{u}\to n\\gamma$ ($n=2$ to 8) processes in $pp$ collisions at $\\sqrt{s} = 14$ TeV by transferring the MadGraph generated HELAS amplitudes (FORTRAN) into newly developed HEGET ({\\bf H}ELAS {\\bf E}valuation with {\\bf G}PU {\\bf E}nhanced {\\bf T}echnology) codes written in CUDA, a C-platform developed by NVIDIA for general purpose computing on the GPU. Compared with the usual CPU programs, we obtain 40-150 times better performance on the GPU.
Local alignment tool based on Hadoop framework and GPU architecture.
Hung, Che-Lun; Hua, Guan-Jie
2014-01-01
With the rapid growth of next-generation sequencing technologies, such as Slex, more and more data have been discovered and published. To analyze such huge data, computational performance is an important issue. Recently, many tools, such as SOAP, have been implemented on Hadoop and GPU parallel computing architectures. BLASTP is an important tool, implemented on GPU architectures, for biologists to compare protein sequences. To deal with big biology data, it is hard to rely on a single GPU. Therefore, we implement a distributed BLASTP by combining Hadoop and multiple GPUs. The experimental results show that the proposed method can improve on the performance of BLASTP on a single GPU, and it can also achieve high availability and fault tolerance.
Importance of Explicit Vectorization for CPU and GPU Software Performance
Dickson, Neil G; Hamze, Firas
2010-01-01
Much of the current focus in high-performance computing is on multi-threading, multi-computing, and graphics processing unit (GPU) computing. However, vectorization and non-parallel optimization techniques, which can often be employed additionally, are less frequently discussed. In this paper, we present an analysis of several optimizations done on both central processing unit (CPU) and GPU implementations of a particular computationally intensive Metropolis Monte Carlo algorithm. Explicit vectorization on the CPU and the equivalent, explicit memory coalescing, on the GPU are found to be critical to achieving good performance of this algorithm in both environments. The fully-optimized CPU version achieves a 9x to 12x speedup over the original CPU version, in addition to speedup from multi-threading. This is 2x faster than the fully-optimized GPU version.
Connectivity-Based Segmentation for GPU-Accelerated Mesh Decompression
Institute of Scientific and Technical Information of China (English)
Jie-Yi Zhao; Min Tang; Ruo-Feng Tong
2012-01-01
We present a novel algorithm to partition large 3D meshes for GPU-accelerated decompression. Our formulation focuses on minimizing the replicated vertices between patches, and balancing the numbers of faces of patches for efficient parallel computing. First we generate a topology model of the original mesh and remove vertex positions. Then we assign the centers of patches using geodesic farthest point sampling and cluster the faces according to the geodesic distance to the centers. After the segmentation we swap boundary faces to fix jagged boundaries and store the boundary vertices for whole-mesh preservation. The decompression of each patch runs on a thread of the GPU, and we evaluate its performance on various large benchmarks. In practice, the GPU-based decompression algorithm runs more than 48x faster on an NVIDIA GeForce GTX 580 GPU compared with the CPU using a single core.
GPU Accelerated Semiclassical Initial Value Representation Molecular Dynamics
Tamascelli, Dario; Conte, Riccardo; Ceotto, Michele
2013-01-01
This paper presents a graphics processing unit (GPU) implementation of the semiclassical initial value representation (SC-IVR) propagator for vibrational molecular spectroscopy calculations. The time-averaging formulation of the SC-IVR for power spectrum calculations is employed. Details about the CUDA implementation of the semiclassical code are provided. Four molecules with an increasing number of atoms are considered and the GPU-calculated vibrational frequencies perfectly match the benchmark values. The computational time scaling of two GPUs (C2075 and K20) versus two CPUs (Intel Core i5 and Intel Xeon E5-2687W) shows that the CPU code scales linearly, whereas the GPU CUDA code scales roughly constantly for most of the trajectory range considered. Critical issues related to the GPU implementation are discussed. The resulting reduction in computational time and power consumption is significant and semiclassical GPU calculations are shown to be environmentally friendly.
3D-VISUALIZATION BY RAYTRACING IMAGE SYNTHESIS ON GPU
Directory of Open Access Journals (Sweden)
Al-Oraiqat Anas M.
2016-06-01
This paper presents a realization of an approach to spatial stereo visualization of 3D images using a parallel graphics processing unit (GPU). Experiments on synthesizing images of a 3D scene by ray tracing on a GPU with the Compute Unified Device Architecture (CUDA) have shown that approximately 60% of the time is spent on solving the computational problem, while the major part of the remainder (40%) is spent on transferring data between the central processing unit and the GPU and on organizing the visualization process. A study of how increasing the size of the GPU grid affects calculation speed showed the importance of correctly structuring the parallel computing network and of the general parallelization mechanism.
GPU accelerated numerical simulations of viscoelastic phase separation model.
Yang, Keda; Su, Jiaye; Guo, Hongxia
2012-07-05
We introduce a complete implementation of a viscoelastic model for numerical simulations of the phase separation kinetics in dynamically asymmetric systems, such as polymer blends and polymer solutions, on a graphics processing unit (GPU) using the CUDA language, and discuss the algorithms and optimizations in detail. From studies of a polymer solution, we show that the GPU-based implementation can correctly predict the accepted results and provides about a 190 times speedup over a single central processing unit (CPU). Further accuracy analysis demonstrates that both the single and the double precision calculations on the GPU are sufficient to produce high-quality results in numerical simulations of the viscoelastic model. Therefore, the GPU-based viscoelastic model is very promising for studying many phase separation processes of experimental and theoretical interest that often take place on large length and time scales and are not easily addressed by a conventional implementation running on a single CPU.
Acceleration of a QM/MM-QMC simulation using GPU.
Uejima, Yutaka; Terashima, Tomoharu; Maezono, Ryo
2011-07-30
We accelerated an ab initio molecular QMC calculation by using GPGPU. Only the bottleneck part of the calculation is replaced by a CUDA subroutine and performed on the GPU. The performance on a (single core CPU + GPU) is compared with that on a (single core CPU with double precision), getting 23.6 (11.0) times faster calculations in single (double) precision treatments on the GPU. The energy deviation caused by the single precision treatment was found to be within the accuracy required in the calculation, ∼10^-5 hartree. The accelerated computational nodes mounting GPUs are combined to form a hybrid MPI cluster, on which we confirmed that the performance scales linearly with the number of nodes.
The Coming Role of GPU in Computational Geodynamics (Invited)
Yuen, D. A.; Knepley, M. G.; Erlebacher, G.; Wright, G. B.
2009-12-01
With the proliferation of the GPU (graphics accelerator board), the computing landscape has changed enormously in the last 3 years. The new additional capabilities of the GPU, such as larger shared memories and load-store operations, allow it to be considered as a viable stand-alone computational and visualization engine. Today the massive threading and computing capability of the GPU can deliver at least an order of magnitude better performance than the multi-core CPU architecture. The cost of a GPU system is also cheaper than a CPU cluster by more than an order of magnitude. The introduction of CUDA and ancillary software aids, such as Jackets, has allowed the rapid translation of many venerable codes into software usable on the GPU. We will discuss our experience acquired over the past year in attacking five different computational problems in the geosciences using the GPU. They include (1.) 3-D seismic wave propagation with the spectral-element method; (2.) the 2-D shallow water equations as applied to tsunami wave propagation, using finite differences; (3.) 3-D mantle convection with constant viscosity using a 4th-order compact finite-difference operator; (4.) the non-linear heat-diffusion equation in 2-D using a collocation method based on radial basis functions over an elliptical area, where grid points are divided so as to lie on a centroidal Voronoi mesh and derivatives are calculated at each grid point using a point-dependent stencil computed from the nearest neighbors; and (5.) Stokes flow with variable viscosity by means of a pre-conditioner calculated on the GPU based on the vortex method using Green's functions, along with the radial basis functions and the fast multi-pole method, with the Krylov method used on the CPU for the final iterative step. We will discuss the relative speed-ups of the GPU over the CPU in each of these cases. We will point out the need to go to a more computationally intensive mode with multiple GPUs, which calls for key CPUs to control the message
LDPC Decoding on GPU for Mobile Device
Directory of Open Access Journals (Sweden)
Yiqin Lu
2016-01-01
A flexible software LDPC decoder that exploits data parallelism for simultaneous decoding of multiple codewords on a mobile device is proposed in this paper, supported by multithreading on OpenCL-based graphics processing units. By dividing the check matrix into several parts to make full use of both the local memory and private memory on the GPU, and by properly adjusting the code capacity each time, our implementation on a mobile phone shows throughputs above 100 Mbps and a decoding delay of less than 1.6 milliseconds, which make high-speed communication like video calling possible. To realize efficient software LDPC decoding on the mobile device, the LDPC decoding function on the communication baseband chip could be replaced, saving cost and making it easier to upgrade the decoder to be compatible with a variety of channel access schemes.
GPU-accelerated micromagnetic simulations using cloud computing
Jermain, C. L.; Rowlands, G. E.; Buhrman, R. A.; Ralph, D. C.
2016-03-01
Highly parallel graphics processing units (GPUs) can improve the speed of micromagnetic simulations significantly as compared to conventional computing using central processing units (CPUs). We present a strategy for performing GPU-accelerated micromagnetic simulations by utilizing cost-effective GPU access offered by cloud computing services with an open-source Python-based program for running the MuMax3 micromagnetics code remotely. We analyze the scaling and cost benefits of using cloud computing for micromagnetics.
GPU-based large-scale visualization
Hadwiger, Markus
2013-11-19
Recent advances in image and volume acquisition as well as computational advances in simulation have led to an explosion of the amount of data that must be visualized and analyzed. Modern techniques combine the parallel processing power of GPUs with out-of-core methods and data streaming to enable the interactive visualization of giga- and terabytes of image and volume data. A major enabler for interactivity is making both the computational and the visualization effort proportional to the amount of data that is actually visible on screen, decoupling it from the full data size. This leads to powerful display-aware multi-resolution techniques that enable the visualization of data of almost arbitrary size. The course consists of two major parts: An introductory part that progresses from fundamentals to modern techniques, and a more advanced part that discusses details of ray-guided volume rendering, novel data structures for display-aware visualization and processing, and the remote visualization of large online data collections. You will learn how to develop efficient GPU data structures and large-scale visualizations, implement out-of-core strategies and concepts such as virtual texturing that have only been employed recently, as well as how to use modern multi-resolution representations. These approaches reduce the GPU memory requirements of extremely large data to a working set size that fits into current GPUs. You will learn how to perform ray-casting of volume data of almost arbitrary size and how to render and process gigapixel images using scalable, display-aware techniques. We will describe custom virtual texturing architectures as well as recent hardware developments in this area. We will also describe client/server systems for distributed visualization, on-demand data processing and streaming, and remote visualization. We will describe implementations using OpenGL as well as CUDA, exploiting parallelism on GPUs combined with additional asynchronous
Real-Time Incompressible Fluid Simulation on the GPU
Directory of Open Access Journals (Sweden)
Xiao Nie
2015-01-01
We present a parallel framework for simulating incompressible fluids with predictive-corrective incompressible smoothed particle hydrodynamics (PCISPH) on the GPU in real time. To this end, we propose an efficient GPU streaming pipeline to map the entire computational task onto the GPU, fully exploiting the massive computational power of state-of-the-art GPUs. In PCISPH-based simulations, neighbor search is the major performance obstacle because this process is performed several times at each time step. To eliminate this bottleneck, an efficient parallel sorting method for this time-consuming step is introduced. Moreover, we discuss several optimization techniques including using fast on-chip shared memory to avoid global memory bandwidth limitations and thus further improve performance on modern GPU hardware. With our framework, the realism of real-time fluid simulation is significantly improved since our method enforces the incompressibility constraint, which is typically ignored for efficiency reasons in previous GPU-based SPH methods. The performance results illustrate that our approach can efficiently simulate realistic incompressible fluid in real time and results in a speed-up factor of up to 23 on a high-end NVIDIA GPU in comparison to a single-threaded CPU-based implementation.
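One ingredient of the neighbor search discussed above can be sketched as follows (an assumption-laden illustration, not the paper's pipeline): each particle is mapped to a uniform-grid cell index, and sorting particles by this key groups spatial neighbors together in memory. Grid dimensions, cell size, and particle positions are made up for the example.

```cuda
// Sketch of spatial hashing for GPU neighbor search: compute a per-particle cell key.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void cell_indices(const float3* pos, unsigned int* keys, int n,
                             float cell, int nx, int ny, int nz) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int cx = min(max(int(pos[i].x / cell), 0), nx - 1);
    int cy = min(max(int(pos[i].y / cell), 0), ny - 1);
    int cz = min(max(int(pos[i].z / cell), 0), nz - 1);
    keys[i] = (unsigned int)((cz * ny + cy) * nx + cx);   // linearized cell id, the sort key
}

int main() {
    const int N = 4;
    float3 h_pos[N] = {{0.1f, 0.1f, 0.1f}, {0.9f, 0.1f, 0.1f},
                       {0.1f, 0.9f, 0.1f}, {0.9f, 0.9f, 0.9f}};
    float3* d_pos; unsigned int* d_keys;
    cudaMalloc(&d_pos, sizeof(h_pos));
    cudaMalloc(&d_keys, N * sizeof(unsigned int));
    cudaMemcpy(d_pos, h_pos, sizeof(h_pos), cudaMemcpyHostToDevice);

    cell_indices<<<1, 32>>>(d_pos, d_keys, N, 0.5f, 2, 2, 2);   // 2x2x2 grid, cell size 0.5

    unsigned int h_keys[N];
    cudaMemcpy(h_keys, d_keys, sizeof(h_keys), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; ++i) printf("particle %d -> cell %u\n", i, h_keys[i]);
    // A radix sort of (key, particle index) pairs, e.g. with thrust::sort_by_key,
    // would complete this preprocessing step of the neighbor search.
    cudaFree(d_pos); cudaFree(d_keys);
    return 0;
}
```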
DEFF Research Database (Denmark)
Zhang, Zhe; Thomsen, Ole Cornelius; Andersen, Michael A. E.
2009-01-01
In this paper, an extended run-time DC UPS system structure with fuel cell and supercapacitor is investigated. A wide-input-range bi-directional dc-dc converter is described along with the phase-shift modulation scheme and phase-shift with duty-cycle control in different modes. The input voltage is combined with load-current feedback using a PI controller with an anti-windup scheme to realize closed-loop control of the whole system, and the feasibility of the proposed control scheme is verified by simulation. A 1 kW prototype controlled by a TMS320F2808 DSP is implemented and tested. Experimental...
Parallel Implementation of Similarity Measures on GPU Architecture using CUDA
Directory of Open Access Journals (Sweden)
Kuldeep Yadav
2012-02-01
Image processing and pattern recognition algorithms take more time to execute on a single-core processor. Graphics processing units (GPUs) are now popular because of their speed, programmability, low cost, and large number of built-in execution cores. Many researchers have started using GPUs alongside single-core computer systems to speed up the execution of algorithms; in the field of content-based medical image retrieval (CBMIR), the Euclidean and Mahalanobis distances play an important role in image retrieval, since the distance formula is what matches the images. In this research work, we parallelized the Euclidean distance algorithm with CUDA. The CPU version was run on an Intel® Dual-Core E5500 @ 2.80 GHz with 2.0 GB of main memory under Windows XP (SP2). The next step was to convert this code to run on a GPU, an NVIDIA GeForce 9500GT with 1023 MB of DDR2 video memory and a 64-bit bus, using the 270.81 series NVIDIA graphics driver. Both the CPU and GPU versions of the algorithm were implemented in MATLAB R2010; the CPU version runs in plain MATLAB, while the GPU version uses the intermediate software Jacket (win-1.3.0). Using Jacket requires some changes to the source code so that the CPU and GPU work simultaneously, reducing the overall computation time. Our work makes extensive use of the highly multithreaded architecture of multi-core GPUs, and efficient use of shared memory is required to optimize parallel reduction in the Compute Unified Device Architecture (CUDA). GPUs are emerging as powerful parallel systems at a low cost of a few thousand rupees.
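For readers who prefer plain CUDA to the MATLAB/Jacket route used in this entry, the following hedged sketch computes squared Euclidean distances between one query descriptor and a small set of database descriptors, one thread per database vector; the descriptor length and contents are illustrative assumptions.

```cuda
// Sketch of a distance kernel for retrieval: one thread scores one database descriptor.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void sq_euclidean(const float* db, const float* query, float* dist,
                             int n_vectors, int dim) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n_vectors) return;
    float acc = 0.0f;
    for (int d = 0; d < dim; ++d) {
        float diff = db[v * dim + d] - query[d];
        acc += diff * diff;
    }
    dist[v] = acc;                       // the smallest entries give the best matches
}

int main() {
    const int NV = 3, DIM = 4;
    float h_db[NV * DIM] = {0, 0, 0, 0,  1, 1, 1, 1,  2, 2, 2, 2};
    float h_q[DIM] = {1, 1, 1, 1};
    float *d_db, *d_q, *d_dist;
    cudaMalloc(&d_db, sizeof(h_db)); cudaMalloc(&d_q, sizeof(h_q));
    cudaMalloc(&d_dist, NV * sizeof(float));
    cudaMemcpy(d_db, h_db, sizeof(h_db), cudaMemcpyHostToDevice);
    cudaMemcpy(d_q, h_q, sizeof(h_q), cudaMemcpyHostToDevice);

    sq_euclidean<<<1, 32>>>(d_db, d_q, d_dist, NV, DIM);

    float h_dist[NV];
    cudaMemcpy(h_dist, d_dist, sizeof(h_dist), cudaMemcpyDeviceToHost);
    for (int v = 0; v < NV; ++v) printf("d^2(query, db[%d]) = %g\n", v, h_dist[v]);
    cudaFree(d_db); cudaFree(d_q); cudaFree(d_dist);
    return 0;
}
```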
cellGPU: Massively parallel simulations of dynamic vertex models
Sussman, Daniel M.
2017-10-01
Vertex models represent confluent tissue by polygonal or polyhedral tilings of space, with the individual cells interacting via force laws that depend on both the geometry of the cells and the topology of the tessellation. This dependence on the connectivity of the cellular network introduces several complications to performing molecular-dynamics-like simulations of vertex models, and in particular makes parallelizing the simulations difficult. cellGPU addresses this difficulty and lays the foundation for massively parallelized, GPU-based simulations of these models. This article discusses its implementation for a pair of two-dimensional models, and compares the typical performance that can be expected between running cellGPU entirely on the CPU versus its performance when running on a range of commercial and server-grade graphics cards. By implementing the calculation of topological changes and forces on cells in a highly parallelizable fashion, cellGPU enables researchers to simulate time- and length-scales previously inaccessible via existing single-threaded CPU implementations. Program Files doi:http://dx.doi.org/10.17632/6j2cj29t3r.1 Licensing provisions: MIT Programming language: CUDA/C++ Nature of problem: Simulations of off-lattice "vertex models" of cells, in which the interaction forces depend on both the geometry and the topology of the cellular aggregate. Solution method: Highly parallelized GPU-accelerated dynamical simulations in which the force calculations and the topological features can be handled on either the CPU or GPU. Additional comments: The code is hosted at https://gitlab.com/dmsussman/cellGPU, with documentation additionally maintained at http://dmsussman.gitlab.io/cellGPUdocumentation
National Aeronautics and Space Administration — The objective of this project was to use GPU enabled computing to accelerate the analyses of heat transfer and thermal effects. Graphical processing unit (GPU)...
Parallel Optimization of 3D Cardiac Electrophysiological Model Using GPU
Directory of Open Access Journals (Sweden)
Yong Xia
2015-01-01
Large-scale 3D virtual heart model simulations are highly demanding in computational resources. This imposes a big challenge for traditional CPU-based computing resources, which either cannot meet the full computational demand or are not easily available due to their expense. The GPU as a parallel computing environment therefore provides an alternative for solving the large-scale computational problems of whole-heart modeling. In this study, using a 3D sheep atrial model as a test bed, we developed a GPU-based simulation algorithm to simulate the conduction of electrical excitation waves in the 3D atria. In the GPU algorithm, a multicellular tissue model was split into two components: one is the single cell model (ordinary differential equations) and the other is the diffusion term of the monodomain model (partial differential equation). Such a decoupling enabled realization of the GPU parallel algorithm. Furthermore, several optimization strategies were proposed based on the features of the virtual heart model, which enabled a 200-fold speedup as compared to a CPU implementation. In conclusion, an optimized GPU algorithm has been developed that provides an economic and powerful platform for 3D whole-heart simulations.
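The ODE/PDE decoupling described above can be illustrated with a deliberately simplified sketch (not the sheep-atria code): one kernel advances a toy single-variable cell model at every node, and a second kernel applies an explicit 1D diffusion step for the monodomain coupling. The cell kinetics, grid size, time step, and diffusion coefficient are all assumptions chosen for the example.

```cuda
// Sketch of the ODE/PDE split: local reaction kernel + explicit 1D diffusion kernel.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void cell_ode_step(float* v, int n, float dt) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] += dt * (-v[i] * (v[i] - 1.0f) * (v[i] - 0.1f));  // toy excitable kinetics
}

__global__ void diffusion_step(const float* v, float* v_new, int n, float d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float left  = v[i > 0 ? i - 1 : i];        // zero-flux boundaries
    float right = v[i < n - 1 ? i + 1 : i];
    v_new[i] = v[i] + d * (left - 2.0f * v[i] + right);
}

int main() {
    const int N = 256, STEPS = 100;
    float h_v[N] = {0};
    for (int i = 0; i < 8; ++i) h_v[i] = 1.0f;        // stimulate the left end

    float *d_v, *d_tmp;
    cudaMalloc(&d_v, sizeof(h_v)); cudaMalloc(&d_tmp, sizeof(h_v));
    cudaMemcpy(d_v, h_v, sizeof(h_v), cudaMemcpyHostToDevice);

    dim3 block(128), grid((N + 127) / 128);
    for (int s = 0; s < STEPS; ++s) {
        cell_ode_step<<<grid, block>>>(d_v, N, 0.05f);         // local reaction (ODE)
        diffusion_step<<<grid, block>>>(d_v, d_tmp, N, 0.2f);  // tissue coupling (PDE)
        float* swap = d_v; d_v = d_tmp; d_tmp = swap;          // ping-pong buffers
    }
    cudaMemcpy(h_v, d_v, sizeof(h_v), cudaMemcpyDeviceToHost);
    printf("v[0] = %g, v[N/2] = %g after %d steps\n", h_v[0], h_v[N / 2], STEPS);
    cudaFree(d_v); cudaFree(d_tmp);
    return 0;
}
```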
High Performance GPU-Based Fourier Volume Rendering.
Abdellah, Marwan; Eldeib, Ayman; Sharawi, Amr
2015-01-01
Fourier volume rendering (FVR) is a significant visualization technique that has been used widely in digital radiography. As a result of its O(N^2 log N) time complexity, it provides a faster alternative to spatial-domain volume rendering algorithms that are O(N^3) computationally complex. Relying on the Fourier projection-slice theorem, this technique operates on the spectral representation of a 3D volume instead of processing its spatial representation to generate attenuation-only projections that look like X-ray radiographs. Due to the rapid evolution of its underlying architecture, the graphics processing unit (GPU) became an attractive competent platform that can deliver giant computational raw power compared to the central processing unit (CPU) on a per-dollar basis. The introduction of the compute unified device architecture (CUDA) technology enables embarrassingly-parallel algorithms to run efficiently on CUDA-capable GPU architectures. In this work, a high performance GPU-accelerated implementation of the FVR pipeline on CUDA-enabled GPUs is presented. This proposed implementation can achieve a speed-up of 117x compared to a single-threaded hybrid implementation that uses the CPU and GPU together, by taking advantage of executing the rendering pipeline entirely on recent GPU architectures.
GPU Lossless Hyperspectral Data Compression System
Aranki, Nazeeh I.; Keymeulen, Didier; Kiely, Aaron B.; Klimesh, Matthew A.
2014-01-01
Hyperspectral imaging systems onboard aircraft or spacecraft can acquire large amounts of data, putting a strain on limited downlink and storage resources. Onboard data compression can mitigate this problem but may require a system capable of a high throughput. In order to achieve a high throughput with a software compressor, a graphics processing unit (GPU) implementation of a compressor was developed targeting the current state-of-the-art GPUs from NVIDIA(R). The implementation is based on the fast lossless (FL) compression algorithm reported in "Fast Lossless Compression of Multispectral-Image Data" (NPO- 42517), NASA Tech Briefs, Vol. 30, No. 8 (August 2006), page 26, which operates on hyperspectral data and achieves excellent compression performance while having low complexity. The FL compressor uses an adaptive filtering method and achieves state-of-the-art performance in both compression effectiveness and low complexity. The new Consultative Committee for Space Data Systems (CCSDS) Standard for Lossless Multispectral & Hyperspectral image compression (CCSDS 123) is based on the FL compressor. The software makes use of the highly-parallel processing capability of GPUs to achieve a throughput at least six times higher than that of a software implementation running on a single-core CPU. This implementation provides a practical real-time solution for compression of data from airborne hyperspectral instruments.
Nakasato, N
2009-01-01
The kd-tree is a fundamental tool in computer science. Among other uses, an application of kd-tree search (the oct-tree method) to fast evaluation of particle interactions and neighbor search is highly important, since the computational complexity of these problems is reduced from O(N^2) with a brute force method to O(N log N) with the tree method, where N is the number of particles. In this paper, we present a parallel implementation of the tree method running on a graphics processing unit (GPU). We successfully run a simulation of structure formation in the universe very efficiently. On our system, which costs roughly $900, the run with N ~ 2.87x10^6 particles took 5.79 hours and executed 1.2x10^13 force evaluations in total. We obtained a sustained computing speed of 21.8 Gflops and a cost per Gflops of $41.6, which is two and a half times better than the previous record in 2006.
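For contrast with the tree method in this entry, the sketch below shows only the O(N²) brute-force gravity kernel that an oct-tree traversal replaces; the particle data and softening length are illustrative assumptions, and the tree walk itself is not shown.

```cuda
// Baseline sketch only: all-pairs gravitational acceleration, the O(N^2) cost the tree avoids.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void brute_force(const float4* p, float3* acc, int n, float eps2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float3 a = {0.0f, 0.0f, 0.0f};
    for (int j = 0; j < n; ++j) {                     // every pair: this is the O(N^2) cost
        float dx = p[j].x - p[i].x, dy = p[j].y - p[i].y, dz = p[j].z - p[i].z;
        float r2 = dx * dx + dy * dy + dz * dz + eps2;
        float inv_r = rsqrtf(r2);
        float s = p[j].w * inv_r * inv_r * inv_r;     // w stores the particle mass
        a.x += s * dx; a.y += s * dy; a.z += s * dz;
    }
    acc[i] = a;
}

int main() {
    const int N = 1024;
    float4* h_p = new float4[N];
    for (int i = 0; i < N; ++i) h_p[i] = make_float4(i % 10, (i / 10) % 10, i / 100, 1.0f);

    float4* d_p; float3* d_acc;
    cudaMalloc(&d_p, N * sizeof(float4));
    cudaMalloc(&d_acc, N * sizeof(float3));
    cudaMemcpy(d_p, h_p, N * sizeof(float4), cudaMemcpyHostToDevice);

    brute_force<<<(N + 255) / 256, 256>>>(d_p, d_acc, N, 1e-4f);

    float3 a0;
    cudaMemcpy(&a0, d_acc, sizeof(float3), cudaMemcpyDeviceToHost);
    printf("acceleration of particle 0: (%g, %g, %g)\n", a0.x, a0.y, a0.z);
    delete[] h_p;
    cudaFree(d_p); cudaFree(d_acc);
    return 0;
}
```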
Su, Xiaoquan; Wang, Xuetao; Jing, Gongchao; Ning, Kang
2014-04-01
The number of microbial community samples is increasing with exponential speed. Data-mining among microbial community samples could facilitate the discovery of valuable biological information that is still hidden in the massive data. However, current methods for the comparison among microbial communities are limited by their ability to process large amount of samples each with complex community structure. We have developed an optimized GPU-based software, GPU-Meta-Storms, to efficiently measure the quantitative phylogenetic similarity among massive amount of microbial community samples. Our results have shown that GPU-Meta-Storms would be able to compute the pair-wise similarity scores for 10 240 samples within 20 min, which gained a speed-up of >17 000 times compared with single-core CPU, and >2600 times compared with 16-core CPU. Therefore, the high-performance of GPU-Meta-Storms could facilitate in-depth data mining among massive microbial community samples, and make the real-time analysis and monitoring of temporal or conditional changes for microbial communities possible. GPU-Meta-Storms is implemented by CUDA (Compute Unified Device Architecture) and C++. Source code is available at http://www.computationalbioenergy.org/meta-storms.html.
Research on GPU Acceleration for Monte Carlo Criticality Calculation
Xu, Qi; Yu, Ganglin; Wang, Kan
2014-06-01
The Monte Carlo neutron transport method can be naturally parallelized by multi-core architectures due to the independence of particles during the simulation. The GPU+CPU heterogeneous parallel mode has become an increasingly popular way of parallelism in the field of scientific supercomputing. Thus, this work focuses on the GPU acceleration method for the Monte Carlo criticality simulation, as well as the computational efficiency that GPUs can bring. The "neutron transport step" is introduced to increase the GPU thread occupancy. In order to test the sensitivity to the MC code's complexity, a 1D one-group code and a 3D multi-group general-purpose code are respectively ported to GPUs, and the acceleration effects are compared. The results of the numerical experiments show a considerable acceleration effect for the "neutron transport step" strategy. However, the performance comparison between the 1D code and the 3D code indicates the poor scalability of MC codes on GPUs.
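The "one thread, one particle history" pattern underlying such GPU Monte Carlo codes can be sketched as follows (a toy example, not the codes compared in this entry): each thread walks a neutron through a homogeneous 1D slab with cuRAND and tallies absorption or leakage. The cross section, absorption probability, and slab width are assumptions made up for the example.

```cuda
// Toy 1D slab Monte Carlo: one thread per neutron history, counters updated atomically.
#include <cstdio>
#include <cuda_runtime.h>
#include <curand_kernel.h>

__global__ void slab_walk(unsigned long long seed, int n, float sigma_t, float p_absorb,
                          float width, unsigned int* absorbed, unsigned int* leaked) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    curandState rng;
    curand_init(seed, i, 0, &rng);
    float x = 0.0f, dir = 1.0f;                               // start at the left face
    while (true) {
        x += dir * (-logf(curand_uniform(&rng)) / sigma_t);   // sample a free flight
        if (x < 0.0f || x > width) { atomicAdd(leaked, 1u); return; }
        if (curand_uniform(&rng) < p_absorb) { atomicAdd(absorbed, 1u); return; }
        dir = (curand_uniform(&rng) < 0.5f) ? -1.0f : 1.0f;   // isotropic (1D) scatter
    }
}

int main() {
    const int N = 1 << 20;
    unsigned int h_counts[2] = {0, 0};
    unsigned int* d_counts;
    cudaMalloc(&d_counts, 2 * sizeof(unsigned int));
    cudaMemcpy(d_counts, h_counts, 2 * sizeof(unsigned int), cudaMemcpyHostToDevice);

    slab_walk<<<(N + 255) / 256, 256>>>(1234ULL, N, 1.0f, 0.3f, 5.0f,
                                        &d_counts[0], &d_counts[1]);

    cudaMemcpy(h_counts, d_counts, 2 * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    printf("absorbed: %u   leaked: %u   (of %d histories)\n", h_counts[0], h_counts[1], N);
    cudaFree(d_counts);
    return 0;
}
```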
Medical image processing on the GPU - past, present and future.
Eklund, Anders; Dufort, Paul; Forsberg, Daniel; LaConte, Stephen M
2013-12-01
Graphics processing units (GPUs) are used today in a wide range of applications, mainly because they can dramatically accelerate parallel computing, are affordable and energy efficient. In the field of medical imaging, GPUs are in some cases crucial for enabling practical use of computationally demanding algorithms. This review presents the past and present work on GPU accelerated medical image processing, and is meant to serve as an overview and introduction to existing GPU implementations. The review covers GPU acceleration of basic image processing operations (filtering, interpolation, histogram estimation and distance transforms), the most commonly used algorithms in medical imaging (image registration, image segmentation and image denoising) and algorithms that are specific to individual modalities (CT, PET, SPECT, MRI, fMRI, DTI, ultrasound, optical imaging and microscopy). The review ends by highlighting some future possibilities and challenges.
Identifying attributes of GPU programs for difficulty evaluation
Directory of Open Access Journals (Sweden)
Dale Tristram
2014-08-01
Full Text Available General-purpose computation on graphics processing units (GPGPU) has great potential to accelerate many scientific models and algorithms. However, some problems are considerably more difficult to accelerate than others, and it may be challenging for those new to GPGPU to ascertain the difficulty of accelerating a particular problem. Through what was learned in the acceleration of three problems, problem attributes have been identified that can assist in evaluating the difficulty of accelerating a problem on a GPU. The identified attributes are a problem's available parallelism, inherent parallelism, synchronisation requirements, and data transfer requirements. We envisage that with further development, these attributes could form the foundation of a difficulty classification system that could be used to determine whether GPU acceleration is practical for a candidate problem, aid in identifying appropriate techniques and optimisations, and outline the required GPGPU knowledge.
Gpu Implementation of a Viscous Flow Solver on Unstructured Grids
Xu, Tianhao; Chen, Long
2016-06-01
Graphics processing units have gained popularity in scientific computing over the past several years due to their outstanding parallel computing capability. Computational fluid dynamics applications involve large amounts of calculation, so a recent GPU card, whose peak computing performance and memory bandwidth are much better than those of a contemporary high-end CPU, is preferable. We herein focus on the detailed implementation of our GPU-targeted Reynolds-averaged Navier-Stokes solver based on the finite-volume method. The solver employs a vertex-centered scheme on unstructured grids so that it can handle complex topologies. Multiple optimizations are carried out to improve the memory access performance and kernel utilization. Both steady and unsteady flow simulation cases are carried out using an explicit Runge-Kutta scheme. The GPU-accelerated solver presented in this paper is demonstrated to have competitive advantages over its CPU-targeted counterpart.
Numerical simulation of lava flow using a GPU SPH model
Directory of Open Access Journals (Sweden)
Eugenio Rustico
2011-12-01
Full Text Available A smoothed particle hydrodynamics (SPH) method for lava-flow modeling was implemented on a graphical processing unit (GPU) using the compute unified device architecture (CUDA) developed by NVIDIA. This resulted in speed-ups of up to two orders of magnitude. The three-dimensional model can simulate lava flow on a real topography with a free surface, non-Newtonian fluids, and phase change. The SPH code has three main components: neighbor list construction, force computation, and integration of the equation of motion. All three are computed on the GPU, fully exploiting its computational power. The simulation speed achieved is one to two orders of magnitude faster than that of the equivalent central processing unit (CPU) code. This GPU implementation of SPH allows high-resolution SPH modeling in hours and days, rather than weeks and months, on inexpensive and readily available hardware.
gPGA: GPU Accelerated Population Genetics Analyses.
Directory of Open Access Journals (Sweden)
Chunbao Zhou
Full Text Available The isolation with migration (IM) model is important for studies in population genetics and phylogeography. The IM program applies the IM model to genetic data drawn from a pair of closely related populations or species, based on Markov chain Monte Carlo (MCMC) simulations of gene genealogies. However, the computational burden of the IM program has placed limits on its application. With its strong computational power, the Graphics Processing Unit (GPU) has been widely used in many fields. In this article, we present an effective implementation of the IM program on a GPU based on the Compute Unified Device Architecture (CUDA), which we call gPGA. Compared with the IM program, gPGA can achieve up to a 52.30X speedup on one GPU. The evaluation results demonstrate that it allows datasets to be analyzed effectively and rapidly for research on divergence population genetics. The software is freely available with source code at https://github.com/chunbaozhou/gPGA.
Real-time Flame Rendering with GPU and CUDA
Directory of Open Access Journals (Sweden)
Wei Wei
2011-02-01
Full Text Available This paper proposes a method of flame simulation based on a Lagrangian process and chemical composition; the method is grid-free, so the problems associated with grids are avoided. The turbulent movement of the flame is described by a Lagrangian process, and chemical composition is added to the flame simulation, which increases its realism. For real-time applications, this paper simplifies the EMST model. A GPU-based particle system combined with OpenGL VBO and PBO techniques is used for acceleration: the speed of vertex and pixel data interaction between CPU and GPU increased by two orders of magnitude, and the rendering frame rate increased by 30%, achieving fast real-time dynamic flame simulation. For further acceleration, this paper presents a strategy to implement the flame simulation with CUDA on the GPU, which achieved a speedup of 2.5 times over the previous implementation.
A High Performance Image Authentication Algorithm on GPU with CUDA
Directory of Open Access Journals (Sweden)
Caiwei Lin
2011-03-01
Full Text Available There has been a large amount of research on image authentication methods. Many of the schemes perform well in verification results; however, most of them are time-consuming in traditional serial implementations, and improving the efficiency of the authentication process has become one of the challenges in the image authentication field today. Authentication systems that are high-performance, real-time, flexible, and easy to develop are the future trend. In this paper, we present a CUDA-based implementation of an image authentication algorithm on NVIDIA's Tesla C1060 GPU devices. Compared with the original CPU implementation, our CUDA-based implementation runs 20x-50x faster with a single GPU device. Experiments also show that, by using two GPUs, performance can be further improved by around 1.2 times compared to a single GPU.
A Novel Architecture of Multi-GPU Computing Card
Directory of Open Access Journals (Sweden)
Sen Guo
2013-08-01
Full Text Available Data transmission between GPUs on existing multi-GPU computing cards often goes through PCIe, which is relatively slow, so PCIe has become the bottleneck of overall performance. A novel architecture for a multi-GPU computing card is proposed in this paper: a multi-channel memory with multiple interfaces is added, including one common interface shared by the different GPUs, which is connected to an FPGA arbitration circuit, and several other interfaces connected independently to the frame buffers of dedicated GPUs; this multi-channel memory is called "global shared memory". The result of a simulation accelerating computed tomography algebraic reconstruction on multiple GPUs demonstrates the effectiveness of this approach.
Implementation of a Parallel Tree Method on a GPU
Nakasato, Naohito
2011-01-01
The kd-tree is a fundamental tool in computer science. Among other applications, the application of kd-tree search (by the tree method) to the fast evaluation of particle interactions and neighbor search is highly important, since the computational complexity of these problems is reduced from O(N^2) for a brute force method to O(N log N) for the tree method, where N is the number of particles. In this paper, we present a parallel implementation of the tree method running on a graphics processing unit (GPU). We present a detailed description of how we have implemented the tree method on a Cypress GPU. An optimization that we found important is localized particle ordering to effectively utilize cache memory. We present a number of test results and performance measurements. Our results show that the execution of the tree traversal in a force calculation on a GPU is practical and efficient.
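For orientation, the O(N^2) baseline that the tree method improves upon can be written as a direct pairwise summation kernel; the sketch below is illustrative only (names, softening parameter and data layout are assumptions) and is not the paper's Cypress GPU tree code.

```cuda
#include <cuda_runtime.h>

// Direct O(N^2) gravitational force evaluation: one thread per target particle.
// pos[i] = (x, y, z, mass). eps2 is a softening term to avoid division by zero.
__global__ void bruteForceGravity(const float4* pos, float3* acc, int n, float eps2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float4 pi = pos[i];
    float3 ai = make_float3(0.0f, 0.0f, 0.0f);

    for (int j = 0; j < n; ++j) {            // a tree code prunes this loop to O(log N) per particle
        float4 pj = pos[j];
        float dx = pj.x - pi.x;
        float dy = pj.y - pi.y;
        float dz = pj.z - pi.z;
        float r2 = dx * dx + dy * dy + dz * dz + eps2;
        float invR = rsqrtf(r2);
        float w = pj.w * invR * invR * invR;  // m_j / r^3
        ai.x += w * dx;
        ai.y += w * dy;
        ai.z += w * dz;
    }
    acc[i] = ai;
}
```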
Direct numerical simulation of turbulence using GPU accelerated supercomputers
Khajeh-Saeed, Ali; Blair Perot, J.
2013-02-01
Direct numerical simulations of turbulence are optimized for up to 192 graphics processors. The results from two large GPU clusters are compared to the performance of corresponding CPU clusters. A number of important algorithm changes are necessary to access the full computational power of graphics processors, and these adaptations are discussed. It is shown that the handling of subdomain communication becomes even more critical when using GPU-based supercomputers. The potential for overlap of MPI communication with GPU computation is analyzed and then optimized. Detailed timings reveal that the internal calculations are now so efficient that the operations related to MPI communication are the primary scaling bottleneck at all but the very largest problem sizes that can fit on the hardware. This work gives a glimpse of the CFD performance issues that will dominate many hardware platforms in the near future.
A GPU-Computing Approach to Solar Stokes Profile Inversion
Harker, Brian J
2012-01-01
We present a new computational approach to the inversion of solar photospheric Stokes polarization profiles, under the Milne-Eddington model, for vector magnetography. Our code, named GENESIS (GENEtic Stokes Inversion Strategy), employs multi-threaded parallel-processing techniques to harness the computing power of graphics processing units (GPUs), along with algorithms designed to exploit the inherent parallelism of the Stokes inversion problem. Using a genetic algorithm (GA) engineered specifically for use with a GPU, we produce full-disc maps of the photospheric vector magnetic field from polarized spectral line observations recorded by the Synoptic Optical Long-term Investigations of the Sun (SOLIS) Vector Spectromagnetograph (VSM) instrument. We show the advantages of pairing a population-parallel genetic algorithm with data-parallel GPU-computing techniques, and present an overview of the Stokes inversion problem, including a description of our adaptation to the GPU-computing paradigm. Full-disc vector ma...
Institute of Scientific and Technical Information of China (English)
JI Xin; POLLIN Sofie; LENOIR Gregory; LAFRUIT Gauthier; DEJONGHE Antoine; CATTHOOR Francky
2006-01-01
This paper introduces a video application-aware cross-layer framework for joint performance-energy optimization, considering the scenario of multiple users upstreaming real-time Motion JPEG2000 video streams to the access point of a WiFi wireless local area network, and extends the PHY-MAC run-time cross-layer scheduling strategy that we introduced in (Mangharam et al., 2005; Pollin et al., 2005) to also consider congested network situations where video packets have to be dropped. We show that an optimal solution at the PHY-MAC level can be highly suboptimal at the application level, and then show that making the cross-layer framework application-aware, through a prioritized dropping policy capitalizing on the inherent scalability of Motion JPEG2000 video streams, leads to drastic average video quality improvements and inter-user quality variation reductions of as much as 10 dB PSNR, without affecting the overall energy consumption requirements.
Accelerated 3D Monte Carlo light dosimetry using a graphics processing unit (GPU) cluster
Lo, William Chun Yip; Lilge, Lothar
2010-11-01
This paper presents a basic computational framework for real-time, 3-D light dosimetry on graphics processing unit (GPU) clusters. The GPU-based approach offers a direct solution to overcome the long computation time preventing Monte Carlo simulations from being used in complex optimization problems such as treatment planning, particularly if simulated annealing is employed as the optimization algorithm. The current multi-GPU implementation is validated using a commercial light modelling software (ASAP from Breault Research Organization). It also supports the latest Fermi GPU architecture and features an interactive 3-D visualization interface. The software is available for download at http://code.google.com/p/gpu3d.
How General-Purpose can a GPU be?
Directory of Open Access Journals (Sweden)
Philip Machanick
2015-12-01
Full Text Available The use of graphics processing units (GPUs) in general-purpose computation (GPGPU) is a growing field. GPU instruction sets, while implementing a graphics pipeline, draw from a range of single instruction multiple datastream (SIMD) architectures characteristic of the heyday of supercomputers. Yet only one of these SIMD instruction sets has been applicable to a wide enough range of problems to survive the era when the full range of supercomputer design variants was being explored: vector instructions. This paper proposes a reconceptualization of the GPU as a multicore design with minimal exotic modes of parallelism so as to make GPGPU truly general.
A generalized GPU-based connected component labeling algorithm
Komura, Yukihiro
2016-01-01
We propose a generalized GPU-based connected component labeling (CCL) algorithm that can be applied both to various lattices and to non-lattice environments in a uniform fashion. We extend our recent GPU-based CCL algorithm, which avoids conventional iteration, to this generalized method. As an application of the algorithm, we deal with the bond percolation problem. We investigate bond percolation on the honeycomb and triangle lattices to confirm the correctness of the algorithm. Moreover, we deal with bond percolation on the Bethe lattice as a substitute for a network structure, and demonstrate the performance of the algorithm on those lattices.
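For contrast with the iteration-free approach described above, the conventional iterative label-propagation baseline on a plain 2D lattice might be sketched as follows; the kernel and all names are assumptions and are not taken from the paper.

```cuda
#include <cuda_runtime.h>

// One sweep of conventional label propagation on a 2D lattice:
// every occupied site adopts the minimum label among itself and its
// occupied 4-neighbours. The host repeats the sweep until *changed == 0.
__global__ void propagateLabels(const int* occupied, int* label,
                                int nx, int ny, int* changed)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= nx || y >= ny) return;

    int idx = y * nx + x;
    if (!occupied[idx]) return;

    int best = label[idx];
    if (x > 0      && occupied[idx - 1])  best = min(best, label[idx - 1]);
    if (x < nx - 1 && occupied[idx + 1])  best = min(best, label[idx + 1]);
    if (y > 0      && occupied[idx - nx]) best = min(best, label[idx - nx]);
    if (y < ny - 1 && occupied[idx + nx]) best = min(best, label[idx + nx]);

    if (best < label[idx]) {
        label[idx] = best;      // benign race: labels only ever decrease
        *changed = 1;
    }
}
```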
GPU accelerated spectral finite elements on all-hex meshes
Remacle, J.-F.; Gandham, R.; Warburton, T.
2016-11-01
This paper presents a spectral element finite element scheme that efficiently solves elliptic problems on unstructured hexahedral meshes. The discrete equations are solved using a matrix-free preconditioned conjugate gradient algorithm. An additive Schwarz two-scale preconditioner is employed that allows h-independent convergence. An extensible multi-threading programming API is used as a common kernel language that allows runtime selection of different computing devices (GPU and CPU) and different threading interfaces (CUDA, OpenCL and OpenMP). Performance tests demonstrate that problems with over 50 million degrees of freedom can be solved in a few seconds on an off-the-shelf GPU.
STEM image simulation with hybrid CPU/GPU programming.
Yao, Y; Ge, B H; Shen, X; Wang, Y G; Yu, R C
2016-07-01
STEM image simulation is achieved via hybrid CPU/GPU programming under parallel algorithm architecture to speed up calculation on a personal computer (PC). To utilize the calculation power of a PC fully, the simulation is performed using the GPU core and multi-CPU cores at the same time to significantly improve efficiency. GaSb and an artificial GaSb/InAs interface with atom diffusion have been used to verify the computation. Copyright © 2016 Elsevier B.V. All rights reserved.
Numerical cosmology on the GPU with Enzo and Ramses
Gheller, Claudio; Vazza, Franco; Teyssier, Romain
2014-01-01
A number of scientific numerical codes can currently exploit GPUs with remarkable performance. In astrophysics, Enzo and Ramses are prime examples of such applications. The two codes have been ported to GPUs adopting different strategies and programming models, Enzo adopting CUDA and Ramses using OpenACC. We describe here the different solutions used for the GPU implementation of both cases. Performance benchmarks will be presented for Ramses. The results of the usage of the more mature GPU version of Enzo, adopted for a scientific project within the CHRONOS programme, will be summarised.
Numerical cosmology on the GPU with Enzo and Ramses
Gheller, C.; Wang, P.; Vazza, F.; Teyssier, R.
2015-09-01
A number of scientific numerical codes can currently exploit GPUs with remarkable performance. In astrophysics, Enzo and Ramses are prime examples of such applications. The two codes have been ported to GPUs adopting different strategies and programming models, Enzo adopting CUDA and Ramses using OpenACC. We describe here the different solutions used for the GPU implementation of both cases. Performance benchmarks will be presented for Ramses. The results of the usage of the more mature GPU version of Enzo, adopted for a scientific project within the CHRONOS programme, will be summarised.
Parallel fuzzy connected image segmentation on GPU
Zhuge, Ying; Cao, Yong; Udupa, Jayaram K.; Miller, Robert W.
2011-01-01
Purpose: Image segmentation techniques using fuzzy connectedness (FC) principles have shown their effectiveness in segmenting a variety of objects in several large applications. However, one challenge in these algorithms has been their excessive computational requirements when processing large image datasets. Nowadays, commodity graphics hardware provides a highly parallel computing environment. In this paper, the authors present a parallel fuzzy connected image segmentation algorithm implementation on NVIDIA's Compute Unified Device Architecture (CUDA) platform for segmenting medical image data sets. Methods: In the FC algorithm, there are two major computational tasks: (i) computing the fuzzy affinity relations and (ii) computing the fuzzy connectedness relations. These two tasks are implemented as CUDA kernels and executed on the GPU. A dramatic improvement in speed for both tasks is achieved as a result. Results: Our experiments based on three data sets of small, medium, and large data size demonstrate the efficiency of the parallel algorithm, which achieves speed-up factors of 24.4x, 18.1x, and 10.3x, respectively, for the three data sets on the NVIDIA Tesla C1060 over the implementation of the algorithm on CPU, and takes 0.25, 0.72, and 15.04 s, respectively, for the three data sets. Conclusions: The authors developed a parallel algorithm of the widely used fuzzy connected image segmentation method on NVIDIA GPUs, which are far more cost- and speed-effective than both clusters of workstations and multiprocessing systems. A near-interactive segmentation speed has been achieved, even for the large data set. PMID:21859037
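As an illustration of the first task, a simplified intensity-based affinity between face-adjacent voxels could be computed by a kernel along the following lines; the Gaussian homogeneity form and all names are assumptions here, not the paper's actual affinity definition.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Simplified fuzzy affinity for the +x, +y and +z neighbours of each voxel:
// affinity = exp(-(I(c) - I(d))^2 / (2 * sigma^2)), a common homogeneity term.
__global__ void fuzzyAffinity(const float* img, float* affX, float* affY, float* affZ,
                              int nx, int ny, int nz, float sigma)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x >= nx || y >= ny || z >= nz) return;

    int idx = (z * ny + y) * nx + x;
    float ic = img[idx];
    float inv2s2 = 1.0f / (2.0f * sigma * sigma);

    float dx = (x + 1 < nx) ? img[idx + 1] - ic : 0.0f;
    float dy = (y + 1 < ny) ? img[idx + nx] - ic : 0.0f;
    float dz = (z + 1 < nz) ? img[idx + nx * ny] - ic : 0.0f;

    affX[idx] = (x + 1 < nx) ? expf(-dx * dx * inv2s2) : 0.0f;
    affY[idx] = (y + 1 < ny) ? expf(-dy * dy * inv2s2) : 0.0f;
    affZ[idx] = (z + 1 < nz) ? expf(-dz * dz * inv2s2) : 0.0f;
}
```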
GPU-acceleration of parallel unconditionally stable group explicit finite difference method
Parand, K; Hossayni, Sayyed A
2013-01-01
Graphics Processing Units (GPUs) are high-performance co-processors originally intended to improve the use and quality of computer graphics applications. Once researchers and practitioners realized the potential of using GPUs for general purposes, their application was extended to other fields beyond the scope of computer graphics. The main objective of this paper is to evaluate the impact of using the GPU to solve the transient diffusion-type equation by the parallel and stable group explicit finite difference method. To accomplish that, GPU- and CPU-based (multi-core) approaches were developed. Moreover, we propose an optimal synchronization arrangement for the implementation pseudo-code. The interrelation of GPU parallel programming and the initialization of algorithm variables is also discussed using numerical experiments. The GPU-approach results are faster than those of a much more expensive parallel 8-thread CPU-based approach. The GPU used in this paper is an ordinary laptop GPU (GT 335M) and is accessible for e...
GPU Accelerated Likelihoods for Stereo-Based Articulated Tracking
DEFF Research Database (Denmark)
Friborg, Rune Møllegaard; Hauberg, Søren; Erleben, Kenny
For many years articulated tracking has been an active research topic in the computer vision community. While working solutions have been suggested, computational time is still problematic. We present a GPU implementation of a ray-casting based likelihood model that is orders of magnitude faster...
GPU accelerated likelihoods for stereo-based articulated tracking
DEFF Research Database (Denmark)
Friborg, Rune Møllegaard; Hauberg, Søren; Erleben, Kenny
2010-01-01
For many years articulated tracking has been an active research topic in the computer vision community. While working solutions have been suggested, computational time is still problematic. We present a GPU implementation of a ray-casting based likelihood model that is orders of magnitude faster...
Optimizing a mobile robot control system using GPU acceleration
Tuck, Nat; McGuinness, Michael; Martin, Fred
2012-01-01
This paper describes our attempt to optimize a robot control program for the Intelligent Ground Vehicle Competition (IGVC) by running computationally intensive portions of the system on a commodity graphics processing unit (GPU). The IGVC Autonomous Challenge requires a control program that performs a number of different computationally intensive tasks ranging from computer vision to path planning. For the 2011 competition our Robot Operating System (ROS) based control system would not run comfortably on the multicore CPU on our custom robot platform. The process of profiling the ROS control program and selecting appropriate modules for porting to run on a GPU is described. A GPU-targeting compiler, Bacon, is used to speed up development and help optimize the ported modules. The impact of the ported modules on overall performance is discussed. We conclude that GPU optimization can free a significant amount of CPU resources with minimal effort for expensive user-written code, but that replacing heavily-optimized library functions is more difficult, and a much less efficient use of time.
GPU-accelerated denoising of 3D magnetic resonance images
Energy Technology Data Exchange (ETDEWEB)
Howison, Mark; Wes Bethel, E.
2014-05-29
The raw computational power of GPU accelerators enables fast denoising of 3D MR images using bilateral filtering, anisotropic diffusion, and non-local means. In practice, applying these filtering operations requires setting multiple parameters. This study was designed to provide better guidance to practitioners for choosing the most appropriate parameters by answering two questions: what parameters yield the best denoising results in practice? And what tuning is necessary to achieve optimal performance on a modern GPU? To answer the first question, we use two different metrics, mean squared error (MSE) and mean structural similarity (MSSIM), to compare denoising quality against a reference image. Surprisingly, the best improvement in structural similarity with the bilateral filter is achieved with a small stencil size that lies within the range of real-time execution on an NVIDIA Tesla M2050 GPU. Moreover, inappropriate choices for parameters, especially scaling parameters, can yield very poor denoising performance. To answer the second question, we perform an autotuning study to empirically determine optimal memory tiling on the GPU. The variation in these results suggests that such tuning is an essential step in achieving real-time performance. These results have important implications for the real-time application of denoising to MR images in clinical settings that require fast turn-around times.
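A minimal CUDA sketch of the brute-force 3D bilateral filter that such a study tunes is shown below; the kernel and parameter names are illustrative assumptions, and the GPU memory tiling discussed in the paper is deliberately omitted.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Brute-force 3D bilateral filter with a (2r+1)^3 stencil.
// sigmaS controls the spatial weight, sigmaR the intensity (range) weight.
__global__ void bilateral3D(const float* in, float* out,
                            int nx, int ny, int nz, int r,
                            float sigmaS, float sigmaR)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x >= nx || y >= ny || z >= nz) return;

    float center = in[(z * ny + y) * nx + x];
    float sum = 0.0f, wsum = 0.0f;

    for (int dz = -r; dz <= r; ++dz)
      for (int dy = -r; dy <= r; ++dy)
        for (int dx = -r; dx <= r; ++dx) {
            int xx = min(max(x + dx, 0), nx - 1);   // clamp indices at the borders
            int yy = min(max(y + dy, 0), ny - 1);
            int zz = min(max(z + dz, 0), nz - 1);
            float v  = in[(zz * ny + yy) * nx + xx];
            float ds = float(dx * dx + dy * dy + dz * dz);
            float dr = v - center;
            float w  = expf(-ds / (2.0f * sigmaS * sigmaS)
                            - dr * dr / (2.0f * sigmaR * sigmaR));
            sum  += w * v;
            wsum += w;
        }
    out[(z * ny + y) * nx + x] = sum / wsum;
}
```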
High-Performance Matrix-Vector Multiplication on the GPU
DEFF Research Database (Denmark)
Sørensen, Hans Henrik Brandenborg
2012-01-01
In this paper, we develop a high-performance GPU kernel for one of the most popular dense linear algebra operations, the matrix-vector multiplication. The target hardware is the most recent Nvidia Tesla 20-series (Fermi architecture), which is designed from the ground up for scientific computing...
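As a baseline for what such a kernel optimizes, a minimal one-thread-per-row GEMV in CUDA is sketched below; the kernel developed in the paper uses more elaborate blocking, so this sketch is purely illustrative and all names are assumptions.

```cuda
#include <cuda_runtime.h>

// Baseline GEMV, y = alpha * A * x + beta * y, one thread per row.
// A is m x n, stored column-major with leading dimension lda, so consecutive
// threads read consecutive elements of each column (coalesced accesses).
__global__ void gemv(int m, int n, float alpha, const float* A, int lda,
                     const float* x, float beta, float* y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= m) return;

    float acc = 0.0f;
    for (int col = 0; col < n; ++col)
        acc += A[col * lda + row] * x[col];

    y[row] = alpha * acc + beta * y[row];
}
```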
GPU Acceleration of Graph Matching, Clustering, and Partitioning
Fagginger Auer, B.O.
2013-01-01
We consider sequential algorithms for hypergraph partitioning and GPU (i.e., fine-grained shared-memory parallel) algorithms for graph partitioning and clustering. Our investigation into sequential hypergraph partitioning is concerned with the efficient construction of high-quality matchings for hyp
GPU-Boosted Camera-Only Indoor Localization
DEFF Research Database (Denmark)
Özkil, Ali Gürcan; Fan, Zhun; Kristensen, Jens Klæstrup
relies on local image features detection, description and matching; by parallelizing these computationally intensive tasks on the graphical processing unit (GPU), it is possible to do online localization using a Topometric Appearance Map. The method is developed as an integral part of a mobile service...
A survey of GPU-based medical image computing techniques.
Shi, Lin; Liu, Wen; Zhang, Heye; Xie, Yongming; Wang, Defeng
2012-09-01
Medical imaging currently plays a crucial role throughout clinical applications, from medical scientific research to diagnostics and treatment planning. However, medical imaging procedures are often computationally demanding due to the large three-dimensional (3D) medical datasets to be processed in practical clinical applications. With the rapidly improving performance of graphics processors, improved programming support, and an excellent price-to-performance ratio, the graphics processing unit (GPU) has emerged as a competitive parallel computing platform for computationally expensive and demanding tasks in a wide range of medical imaging applications. The major purpose of this survey is to provide a comprehensive reference source for newcomers and researchers involved in GPU-based medical image processing. Within this survey, the continuous advancement of GPU computing is reviewed and the existing traditional applications in three areas of medical image processing, namely segmentation, registration and visualization, are surveyed. The potential advantages and associated challenges of current GPU-based medical imaging are also discussed to inspire future applications in medicine.
GPU based contouring method on grid DEM data
Tan, Liheng; Wan, Gang; Li, Feng; Chen, Xiaohui; Du, Wenlong
2017-08-01
This paper presents a novel method to generate contour lines from grid DEM data based on the programmable GPU pipeline. Previous contouring approaches often use the CPU to construct a finite element mesh from the raw DEM data and then extract contour segments from the elements. They also need a tracing or sorting strategy to generate the final continuous contours. These approaches can be heavily CPU-intensive and time-consuming, and the generated contours can be unsmooth if the raw data are sparsely distributed. Unlike the CPU approaches, we employ the GPU's vertex shader to generate a triangular mesh with arbitrary user-defined density, in which the height of each vertex is calculated through a third-order Cardinal spline function. Then, in the same frame, segments are extracted from the triangles by the geometry shader and transferred to the CPU side, with an internal order, in the GPU's transform feedback stage. Finally, we propose a "Grid Sorting" algorithm to obtain continuous contour lines by traversing the segments only once. Our method makes use of multiple stages of the GPU pipeline for computation, generates smooth contour lines, and is significantly faster than previous CPU approaches. The algorithm can be easily implemented with the OpenGL 3.3 API or higher on consumer-level PCs.
Computing 2D constrained delaunay triangulation using the GPU.
Qi, Meng; Cao, Thanh-Tung; Tan, Tiow-Seng
2013-05-01
We propose the first graphics processing unit (GPU) solution to compute the 2D constrained Delaunay triangulation (CDT) of a planar straight line graph (PSLG) consisting of points and edges. There are many existing CPU algorithms to solve the CDT problem in computational geometry, yet there has been no prior approach to solve this problem efficiently using the parallel computing power of the GPU. For the special case of the CDT problem where the PSLG consists of just points, which is simply the normal Delaunay triangulation (DT) problem, a hybrid approach using the GPU together with the CPU to partially speed up the computation has already been presented in the literature. Our work, on the other hand, accelerates the entire computation on the GPU. Our implementation using the CUDA programming model on NVIDIA GPUs is numerically robust, and runs up to an order of magnitude faster than the best sequential implementations on the CPU. This result is reflected in our experiment with both randomly generated PSLGs and real-world GIS data having millions of points and edges.
GPU Acceleration of Graph Matching, Clustering, and Partitioning
Fagginger Auer, B.O.
2013-01-01
We consider sequential algorithms for hypergraph partitioning and GPU (i.e., fine-grained shared-memory parallel) algorithms for graph partitioning and clustering. Our investigation into sequential hypergraph partitioning is concerned with the efficient construction of high-quality matchings for hyp
Cloud GPU-based simulations for SQUAREMR
Kantasis, George; Xanthis, Christos G.; Haris, Kostas; Heiberg, Einar; Aletras, Anthony H.
2017-01-01
Quantitative Magnetic Resonance Imaging (MRI) is a research tool, used more and more in clinical practice, as it provides objective information with respect to the tissues being imaged. Pixel-wise T1 quantification (T1 mapping) of the myocardium is one such application with diagnostic significance. A number of mapping sequences have been developed for myocardial T1 mapping with a wide range in terms of measurement accuracy and precision. Furthermore, measurement results obtained with these pulse sequences are affected by errors introduced by the particular acquisition parameters used. SQUAREMR is a new method which has the potential of improving the accuracy of these mapping sequences through the use of massively parallel simulations on Graphical Processing Units (GPUs) by taking into account different acquisition parameter sets. This method has been shown to be effective in myocardial T1 mapping; however, execution times may exceed 30 min which is prohibitively long for clinical applications. The purpose of this study was to accelerate the construction of SQUAREMR's multi-parametric database to more clinically acceptable levels. The aim of this study was to develop a cloud-based cluster in order to distribute the computational load to several GPU-enabled nodes and accelerate SQUAREMR. This would accommodate high demands for computational resources without the need for major upfront equipment investment. Moreover, the parameter space explored by the simulations was optimized in order to reduce the computational load without compromising the T1 estimates compared to a non-optimized parameter space approach. A cloud-based cluster with 16 nodes resulted in a speedup of up to 13.5 times compared to a single-node execution. Finally, the optimized parameter set approach allowed for an execution time of 28 s using the 16-node cluster, without compromising the T1 estimates by more than 10 ms. The developed cloud-based cluster and optimization of the parameter set reduced
Accelerated protein structure comparison using TM-score-GPU.
Hung, Ling-Hong; Samudrala, Ram
2012-08-15
Accurate comparisons of different protein structures play important roles in structural biology, structure prediction and functional annotation. The root-mean-square-deviation (RMSD) after optimal superposition is the predominant measure of similarity due to the ease and speed of computation. However, global RMSD is dependent on the length of the protein and can be dominated by divergent loops that can obscure local regions of similarity. A more sophisticated measure of structure similarity, Template Modeling (TM)-score, avoids these problems, and it is one of the measures used by the community-wide experiments of critical assessment of protein structure prediction to compare predicted models with experimental structures. TM-score calculations are, however, much slower than RMSD calculations. We have therefore implemented a very fast version of TM-score for Graphical Processing Units (TM-score-GPU), using a new and novel hybrid Kabsch/quaternion method for calculating the optimal superposition and RMSD that is designed for parallel applications. This acceleration in speed allows TM-score to be used efficiently in computationally intensive applications such as for clustering of protein models and genome-wide comparisons of structure. TM-score-GPU was applied to six sets of models from Nutritious Rice for the World for a total of 3 million comparisons. TM-score-GPU is 68 times faster on an ATI 5870 GPU, on average, than the original CPU single-threaded implementation on an AMD Phenom II 810 quad-core processor. The complete source, including the GPU code and the hybrid RMSD subroutine, can be downloaded and used without restriction at http://software.compbio.washington.edu/misc/downloads/tmscore/. The implementation is in C++/OpenCL.
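For reference, the commonly cited definition of TM-score that such a GPU kernel must evaluate for every candidate superposition is (quoted here from the general literature rather than from the paper itself):

```latex
\mathrm{TM\text{-}score} \;=\; \max\left[\frac{1}{L_{\mathrm{target}}}\sum_{i=1}^{L_{\mathrm{ali}}}\frac{1}{1+\left(d_i/d_0\right)^2}\right],
\qquad d_0 \;=\; 1.24\,\sqrt[3]{L_{\mathrm{target}}-15}\;-\;1.8,
```

where d_i is the distance between the i-th pair of aligned residues after superposition, L_target is the length of the target structure, and the maximum is taken over all superpositions.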
GHOSTM: a GPU-accelerated homology search tool for metagenomics.
Directory of Open Access Journals (Sweden)
Shuji Suzuki
Full Text Available BACKGROUND: A large number of sensitive homology searches are required for mapping DNA sequence fragments to known protein sequences in public and private databases during metagenomic analysis. BLAST is currently used for this purpose, but its calculation speed is insufficient, especially for analyzing the large quantities of sequence data obtained from a next-generation sequencer. However, faster search tools, such as BLAT, do not have sufficient search sensitivity for metagenomic analysis. Thus, a sensitive and efficient homology search tool is in high demand for this type of analysis. METHODOLOGY/PRINCIPAL FINDINGS: We developed a new, highly efficient homology search algorithm suitable for graphics processing unit (GPU) calculations that was implemented as a GPU system that we call GHOSTM. The system first searches for candidate alignment positions for a sequence from the database using pre-calculated indexes and then calculates local alignments around the candidate positions before calculating alignment scores. We implemented both of these processes on GPUs. The system achieved calculation speeds that were 130 and 407 times faster than BLAST with 1 GPU and 4 GPUs, respectively. The system also showed higher search sensitivity and a calculation speed that was 4 and 15 times faster than BLAT with 1 GPU and 4 GPUs. CONCLUSIONS: We developed a GPU-optimized algorithm to perform sensitive sequence homology searches and implemented the system as GHOSTM. Currently, sequencing technology continues to improve, and sequencers are increasingly producing larger and larger quantities of data. This explosion of sequence data makes computational analysis with contemporary tools more difficult. We developed GHOSTM, which is a cost-efficient tool, and offer it as a potential solution to this problem.
Geological Visualization System with GPU-Based Interpolation
Huang, L.; Chen, K.; Lai, Y.; Chang, P.; Song, S.
2011-12-01
There has been a large amount of research using parallel-processing GPUs to accelerate computation. In near-surface geology, efficient interpolation is critical for proper interpretation of measured data. Additionally, the interpolation method that generates proper results depends on factors such as the density of the measured locations and the estimation model, so a fast interpolation process is needed to efficiently find a proper interpolation algorithm for a set of collected data. However, a general CPU framework has to process each computation sequentially and is not efficient enough to handle the large number of interpolations generally needed in near-surface geology. When the interpolation process is examined carefully, the computation for each grid point is independent of all other computations. Therefore, the GPU parallel framework should be an efficient technology to accelerate the interpolation process that is critical in near-surface geology. Thus, in this paper we design a geological visualization system whose core includes a set of interpolation algorithms, including Nearest Neighbor, Inverse Distance and Kriging. All these interpolation algorithms are implemented using both a CPU framework and a GPU framework. The comparison between the CPU and GPU implementations in terms of precision and processing speed shows that parallel computation can accelerate the interpolation process, and also demonstrates the possibility of using a GPU-equipped personal computer to replace an expensive workstation. Immediate updates at the measurement site are the dream of geologists. In the future, the parallel and remote computation ability of the cloud will be explored to make mobile computation at the measurement site possible.
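Because every grid point is independent, a one-thread-per-node inverse distance weighting kernel maps directly onto the GPU; the sketch below is a generic illustration under assumed names, not the system's actual implementation.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Inverse distance weighting: each thread interpolates one grid node from all
// measured samples (xs, ys, vs). Every grid point is independent of the others,
// which is why the interpolation maps naturally onto the GPU.
__global__ void idwInterpolate(const float* xs, const float* ys, const float* vs,
                               int nSamples, const float* gx, const float* gy,
                               float* out, int nGrid, float power)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nGrid) return;

    float num = 0.0f, den = 0.0f;
    for (int j = 0; j < nSamples; ++j) {
        float dx = gx[i] - xs[j];
        float dy = gy[i] - ys[j];
        float d2 = dx * dx + dy * dy;
        if (d2 < 1e-12f) { num = vs[j]; den = 1.0f; break; }  // grid node coincides with a sample
        float w = powf(d2, -0.5f * power);                    // 1 / d^power
        num += w * vs[j];
        den += w;
    }
    out[i] = num / den;
}
```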
Cheng, Chun-Pei; Lan, Kuo-Lun; Liu, Wen-Chun; Chang, Ting-Tsung; Tseng, Vincent S
2016-12-01
Hepatitis B virus (HBV) infection is strongly associated with an increased risk of liver diseases such as cirrhosis and hepatocellular carcinoma (HCC). Many lines of evidence suggest that deletions occurring in HBV genomic DNA are highly associated with the activity of HBV via the interplay between aberrant viral protein release and the human immune system. Finding deletions in HBV whole-genome sequences is thus a very important issue, though there are underlying challenges in mining such large and complex biological data. Although some next-generation sequencing (NGS) tools have recently been designed for identifying structural variations such as insertions or deletions, their validity has generally been established on human sequences; this design may not be suitable for viruses, which are a different species. We propose a graphics processing unit (GPU)-based data mining method called DeF-GPU to efficiently and precisely identify HBV deletions from large NGS data, which generally contain millions of reads. To fit the single instruction multiple data paradigm, sequencing reads are treated as the multiple data and the deletion-finding procedure as the single instruction. We use the Compute Unified Device Architecture (CUDA) to parallelize the procedures, and further validate DeF-GPU on 5 synthetic and 1 real datasets. Our results suggest that DeF-GPU outperforms the existing commonly used method Pindel and is able to exactly identify the deletions of our ground truth within a few seconds. The source code and other related materials are available at https://sourceforge.net/projects/defgpu/.
High-Speed GPU-Based Fully Three-Dimensional Diffuse Optical Tomographic System.
Saikia, Manob Jyoti; Kanhirodan, Rajan; Mohan Vasu, Ram
2014-01-01
We have developed a graphics processing unit (GPU)-based high-speed fully 3D system for diffuse optical tomography (DOT). The reduction in execution time of the 3D DOT algorithm, a severely ill-posed problem, is made possible through the use of (1) an algorithmic improvement that uses the Broyden approach for updating the Jacobian matrix and thereby updating the parameter matrix and (2) the multi-node multithreaded GPU and CUDA (Compute Unified Device Architecture) software architecture. Two different GPU implementations of the DOT program are developed in this study: (1) a conventional C language program augmented by GPU CUDA and CULA routines (C GPU), and (2) a MATLAB program supported by the MATLAB parallel computing toolkit for GPU (MATLAB GPU). The computation time of the algorithm on the host CPU and on the GPU system is presented for both the C and MATLAB implementations. The forward computation uses the finite element method (FEM) and the problem domain is discretized into 14610, 30823, and 66514 tetrahedral elements. The reconstruction time achieved for one iteration of the DOT reconstruction with 14610 elements is 0.52 seconds for the C-based GPU program with 2-plane measurements. The corresponding MATLAB-based GPU program took 0.86 seconds. The maximum frame rate achieved is 2 reconstructed frames per second.
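The Broyden approach mentioned above is presumably the standard rank-one secant update, which avoids recomputing the full Jacobian at every iteration; in its usual form (stated from the general literature, not from the paper),

```latex
J_{k+1} \;=\; J_k \;+\; \frac{\left(\Delta \mathbf{f}_k - J_k\,\Delta \mathbf{x}_k\right)\Delta \mathbf{x}_k^{\mathsf T}}{\Delta \mathbf{x}_k^{\mathsf T}\,\Delta \mathbf{x}_k},
\qquad \Delta \mathbf{x}_k = \mathbf{x}_{k+1}-\mathbf{x}_k,\;\;
\Delta \mathbf{f}_k = \mathbf{f}(\mathbf{x}_{k+1})-\mathbf{f}(\mathbf{x}_k),
```

so only a matrix-vector product and an outer product are needed per iteration, both of which parallelize well on a GPU.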
Gigou, Pierre-Yves; Dion, Tommy; Asselin, Audrey; Berrigan, Felix; Goulet, Eric D. B.
2012-01-01
This study compared the effect of pre-exercise hyperhydration (PEH) and pre-exercise euhydration (PEE) upon treadmill running time-trial (TT) performance in the heat. Six highly trained runners or triathletes underwent two 18 km TT runs (~28 °C, 25%–30% RH) on a motorized treadmill, in a randomized, crossover fashion, while being euhydrated or after hyperhydration with 26 mL/kg bodyweight (BW) of a 130 mmol/L sodium solution. Subjects then ran four successive 4.5 km blocks alternating between 2.5 km at 1% and 2 km at 6% gradient, while drinking a total of 7 mL/kg BW of a 6% sports drink solution (Gatorade, USA). PEH increased BW by 1.00 ± 0.34 kg (P < 0.01) and, compared with PEE, reduced BW loss from 3.1% ± 0.3% (EUH) to 1.4% ± 0.4% (HYP) (P < 0.01) during exercise. Running TT time did not differ between groups (PEH: 85.6 ± 11.6 min; PEE: 85.3 ± 9.6 min, P = 0.82). Heart rate (5 ± 1 beats/min) and rectal (0.3 ± 0.1 °C) and body (0.2 ± 0.1 °C) temperatures of PEE were higher than those of PEH (P < 0.05). Our results suggest that pre-exercise sodium-induced hyperhydration of a magnitude of 1 L does not alter 80–90 min running TT performance under warm conditions in highly-trained runners drinking ~500 mL sports drink during exercise. PMID:23016126
Directory of Open Access Journals (Sweden)
Jochen Delrue
2016-07-01
Full Text Available In a sample of long distance runners, we examined the role of the type of intrapersonal achievement goals (i.e., approach versus avoidance) and the type of underlying reasons (i.e., autonomous and controlled), assessed prior to the race, as predictors of both pre-race (e.g., race appraisals) and post-race (e.g., flow experience) outcomes. Of 221 runners (62.4% male), 111 reported pursuing an intrapersonal-approach goal (i.e., doing better than before) as their dominant or preferred achievement goal for the race, while 86 prioritized intrapersonal-avoidance goals (i.e., avoiding performing worse than before). Regression and path analyses showed that the type of achievement goal predicted none of the outcomes except for running time, with approach goals predicting better performance than avoidance goals. Path analyses revealed that autonomous reasons underlying intrapersonal goal pursuit related positively to pre-race challenge appraisals, performance and, via need satisfaction, to flow experience. Interestingly, controlled reasons related positively to pre-race threat appraisals and positively predicted both positive and negative self-talk, with the two yielding opposing relations with flow. These findings complement past research on the intersection between the Achievement Goal Approach and Self-Determination Theory and highlight the value of studying the reasons underlying intrapersonal achievement goals.
Programming Framework for Node Heterogeneous GPU Cluster
Institute of Scientific and Technical Information of China (English)
盛冲冲; 胡新明; 李佳佳; 吴百锋
2015-01-01
The mainstream programming method for heterogeneous GPU clusters is hybrid MPI/CUDA or simple variations of it. However, because hybrid MPI/CUDA is not transparent to the underlying cluster architecture, programmers need detailed knowledge of the hardware resources when writing applications for GPU clusters, which makes programs more complex and less portable. This paper presents the Distributed Parallel Programming Framework (DISPAR), a new programming framework for node-level heterogeneous GPU clusters based on a data-flow model. The DISPAR framework contains two sub-systems: (1) StreamCC, a code conversion tool that automatically converts DISPAR source code into hybrid MPI/CUDA code; and (2) StreamMAP, a run-time system that automatically detects heterogeneous computing resources and maps tasks to appropriate computing units. Experimental results show that the framework effectively simplifies the programming of GPU cluster applications and makes efficient use of the computing resources of heterogeneous GPU clusters; moreover, the programs do not depend on the hardware platform and have good portability.
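A minimal hybrid MPI/CUDA skeleton of the kind such frameworks generate or replace could look like the following sketch; the round-robin rank-to-GPU mapping is a common convention and an assumption here, not DISPAR's documented behaviour.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

// Minimal hybrid MPI/CUDA skeleton: each MPI rank binds to one local GPU.
// Compile with nvcc and link against the MPI library of the target cluster.
int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int nDevices = 0;
    cudaGetDeviceCount(&nDevices);
    if (nDevices > 0)
        cudaSetDevice(rank % nDevices);   // round-robin mapping of ranks to GPUs

    printf("rank %d of %d using GPU %d of %d\n",
           rank, size, nDevices > 0 ? rank % nDevices : -1, nDevices);

    // ... launch CUDA kernels on this rank's GPU and exchange halos with MPI ...

    MPI_Finalize();
    return 0;
}
```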
Energy Technology Data Exchange (ETDEWEB)
Almeida, Adino Americo Heimlich
2009-07-01
Graphics Processing Units (GPUs) are high-performance co-processors originally intended to improve the use and quality of computer graphics applications. Once researchers and practitioners realized the potential of using GPUs for general purposes, their application was extended to other fields beyond the scope of computer graphics. The main objective of this work is to evaluate the impact of using the GPU in two typical problems of the nuclear area: neutron transport simulation using the Monte Carlo method, and solving the heat equation in a two-dimensional domain by the finite difference method. To achieve this, we develop parallel algorithms for the GPU and the CPU for the two problems described above. The comparison showed that the GPU-based approach is faster than the CPU in a computer with two quad-core processors, without precision loss. (author)
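For the second problem, a minimal CUDA sketch of one explicit finite-difference step of the 2D heat equation is given below; the names and the fixed-boundary treatment are assumptions, not the thesis code.

```cuda
#include <cuda_runtime.h>

// One explicit finite-difference time step of the 2D heat equation
//   u_t = alpha * (u_xx + u_yy)
// on a uniform grid; r = alpha * dt / h^2 must satisfy r <= 0.25 for stability.
__global__ void heatStep(const float* u, float* uNew, int nx, int ny, float r)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i <= 0 || i >= nx - 1 || j <= 0 || j >= ny - 1) return;   // keep boundary values fixed

    int idx = j * nx + i;
    uNew[idx] = u[idx] + r * (u[idx - 1] + u[idx + 1]
                            + u[idx - nx] + u[idx + nx] - 4.0f * u[idx]);
}
```

The host swaps the u and uNew buffers after every launch, so the kernel never reads and writes the same array within one step.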
Miki, T; Wang, X; Aoki, T; Imai, Y; Ishikawa, T; Takase, K; Yamaguchi, T
2012-01-01
In this paper, we propose a novel patient-specific method of modelling pulmonary airflow using graphics processing unit (GPU) computation that can be applied in medical practice. To overcome the barriers imposed by computation speed, installation price and footprint to the application of computational fluid dynamics, we focused on GPU computation and the lattice Boltzmann method (LBM). The GPU computation and LBM are compatible due to the characteristics of the GPU. As the optimisation of data access is essential for the performance of the GPU computation, we developed an adaptive meshing method, in which an airway model is covered by isotropic subdomains consisting of a uniform Cartesian mesh. We found that subdomains of size 4^3 gave the best performance. The code was also tested on a small GPU cluster to confirm its performance and applicability, as the price and footprint are reasonable for medical applications.
Gigou, Pierre-Yves; Dion, Tommy; Asselin, Audrey; Berrigan, Felix; Goulet, Eric D B
2012-08-01
This study compared the effect of pre-exercise hyperhydration (PEH) and pre-exercise euhydration (PEE) upon treadmill running time-trial (TT) performance in the heat. Six highly trained runners or triathletes underwent two 18 km TT runs (~28 °C, 25%-30% RH) on a motorized treadmill, in a randomized, crossover fashion, while being euhydrated or after hyperhydration with 26 mL/kg bodyweight (BW) of a 130 mmol/L sodium solution. Subjects then ran four successive 4.5 km blocks alternating between 2.5 km at 1% and 2 km at 6% gradient, while drinking a total of 7 mL/kg BW of a 6% sports drink solution (Gatorade, USA). PEH increased BW by 1.00 ± 0.34 kg (P < 0.01) and, compared with PEE, reduced BW loss from 3.1% ± 0.3% (EUH) to 1.4% ± 0.4% (HYP) (P < 0.01) during exercise. Running TT time did not differ between groups (PEH: 85.6 ± 11.6 min; PEE: 85.3 ± 9.6 min, P = 0.82). Heart rate (5 ± 1 beats/min) and rectal (0.3 ± 0.1 °C) and body (0.2 ± 0.1 °C) temperatures of PEE were higher than those of PEH (P < 0.05). There was no significant difference in abdominal discomfort and perceived exertion or heat stress between groups. Our results suggest that pre-exercise sodium-induced hyperhydration of a magnitude of 1 L does not alter 80-90 min running TT performance under warm conditions in highly-trained runners drinking ~500 mL sports drink during exercise.
Directory of Open Access Journals (Sweden)
Eric D. B. Goulet
2012-08-01
Full Text Available This study compared the effect of pre-exercise hyperhydration (PEH) and pre-exercise euhydration (PEE) upon treadmill running time-trial (TT) performance in the heat. Six highly trained runners or triathletes underwent two 18 km TT runs (~28 °C, 25%–30% RH) on a motorized treadmill, in a randomized, crossover fashion, while being euhydrated or after hyperhydration with 26 mL/kg bodyweight (BW) of a 130 mmol/L sodium solution. Subjects then ran four successive 4.5 km blocks alternating between 2.5 km at 1% and 2 km at 6% gradient, while drinking a total of 7 mL/kg BW of a 6% sports drink solution (Gatorade, USA). PEH increased BW by 1.00 ± 0.34 kg (P < 0.01) and, compared with PEE, reduced BW loss from 3.1% ± 0.3% (EUH) to 1.4% ± 0.4% (HYP) (P < 0.01) during exercise. Running TT time did not differ between groups (PEH: 85.6 ± 11.6 min; PEE: 85.3 ± 9.6 min, P = 0.82). Heart rate (5 ± 1 beats/min) and rectal (0.3 ± 0.1 °C) and body (0.2 ± 0.1 °C) temperatures of PEE were higher than those of PEH (P < 0.05). There was no significant difference in abdominal discomfort and perceived exertion or heat stress between groups. Our results suggest that pre-exercise sodium-induced hyperhydration of a magnitude of 1 L does not alter 80–90 min running TT performance under warm conditions in highly-trained runners drinking ~500 mL sports drink during exercise.
Combating the Reliability Challenge of GPU Register File at Low Supply Voltage
Energy Technology Data Exchange (ETDEWEB)
Tan, Jingweijia; Song, Shuaiwen; Yan, Kaige; Fu, Xin; Marquez, Andres; Kerbyson, Darren J.
2016-09-11
Supply voltage reduction is an effective approach to significantly reduce GPU energy consumption. As the largest on-chip storage structure, the GPU register file becomes the reliability hotspot that prevents further supply voltage reduction below the safe limit (Vmin) due to process variation effects. This work addresses the reliability challenge of the GPU register file at low supply voltages, which is an essential first step for aggressive supply voltage reduction of the entire GPU chip. We propose GR-Guard, an architectural solution that leverages long register dead time to enable reliable operations from unreliable register file at low voltages.
Nagaoka, Tomoaki; Watanabe, Soichi
2012-01-01
Electromagnetic simulation with anatomically realistic computational human models using the finite-difference time-domain (FDTD) method has recently been performed in a number of fields in biomedical engineering. To improve the method's calculation speed and realize large-scale computing with the computational human model, we adapt a three-dimensional FDTD code to a multi-GPU cluster environment with Compute Unified Device Architecture and Message Passing Interface. Our multi-GPU cluster system consists of three nodes, with seven GPU boards (NVIDIA Tesla C2070) mounted on each node. We examined the performance of the FDTD calculation in the multi-GPU cluster environment. We confirmed that the FDTD calculation on the multi-GPU cluster is faster than on a multi-GPU single workstation, and we also found that the GPU cluster system calculates faster than a vector supercomputer. In addition, our GPU cluster system allowed us to perform large-scale FDTD calculations because we were able to use over 100 GB of GPU memory.
GPU phase-field lattice Boltzmann simulations of growth and motion of a binary alloy dendrite
Takaki, T.; Rojas, R.; Ohno, M.; Shimokawabe, T.; Aoki, T.
2015-06-01
A GPU code has been developed for a phase-field lattice Boltzmann (PFLB) method, which can simulate dendritic growth with motion of solids in a dilute binary alloy melt. The GPU-accelerated PFLB method has been implemented using CUDA C. Equiaxed dendritic growth under shear flow and settling conditions has been simulated with the developed GPU code. It has been confirmed that the PFLB simulations were efficiently accelerated by introducing GPU computation. The characteristic dendrite morphologies, which depend on the melt flow and the motion of the dendrite, could also be confirmed by the simulations.
GPU-Based Tracking Algorithms for the ATLAS High-Level Trigger
Emeliyanov, D; The ATLAS collaboration
2012-01-01
GPU-accelerated event processing is one of the possible options for the ATLAS High-Level Trigger (HLT) upgrade for higher LHC luminosity. This poster presents data preparation and track finding algorithms specifically designed to run on a GPU using a “client-server” solution for hybrid CPU/GPU event processing and integration of the GPU algorithms into existing ATLAS HLT software. The resulting speed-up of event processing times obtained with high-luminosity simulated data is presented and discussed.
GPU-based 3D lower tree wavelet video encoder
Galiano, Vicente; López-Granado, Otoniel; Malumbres, Manuel P.; Drummond, Leroy Anthony; Migallón, Hector
2013-12-01
The 3D-DWT is a mathematical tool of increasing importance in applications that require efficient processing of huge amounts of volumetric information. Other applications, like professional video editing, video surveillance, multi-spectral satellite imaging and HQ video delivery, would also benefit from 3D-DWT encoders able to reconstruct a frame as fast as possible. In this article, we introduce a fast GPU-based encoder which uses the 3D-DWT transform and lower trees. We also present an exhaustive analysis of the use of GPU memory. Our proposal shows a good trade-off between R/D, coding delay (as fast as MPEG-2 for high definition) and memory requirements (up to 6 times less memory than x264).
GPU-ACCELERATED FEM SOLVER FOR THREE DIMENSIONAL ELECTROMAGNETIC ANALYSIS
Institute of Scientific and Technical Information of China (English)
Tian Jin; Gong Li; Shi Xiaowei; Le Xu
2011-01-01
A new Graphics Processing Unit (GPU) parallelization strategy is proposed to accelerate sparse finite element computation for three-dimensional electromagnetic analysis. The parallelization strategy is based on a new compression format called sliced ELL Four (sliced ELL-F). The sliced ELL-F format-based parallelization strategy is designed to speed up the many addition, dot product, and Sparse Matrix Vector Product (SMVP) operations in the Conjugate Gradient Norm (CGN) solution of the finite element equations. The new implementation of SMVP on GPUs is evaluated. The proposed strategy executed on a GPU can efficiently solve sparse finite element equations, especially when the equations are highly sparse (most rows of the coefficient matrix have fewer than 8 nonzeros). Numerical results show that the sliced ELL-F format-based parallelization strategy can reach significant speedups compared to the Compressed Sparse Row (CSR) format.
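The sliced ELL-F layout itself is not reproduced here, but the plain ELL sparse matrix-vector product on which such formats build can be sketched as follows (an illustrative baseline under assumed names, not the paper's kernel):

```cuda
#include <cuda_runtime.h>

// SpMV in plain ELL format: every row stores maxNnz entries, padded with zeros.
// Both arrays have size numRows * maxNnz and are stored column-major, so threads
// in a warp read consecutive memory addresses at each step of the loop.
__global__ void spmvEll(int numRows, int maxNnz,
                        const int* colIdx, const float* vals,
                        const float* x, float* y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= numRows) return;

    float sum = 0.0f;
    for (int k = 0; k < maxNnz; ++k) {
        int   c = colIdx[k * numRows + row];
        float v = vals[k * numRows + row];
        sum += v * x[c];   // padded entries carry v == 0, so they contribute nothing
    }
    y[row] = sum;
}
```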
Interactive brain shift compensation using GPU based programming
van der Steen, Sander; Noordmans, Herke Jan; Verdaasdonk, Rudolf
2009-02-01
Processing large image files or real-time video streams requires intense computational power. Driven by the gaming industry, the processing power of graphics processing units (GPUs) has increased significantly. With pixel shader model 4.0, the GPU can be used for image processing 10x faster than the CPU. Dedicated software was developed to deform 3D MR and CT image sets for real-time brain shift correction during navigated neurosurgery, using landmarks or cortical surface traces defined by the navigation pointer. Feedback was given using orthogonal slices and an interactively ray-traced 3D brain image. GPU-based programming enables real-time processing of high-definition image datasets, and various applications can be developed in medicine, optics and image sciences.
Performance potential for simulating spin models on GPU
Weigel, Martin
2011-01-01
Graphics processing units (GPUs) are being used to an increasing degree for general computational purposes. This development is motivated by their theoretical peak performance, which significantly exceeds that of broadly available CPUs. For practical purposes, however, it is far from clear how much of this theoretical performance can be realized in actual scientific applications. As is discussed here for the case of studying classical spin models of statistical mechanics by Monte Carlo simulations, only an explicit tailoring of the involved algorithms to the specific architecture under consideration allows one to harvest the computational power of GPU systems. A number of examples, ranging from Metropolis simulations of ferromagnetic Ising models, over continuous Heisenberg and disordered spin-glass systems, to parallel-tempering simulations are discussed. Significant speed-ups by factors of up to 1000 compared to serial CPU code as well as previous GPU implementations are observed.
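For orientation, a minimal checkerboard Metropolis half-sweep for the 2D ferromagnetic Ising model might look like the sketch below; the hash-based random numbers are a placeholder for a proper generator such as cuRAND, and the kernel is not taken from the paper.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Simple hash used as a stand-in random number generator for this sketch;
// production code would use cuRAND or a counter-based RNG instead.
__device__ float uniformHash(unsigned int a, unsigned int b)
{
    unsigned int h = a * 2654435761u ^ b * 2246822519u;
    h ^= h >> 13; h *= 2654435761u; h ^= h >> 16;
    return (h & 0x00FFFFFF) / 16777216.0f;
}

// One Metropolis half-sweep of the 2D Ising model at inverse temperature beta.
// Only sites of one checkerboard colour (parity) are updated per launch, so
// all updates within the kernel are independent of each other.
__global__ void metropolisSweep(int* spin, int L, float beta, int parity, unsigned int seed)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= L || y >= L || ((x + y) & 1) != parity) return;

    int idx = y * L + x;
    int s = spin[idx];
    int nb = spin[y * L + (x + 1) % L] + spin[y * L + (x + L - 1) % L]   // periodic boundaries
           + spin[((y + 1) % L) * L + x] + spin[((y + L - 1) % L) * L + x];

    float dE = 2.0f * s * nb;                 // energy change of flipping spin s (J = 1)
    if (dE <= 0.0f || uniformHash(idx, seed) < expf(-beta * dE))
        spin[idx] = -s;
}
```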
GPU and APU computations of Finite Time Lyapunov Exponent fields
Conti, Christian; Rossinelli, Diego; Koumoutsakos, Petros
2012-03-01
We present GPU and APU accelerated computations of Finite-Time Lyapunov Exponent (FTLE) fields. The calculation of FTLEs is a computationally intensive process, as in order to obtain the sharp ridges associated with the Lagrangian Coherent Structures an extensive resampling of the flow field is required. The computational performance of this resampling is limited by the memory bandwidth of the underlying computer architecture. The present technique harnesses data-parallel execution of many-core architectures and relies on fast and accurate evaluations of moment conserving functions for the mesh to particle interpolations. We demonstrate how the computation of FTLEs can be efficiently performed on a GPU and on an APU through OpenCL and we report over one order of magnitude improvements over multi-threaded executions in FTLE computations of bluff body flows.
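For reference, the FTLE field whose ridges are extracted is commonly defined (stated from the general literature, not from the paper) through the right Cauchy-Green deformation tensor of the flow map:

```latex
\sigma_{t_0}^{T}(\mathbf{x}) \;=\; \frac{1}{|T|}\,\ln\!\sqrt{\lambda_{\max}\!\left(C\right)},
\qquad
C \;=\; \left(\nabla \Phi_{t_0}^{\,t_0+T}(\mathbf{x})\right)^{\!\mathsf T}\,\nabla \Phi_{t_0}^{\,t_0+T}(\mathbf{x}),
```

where the flow map Phi is obtained by integrating particle trajectories over the interval T; the trajectory integration and the resampling of the flow map are the data-parallel steps mapped to the GPU and APU.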
A GPU-Based Wide-Band Radio Spectrometer
Chennamangalam, Jayanth; Jones, Glenn; Chen, Hong; Ford, John; Kepley, Amanda; Lorimer, D R; Nie, Jun; Prestage, Richard; Roshi, D Anish; Wagner, Mark; Werthimer, Dan
2014-01-01
The Graphics Processing Unit (GPU) has become an integral part of astronomical instrumentation, enabling high-performance online data reduction and accelerated online signal processing. In this paper, we describe a wide-band reconfigurable spectrometer built using an off-the-shelf GPU card. This spectrometer, when configured as a polyphase filter bank (PFB), supports a dual-polarization bandwidth of up to 1.1 GHz (or a single-polarization bandwidth of up to 2.2 GHz) on the latest generation of GPUs. On the other hand, when configured as a direct FFT, the spectrometer supports a dual-polarization bandwidth of up to 1.4 GHz (or a single-polarization bandwidth of up to 2.8 GHz).
GPU Based Software Correlators - Perspectives for VLBI2010
Hobiger, Thomas; Kimura, Moritaka; Takefuji, Kazuhiro; Oyama, Tomoaki; Koyama, Yasuhiro; Kondo, Tetsuro; Gotoh, Tadahiro; Amagai, Jun
2010-01-01
Owing to their historically separate development and driven by the requirements of the PC gaming industry, Graphics Processing Units (GPUs) have evolved into massively parallel processing systems that have entered the area of non-graphics applications. Although a single processing core on the GPU is much slower and provides less functionality than its counterpart on the CPU, the huge number of these small processing entities outperforms classical processors when the application can be parallelized. Thus, in recent years various radio astronomical projects have started to make use of this technology, either to realize the correlator on this platform or to establish the post-processing pipeline with GPUs. Therefore, the feasibility of GPUs as a choice for a VLBI correlator is investigated, including the pros and cons of this technology. Additionally, a GPU based software correlator is reviewed with respect to energy consumption per GFlop/s and cost per GFlop/s.
Implementation and Optimization of Image Processing Algorithms on Embedded GPU
Singhal, Nitin; Yoo, Jin Woo; Choi, Ho Yeol; Park, In Kyu
In this paper, we analyze the key factors underlying the implementation, evaluation, and optimization of image processing and computer vision algorithms on an embedded GPU using the OpenGL ES 2.0 shader model. First, we present the characteristics of the embedded GPU and its inherent advantages over an embedded CPU. Additionally, we propose techniques to achieve increased performance with optimized shader design. To show the effectiveness of the proposed techniques, we employ cartoon-style non-photorealistic rendering (NPR), speeded-up robust feature (SURF) detection, and stereo matching as our example algorithms. Performance is evaluated in terms of execution time and the speed-up achieved in comparison with the implementation on the embedded CPU.
Betatron tune measurement with the LHC damper using a GPU
Dubouchet, Frédéric; Höfle, Wolfgang
This thesis studies a possible future implementation of a betatron tune measurement in the Large Hadron Collider (LHC) at the European Organization for Nuclear Research (CERN) using a General Purpose Graphics Processing Unit (GPGPU) to analyse data acquired with the LHC transverse damper (ADT). The present hardware and possible future implementations using ADT acquisitions and Graphics Processing Unit (GPU) computing are described. The ADT data have to be processed to extract the betatron tune. To compute the tune, the signal is transformed from the time domain to the frequency domain using the Fast Fourier Transform (FFT) on GPUs. We show that it is possible to achieve FFTs one order of magnitude faster on a Fermi-generation GPU than on an i7-generation Central Processing Unit (CPU). This makes online per-bunch FFT computation and betatron tune measurement possible.
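As a rough illustration of the FFT step described above, the following sketch computes the power spectrum of a per-bunch turn-by-turn signal with cuFFT; the tune would be read off from the peak bin. The function and buffer names are hypothetical and the plan is created per call only for brevity.

```cuda
#include <cuda_runtime.h>
#include <cufft.h>

// Squared magnitude of each complex spectral bin.
__global__ void magnitude(const cufftComplex* spec, float* mag, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) mag[i] = spec[i].x * spec[i].x + spec[i].y * spec[i].y;
}

// d_turns: device buffer of nTurns real samples for one bunch.
// d_mag:   device buffer of nTurns/2 + 1 output power values.
void bunch_spectrum(float* d_turns, float* d_mag, int nTurns)
{
    cufftComplex* d_spec;
    cudaMalloc(&d_spec, sizeof(cufftComplex) * (nTurns / 2 + 1));

    cufftHandle plan;
    cufftPlan1d(&plan, nTurns, CUFFT_R2C, 1);      // real-to-complex transform
    cufftExecR2C(plan, (cufftReal*)d_turns, d_spec);

    int nBins = nTurns / 2 + 1;
    magnitude<<<(nBins + 255) / 256, 256>>>(d_spec, d_mag, nBins);

    cufftDestroy(plan);
    cudaFree(d_spec);
}
```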
Numerical simulation of lava flow using a GPU SPH model
Eugenio Rustico; Annamaria Vicari; Giuseppe Bilotta; Alexis Hérault; Ciro Del Negro
2011-01-01
A smoothed particle hydrodynamics (SPH) method for lava-flow modeling was implemented on a graphics processing unit (GPU) using the compute unified device architecture (CUDA) developed by NVIDIA. This resulted in speed-ups of up to two orders of magnitude. The three-dimensional model can simulate lava flow on a real topography with a free surface, non-Newtonian fluids, and phase change. The entire SPH code has three main components: neighbor list construction, force computation, an...
A GPU Accelerated Spring Mass System for Surgical Simulation
DEFF Research Database (Denmark)
Mosegaard, Jesper; Sørensen, Thomas Sangild
2005-01-01
There is a growing demand for surgical simulators to do fast and precise calculations of tissue deformation to simulate increasingly complex morphology in real-time. Unfortunately, even fast spring-mass based systems have slow convergence rates for large models. This paper presents a method to accelerate computation of a spring-mass system in order to simulate a complex organ such as the heart. This acceleration is achieved by taking advantage of modern graphics processing units (GPU).
FIESTA 4: optimized Feynman integral calculations with GPU support
Smirnov, Alexander V
2015-01-01
This paper presents a new major release of the program FIESTA (Feynman Integral Evaluation by a Sector decomposiTion Approach). The new release is mainly aimed at optimal performance at large scales when one is increasing the number of sampling points in order to reduce the uncertainty estimates. The release now supports graphical processor units (GPU) for the numerical integration, methods to optimize cluster-usage, as well as other speed, memory, and stability improvements.
Stream programming framework for global ilumination techniques using a GPU
Marino, Federico J.; Abbate, Horacio Antonio
2007-01-01
Stream processors are starting to become an affordable alternative for implementing hardware-assisted rendering techniques that were usually relegated to offline use. We built a stream-processing framework based on the concepts of the Stream Programming model, and selected the Photon Mapping algorithm and an Nvidia GPU (Graphics Processing Unit) for a test-case implementation. We defined a set of C++ classes ...
Basket Option Pricing Using GP-GPU Hardware Acceleration
Douglas, Craig C.
2010-08-01
We introduce a basket option pricing problem arising in financial mathematics. We discretize the problem using the alternating direction implicit (ADI) method, and parallel cyclic reduction is applied to solve the set of tridiagonal matrices generated by the ADI method. To reduce the computational time of the problem, a general purpose graphics processing unit (GP-GPU) environment is considered. Numerical results confirm the convergence and efficiency of the proposed method. © 2010 IEEE.
Parallel Implementation of Texture Based Image Retrieval on The GPU
Directory of Open Access Journals (Sweden)
Alireza Ahmadi Mohammadabadi
2013-07-01
Most image processing algorithms are inherently parallel, so multithreaded processors are suitable for such applications. In huge image databases, image processing takes a very long time on a single-core processor because the algorithms execute in a single thread. Graphics Processing Units (GPUs) are becoming common in image processing applications due to multithreaded execution of algorithms, programmability and low cost. In this paper we implement a texture based image retrieval system in parallel using the Compute Unified Device Architecture (CUDA) programming model to run on the GPU. The main goal of this research work is to parallelize the process of texture based image retrieval through entropy, standard deviation, and local range, making the whole process much faster than a serial implementation. Our work makes extensive use of the highly multithreaded architecture of the multi-core GPU. We evaluated the retrieval of the proposed technique using Recall, Precision, and Average Precision measures. Experimental results showed that the parallel implementation led to an average speed-up of 140.046× over the serial implementation. The average Precision and average Recall of the presented method are 39.67% and 55.00%, respectively.
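For context, a minimal sketch of one of the per-pixel texture features mentioned above (a local standard deviation map) computed with one thread per pixel; the window radius, image layout and names are illustrative assumptions rather than the authors' CUDA code.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Local standard deviation over a (2R+1)x(2R+1) window, one thread per pixel.
#define R 2  // window radius -> 5x5 window

__global__ void local_stddev(const unsigned char* img, float* out,
                             int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f, sumSq = 0.0f;
    int   n   = 0;
    for (int dy = -R; dy <= R; ++dy)
        for (int dx = -R; dx <= R; ++dx) {
            int xx = min(max(x + dx, 0), width  - 1);   // clamp at image borders
            int yy = min(max(y + dy, 0), height - 1);
            float v = img[yy * width + xx];
            sum += v; sumSq += v * v; ++n;
        }
    float mean = sum / n;
    out[y * width + x] = sqrtf(fmaxf(sumSq / n - mean * mean, 0.0f));
}
```

Entropy and local-range features follow the same per-pixel window pattern with a different accumulation.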
Ultra-Fast Image Reconstruction of Tomosynthesis Mammography Using GPU
Directory of Open Access Journals (Sweden)
Arefan D
2015-06-01
Digital Breast Tomosynthesis (DBT) is a technology that creates three-dimensional (3D) images of breast tissue. Tomosynthesis mammography detects lesions that are not detectable with other imaging systems. If the image reconstruction time is on the order of seconds, Tomosynthesis systems can be used to perform Tomosynthesis-guided interventional procedures. This research was designed to study an ultra-fast image reconstruction technique for Tomosynthesis mammography systems using the Graphics Processing Unit (GPU). First, projections of Tomosynthesis mammography were simulated. In order to produce the Tomosynthesis projections, a 3D breast phantom was designed from empirical data, based on MRI data in its natural form. Projections were then created from the 3D breast phantom. The FBP-based image reconstruction algorithm was programmed in C++ in two versions, one using the central processing unit (CPU) and one using the graphics processing unit (GPU), and the image reconstruction time was measured for both implementations.
GPU Lossless Hyperspectral Data Compression System for Space Applications
Keymeulen, Didier; Aranki, Nazeeh; Hopson, Ben; Kiely, Aaron; Klimesh, Matthew; Benkrid, Khaled
2012-01-01
On-board lossless hyperspectral data compression reduces data volume in order to meet NASA and DoD limited downlink capabilities. At JPL, a novel, adaptive and predictive technique for lossless compression of hyperspectral data, named the Fast Lossless (FL) algorithm, was recently developed. This technique uses an adaptive filtering method and achieves state-of-the-art performance in both compression effectiveness and low complexity. Because of its outstanding performance and suitability for real-time onboard hardware implementation, the FL compressor is being formalized as the emerging CCSDS Standard for Lossless Multispectral & Hyperspectral image compression. The FL compressor is well-suited for parallel hardware implementation. A GPU hardware implementation was developed for FL targeting the current state-of-the-art GPUs from NVIDIA(Trademark). The GPU implementation on a NVIDIA(Trademark) GeForce(Trademark) GTX 580 achieves a throughput performance of 583.08 Mbits/sec (44.85 MSamples/sec) and an acceleration of at least 6 times a software implementation running on a 3.47 GHz single core Intel(Trademark) Xeon(Trademark) processor. This paper describes the design and implementation of the FL algorithm on the GPU. The massively parallel implementation will provide in the future a fast and practical real-time solution for airborne and space applications.
High-speed optical coherence tomography signal processing on GPU
Energy Technology Data Exchange (ETDEWEB)
Li Xiqi; Shi Guohua; Zhang Yudong, E-mail: lixiqi@yahoo.cn [Laboratory on Adaptive Optics, Institute of Optics and Electronics, Chinese Academy of Sciences, Chengdu 610209 (China)
2011-01-01
The signal processing speed of spectral domain optical coherence tomography (SD-OCT) has become a bottleneck in many medical applications. Recently, a time-domain interpolation method was proposed. This method not only achieves a better signal-to-noise ratio (SNR) but also a faster signal processing time for SD-OCT than the widely used zero-padding interpolation method. Furthermore, the re-sampled data are obtained by convolving the acquired data with the coefficients in the time domain, so many interpolations can be performed concurrently; this interpolation method is therefore well suited for parallel computing. Ultra-high-speed optical coherence tomography signal processing can be realized by using a graphics processing unit (GPU) with the compute unified device architecture (CUDA). This paper introduces the signal processing steps of SD-OCT on the GPU. An experiment was performed in which a frame of SD-OCT data (400 A-lines × 2048 pixels per A-line) was acquired and processed on the GPU in real time. The results show that the processing can be finished in 6.208 milliseconds, which is 37 times faster than on the Central Processing Unit (CPU).
Implementation of GPU-accelerated back projection for EPR imaging.
Qiao, Zhiwei; Redler, Gage; Epel, Boris; Qian, Yuhua; Halpern, Howard
2015-01-01
Electron paramagnetic resonance (EPR) imaging (EPRI) is a robust method for measuring in vivo oxygen concentration (pO2). For 3D pulse EPRI, a commonly used reconstruction algorithm is the filtered backprojection (FBP) algorithm, in which the backprojection process is computationally intensive and may be time consuming when implemented on a CPU. A multistage implementation of the backprojection can be used for acceleration; however, it is not flexible (it requires an equal linear angle projection distribution) and may still be time consuming. In this work, single-stage backprojection is implemented on a GPU (graphics processing unit) with 1152 cores to accelerate the process. The GPU implementation results in an acceleration by over a factor of 200 overall and by over a factor of 3500 if only the computing time is considered. Some important experiences regarding the implementation of GPU-accelerated backprojection for EPRI are summarized. The resulting accelerated image reconstruction is useful for real-time image reconstruction monitoring and other time sensitive applications.
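To illustrate the single-stage backprojection idea, the following is a hedged 2D sketch in which each thread owns one image pixel and accumulates contributions from all filtered projections; the EPRI work operates in 3D, and the geometry parameters and names here are illustrative assumptions.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Single-stage 2D backprojection: one thread per pixel, loop over all
// (already filtered) projections, linear interpolation between detector bins.
__global__ void backproject2d(const float* proj,   // [nAngles][nBins] filtered projections
                              const float* angles, // projection angles in radians
                              int nAngles, int nBins, float binSize,
                              float* image, int N)  // N x N output image
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ix >= N || iy >= N) return;

    float x = ix - N / 2 + 0.5f;               // pixel centre in grid units
    float y = iy - N / 2 + 0.5f;

    float acc = 0.0f;
    for (int a = 0; a < nAngles; ++a) {
        // signed distance of the pixel from the detector centre for this angle
        float t   = x * cosf(angles[a]) + y * sinf(angles[a]);
        float bin = t / binSize + nBins / 2;
        int   b0  = (int)floorf(bin);
        float w   = bin - b0;                   // linear interpolation weight
        if (b0 >= 0 && b0 + 1 < nBins)
            acc += (1.0f - w) * proj[a * nBins + b0] + w * proj[a * nBins + b0 + 1];
    }
    image[iy * N + ix] = acc * 3.14159265f / nAngles;  // FBP normalisation
}
```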
Bin recycling strategy for improving the histogram precision on GPU
Cárdenas-Montes, Miguel; Rodríguez-Vázquez, Juan José; Vega-Rodríguez, Miguel A.
2016-07-01
The histogram is an easily comprehensible way to present data and analyses. In the current scientific context, with access to large volumes of data, the processing time for building histograms has dramatically increased. For this reason, parallel construction is necessary to alleviate the impact of the processing time on analysis activities. In this scenario, GPU computing is becoming widely used to reduce the processing time of histogram construction to affordable levels. Alongside processing time, implementations are also stressed with respect to bin-count accuracy. Accuracy aspects due to the particularities of the implementations are not usually taken into consideration when building histograms with very large data sets. In this work, a bin recycling strategy to create an accuracy-aware implementation for building histograms on GPU is presented. In order to evaluate the approach, this strategy was applied to the computation of the three-point angular correlation function, which is a relevant function in Cosmology for the study of the Large Scale Structure of the Universe. As a consequence of the study, a high-accuracy implementation for histogram construction on GPU is proposed.
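For context, a minimal sketch of the standard GPU histogram pattern that accuracy-aware strategies of this kind build on: each block keeps a private sub-histogram in shared memory (cheap atomics) and then merges it into the global histogram. The bin count and names are illustrative, and the paper's bin-recycling step itself is not reproduced.

```cuda
#include <cuda_runtime.h>

#define NBINS 256

__global__ void histogram(const float* data, long long n,
                          float lo, float hi, unsigned int* hist)
{
    __shared__ unsigned int sh[NBINS];
    for (int i = threadIdx.x; i < NBINS; i += blockDim.x) sh[i] = 0;
    __syncthreads();

    long long idx    = blockIdx.x * (long long)blockDim.x + threadIdx.x;
    long long stride = (long long)blockDim.x * gridDim.x;
    for (long long i = idx; i < n; i += stride) {
        int bin = (int)((data[i] - lo) / (hi - lo) * NBINS);
        bin = min(max(bin, 0), NBINS - 1);
        atomicAdd(&sh[bin], 1u);            // fast shared-memory atomic
    }
    __syncthreads();

    for (int i = threadIdx.x; i < NBINS; i += blockDim.x)
        atomicAdd(&hist[i], sh[i]);         // merge into the global histogram
}
```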
GPU-Monte Carlo based fast IMRT plan optimization
Directory of Open Access Journals (Sweden)
Yongbao Li
2014-03-01
Purpose: Intensity-modulated radiation treatment (IMRT) plan optimization needs pre-calculated beamlet dose distributions. Pencil-beam or superposition/convolution type algorithms are typically used because of their high computation speed. However, inaccurate beamlet dose distributions, particularly in cases with high levels of inhomogeneity, may mislead optimization and hinder the resulting plan quality. It is desirable to use Monte Carlo (MC) methods for beamlet dose calculations, yet the long computational time from repeated dose calculations for a number of beamlets prevents this application. Our objective is to integrate a GPU-based MC dose engine into lung IMRT optimization using a novel two-step workflow. Methods: A GPU-based MC code, gDPM, is used. Each particle is tagged with the index of the beamlet from which the source particle originates, and deposited doses are stored separately for the beamlets based on this index. Due to the limited GPU memory size, a pyramid space is allocated for each beamlet, and dose outside this space is neglected. A two-step optimization workflow is proposed for fast MC-based optimization. In the first step, a rough dose calculation is conducted with only a small number of particles per beamlet, and plan optimization follows to obtain an approximate fluence map. In the second step, more accurate beamlet doses are calculated, where the number of particles sampled for a beamlet is proportional to the intensity determined previously. A second round of optimization is conducted, yielding the final result. Results: For a lung case with 5317 beamlets, 10^5 particles per beamlet in the first round and 10^8 particles per beam in the second round are enough to obtain a good plan quality. The total simulation time is 96.4 sec. Conclusion: A fast GPU-based MC dose calculation method along with a novel two-step optimization workflow is developed. The high efficiency allows the use of MC for IMRT optimizations.
Eckert, C. H. J.; Zenker, E.; Bussmann, M.; Albach, D.
2016-10-01
We present an adaptive Monte Carlo algorithm for computing the amplified spontaneous emission (ASE) flux in laser gain media pumped by pulsed lasers. With the design of high power lasers in mind, which require large size gain media, we have developed the open source code HASEonGPU that is capable of utilizing multiple graphics processing units (GPUs). With HASEonGPU, time to solution is reduced to minutes on a medium size GPU cluster of 64 NVIDIA Tesla K20m GPUs, and excellent speedup is achieved when scaling to multiple GPUs. Comparison of simulation results to measurements of ASE in Yb3+:YAG ceramics shows perfect agreement.
GPU-based Integration with Application in Sensitivity Analysis
Atanassov, Emanouil; Ivanovska, Sofiya; Karaivanova, Aneta; Slavov, Dimitar
2010-05-01
The presented work is an important part of the grid application MCSAES (Monte Carlo Sensitivity Analysis for Environmental Studies), whose aim is to develop an efficient Grid implementation of a Monte Carlo based approach for sensitivity studies in the domains of environmental modelling and environmental security. The goal is to study the damaging effects that can be caused by high pollution levels (especially effects on human health), when the main modeling tool is the Danish Eulerian Model (DEM). Generally speaking, sensitivity analysis (SA) is the study of how the variation in the output of a mathematical model can be apportioned, qualitatively or quantitatively, to different sources of variation in the input of the model. One of the important classes of methods for sensitivity analysis is Monte Carlo based, first proposed by Sobol and then developed by Saltelli and his group. In MCSAES the general Saltelli procedure has been adapted for SA of the Danish Eulerian Model. In our case we consider as factors the constants determining the speeds of the chemical reactions in the DEM, and as output a certain aggregated measure of the pollution. Sensitivity simulations lead to huge computational tasks (systems with up to 4 × 10^9 equations at every time step, and the number of time steps can be more than a million), which motivates the grid implementation. The MCSAES grid implementation scheme includes two main tasks: (i) grid implementation of the DEM, and (ii) grid implementation of the Monte Carlo integration. In this work we present our new developments in the integration part of the application. We have developed an algorithm for GPU-based generation of scrambled quasirandom sequences which can be combined with the CPU-based computations related to the SA. Owen first proposed scrambling of the Sobol sequence through permutation in a manner that improves the convergence rates. Scrambling is necessary not only for error analysis but also for parallel implementations. Good scrambling is
Accelerated rescaling of single Monte Carlo simulation runs with the Graphics Processing Unit (GPU).
Yang, Owen; Choi, Bernard
2013-01-01
To interpret fiber-based and camera-based measurements of remitted light from biological tissues, researchers typically use analytical models, such as the diffusion approximation to light transport theory, or stochastic models, such as Monte Carlo modeling. To achieve rapid (ideally real-time) measurement of tissue optical properties, especially in clinical situations, there is a critical need to accelerate Monte Carlo simulation runs. In this manuscript, we report on our approach of using the Graphics Processing Unit (GPU) to accelerate rescaling of single Monte Carlo runs in order to rapidly calculate diffuse reflectance values for different sets of tissue optical properties. We selected MATLAB to enable non-specialists in C and CUDA-based programming to use the generated open-source code. We developed a software package with four abstraction layers. To calculate a set of diffuse reflectance values from a simulated tissue with homogeneous optical properties, our rescaling GPU-based approach achieves a reduction in computation time of several orders of magnitude as compared to other GPU-based approaches. Specifically, our GPU-based approach generated a diffuse reflectance value in 0.08 ms. The transfer time from CPU to GPU memory is currently a limiting factor for GPU-based calculations. However, for calculation of multiple diffuse reflectance values, our GPU-based approach can still lead to processing that is ~3400 times faster than other GPU-based approaches.
Shi, Yulin; Veidenbaum, Alexander V.; Nicolau, Alex; Xu, Xiangmin
2014-01-01
Background: Modern neuroscience research demands computing power. Neural circuit mapping studies such as those using laser scanning photostimulation (LSPS) produce large amounts of data and require intensive computation for post-hoc processing and analysis. New Method: Here we report on the design and implementation of a cost-effective desktop computer system for accelerated experimental data processing with recent GPU computing technology. A new version of Matlab software with GPU enabled functions is used to develop programs that run on Nvidia GPUs to harness their parallel computing power. Results: We evaluated both the central processing unit (CPU) and GPU-enabled computational performance of our system in benchmark testing and practical applications. The experimental results show that the GPU-CPU co-processing of simulated data and actual LSPS experimental data clearly outperformed the multi-core CPU, with up to a 22x speedup depending on the computational task. Further, we present a comparison of numerical accuracy between GPU and CPU computation to verify the precision of GPU computation. In addition, we show how GPUs can be effectively adapted to improve the performance of commercial image processing software such as Adobe Photoshop. Comparison with Existing Method(s): To the best of our knowledge, this is the first demonstration of GPU application in neural circuit mapping and electrophysiology-based data processing. Conclusions: Together, GPU enabled computation enhances our ability to process large-scale data sets derived from neural circuit mapping studies, allowing for increased processing speeds while retaining data precision. PMID:25277633
Run-time infrastructure of distributed simulation based on Web Services
Institute of Scientific and Technical Information of China (English)
吴泽彬; 吴慧中; 李蔚清
2009-01-01
To improve the reusability and interoperability of the High Level Architecture (HLA), the traditional HLA distributed simulation model was decoupled by introducing a simulation application layer and a simulation communication layer that separate the simulation model from the local Run-Time Infrastructure (RTI) component. A loosely coupled HLA distributed simulation model and an HLA distributed simulation architecture based on Web Services were proposed. On this basis, a Web Services based RTI (RTI-WS) was designed and a prototype system developed, and its significance and a corresponding performance analysis are given. By using the Simple Object Access Protocol (SOAP) over HTTP, HLA-compatible federates can communicate with the RTI over wide area networks, local area networks and other network types, shielding the communication from firewalls and similar security restrictions. The approach of deploying the RTI and federates as Web Services allows them to be integrated easily over the Internet and provides technical support for the dynamic composition of simulation applications. At the cost of some real-time performance, the Web Services based RTI can greatly improve the reusability and interoperability of HLA distributed simulation. The results show that the Web Services based run-time infrastructure is suitable for coarse-grained distributed simulation applications over wide area networks.
GPU-based high performance Monte Carlo simulation in neutron transport
Energy Technology Data Exchange (ETDEWEB)
Heimlich, Adino; Mol, Antonio C.A.; Pereira, Claudio M.N.A. [Instituto de Engenharia Nuclear (IEN/CNEN-RJ), Rio de Janeiro, RJ (Brazil). Lab. de Inteligencia Artificial Aplicada], e-mail: cmnap@ien.gov.br
2009-07-01
Graphics Processing Units (GPU) are high performance co-processors intended, originally, to improve the use and quality of computer graphics applications. Since researchers and practitioners realized the potential of using GPU for general purpose, their application has been extended to other fields out of computer graphics scope. The main objective of this work is to evaluate the impact of using GPU in neutron transport simulation by Monte Carlo method. To accomplish that, GPU- and CPU-based (single and multicore) approaches were developed and applied to a simple, but time-consuming problem. Comparisons demonstrated that the GPU-based approach is about 15 times faster than a parallel 8-core CPU-based approach also developed in this work. (author)
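As a rough illustration of one-thread-per-history Monte Carlo transport on a GPU, the following sketch simulates neutron histories in a homogeneous 1D slab; the cross sections, geometry and names are illustrative assumptions and do not correspond to the benchmark problem used in the paper.

```cuda
#include <cuda_runtime.h>
#include <curand_kernel.h>

// One thread per neutron history: sample a free path, move, then absorb,
// scatter isotropically, or escape. Transmitted histories are counted atomically.
__global__ void slab_transport(int histories, float slabWidth,
                               float sigmaTotal, float absorbProb,
                               unsigned long long seed,
                               unsigned int* transmitted)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= histories) return;

    curandState rng;
    curand_init(seed, tid, 0, &rng);

    float x  = 0.0f;   // start at the left face
    float mu = 1.0f;   // direction cosine, initially +x
    while (true) {
        float path = -logf(curand_uniform(&rng)) / sigmaTotal;      // free flight
        x += mu * path;
        if (x >= slabWidth) { atomicAdd(transmitted, 1u); break; }  // escaped right
        if (x < 0.0f) break;                                         // escaped left
        if (curand_uniform(&rng) < absorbProb) break;                // absorbed
        mu = 2.0f * curand_uniform(&rng) - 1.0f;                     // isotropic scatter
    }
}
```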
Simulation of isothermal multi-phase fuel-coolant interaction using MPS method with GPU acceleration
Energy Technology Data Exchange (ETDEWEB)
Gou, W.; Zhang, S.; Zheng, Y. [Zhejiang Univ., Hangzhou (China). Center for Engineering and Scientific Computation
2016-07-15
The energetic fuel-coolant interaction (FCI) has been one of the primary safety concerns in nuclear power plants. Graphical processing unit (GPU) implementation of the moving particle semi-implicit (MPS) method is presented and used to simulate the fuel coolant interaction problem. The governing equations are discretized with the particle interaction model of MPS. Detailed implementation on single-GPU is introduced. The three-dimensional broken dam is simulated to verify the developed GPU acceleration MPS method. The proposed GPU acceleration algorithm and developed code are then used to simulate the FCI problem. As a summary of results, the developed GPU-MPS method showed a good agreement with the experimental observation and theoretical prediction.
Semi-automatic tool to ease the creation and optimization of GPU programs
DEFF Research Database (Denmark)
Jepsen, Jacob
2014-01-01
We present a tool that reduces the development time of GPU-executable code. We implement a catalogue of common optimizations specific to the GPU architecture. Through the tool, the programmer can semi-automatically transform a computationally-intensive code section into GPU-executable form and apply optimizations thereto. Based on experiments, the code generated by the tool can be 3-256X faster than code generated by an OpenACC compiler, 4-37X faster than optimized CPU code, and attain up to 25% of peak performance of the GPU. We found that by using pattern-matching rules, many of the transformations can be performed automatically, which makes the tool usable for both novices and experts in GPU programming.
Tanner, David E; Phillips, James C; Schulten, Klaus
2012-07-10
Molecular dynamics methodologies comprise a vital research tool for structural biology. Molecular dynamics has benefited from technological advances in computing, such as multi-core CPUs and graphics processing units (GPUs), but harnessing the full power of hybrid GPU/CPU computers remains difficult. The generalized Born/solvent-accessible surface area implicit solvent model (GB/SA) stands to benefit from hybrid GPU/CPU computers, employing the GPU for the GB calculation and the CPU for the SA calculation. Here, we explore the computational challenges facing GB/SA calculations on hybrid GPU/CPU computers and demonstrate how NAMD, a parallel molecular dynamics program, is able to efficiently utilize GPUs and CPUs simultaneously for fast GB/SA simulations. The hybrid computation principles demonstrated here are generally applicable to parallel applications employing hybrid GPU/CPU calculations.
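The general pattern of overlapping GPU and CPU work described above can be sketched as follows; this illustrates the principle only and is not NAMD's GB/SA implementation, and the kernel and function names are hypothetical stand-ins.

```cuda
#include <cuda_runtime.h>

// Stand-in for the GPU share of the work (e.g. the GB term).
__global__ void gpu_part(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}

// Stand-in for the CPU share of the work (e.g. the SA term).
void cpu_part(const float* in, float* out, int n)
{
    for (int i = 0; i < n; ++i) out[i] = in[i] + 1.0f;
}

// Launch the GPU kernel asynchronously in a stream, do the CPU work while the
// kernel runs, then synchronize before combining the two contributions.
void hybrid_step(const float* d_in, float* d_out,
                 const float* h_in, float* h_out, int n)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    gpu_part<<<(n + 255) / 256, 256, 0, stream>>>(d_in, d_out, n);  // async on GPU
    cpu_part(h_in, h_out, n);                                       // CPU works meanwhile

    cudaStreamSynchronize(stream);   // both halves are done; combine results here
    cudaStreamDestroy(stream);
}
```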
High-level GPU computing with jacket for MATLAB and C/C++
Pryor, Gallagher; Lucey, Brett; Maddipatla, Sandeep; McClanahan, Chris; Melonakos, John; Venugopalakrishnan, Vishwanath; Patel, Krunal; Yalamanchili, Pavan; Malcolm, James
2011-06-01
We describe a software platform for the rapid development of general purpose GPU (GPGPU) computing applications within the MATLAB computing environment, C, and C++: Jacket. Jacket provides thousands of GPU-tuned function syntaxes within MATLAB, C, and C++, including linear algebra, convolutions, reductions, and FFTs as well as signal, image, statistics, and graphics libraries. Additionally, Jacket includes a compiler that translates MATLAB and C++ code to CUDA PTX assembly and OpenGL shaders on demand at runtime. A facility is also included to compile a domain specific version of the MATLAB language to CUDA assembly at build time. Jacket includes the first parallel GPU FOR-loop construction and the first profiler for comparative analysis of CPU and GPU execution times. Jacket provides full GPU compute capability on CUDA hardware and limited, image processing focused compute on OpenGL/ES (2.0 and up) devices for mobile and embedded applications.
Liu, Yongchao; Schmidt, Bertil; Maskell, Douglas L
2011-03-29
Next-generation sequencing technologies have led to the high-throughput production of sequence data (reads) at low cost. However, these reads are significantly shorter and more error-prone than conventional Sanger shotgun reads. This poses a challenge for the de novo assembly in terms of assembly quality and scalability for large-scale short read datasets. We present DecGPU, the first parallel and distributed error correction algorithm for high-throughput short reads (HTSRs) using a hybrid combination of CUDA and MPI parallel programming models. DecGPU provides CPU-based and GPU-based versions, where the CPU-based version employs coarse-grained and fine-grained parallelism using the MPI and OpenMP parallel programming models, and the GPU-based version takes advantage of the CUDA and MPI parallel programming models and employs a hybrid CPU+GPU computing model to maximize the performance by overlapping the CPU and GPU computation. The distributed feature of our algorithm makes it feasible and flexible for the error correction of large-scale HTSR datasets. Using simulated and real datasets, our algorithm demonstrates superior performance, in terms of error correction quality and execution speed, to the existing error correction algorithms. Furthermore, when combined with Velvet and ABySS, the resulting DecGPU-Velvet and DecGPU-ABySS assemblers demonstrate the potential of our algorithm to improve de novo assembly quality for de-Bruijn-graph-based assemblers. DecGPU is publicly available open-source software, written in CUDA C++ and MPI. The experimental results suggest that DecGPU is an effective and feasible error correction algorithm to tackle the flood of short reads produced by next-generation sequencing technologies.
MRISIMUL: a GPU-based parallel approach to MRI simulations.
Xanthis, Christos G; Venetis, Ioannis E; Chalkias, A V; Aletras, Anthony H
2014-03-01
A new step-by-step comprehensive MR physics simulator (MRISIMUL) of the Bloch equations is presented. The aim was to develop a magnetic resonance imaging (MRI) simulator that makes no assumptions with respect to the underlying pulse sequence and also allows for complex large-scale analysis on a single computer without requiring simplifications of the MRI model. We hypothesized that such a simulation platform could be developed with parallel acceleration of the executable core within the graphics processing unit (GPU) environment. MRISIMUL integrates realistic aspects of the MRI experiment from signal generation to image formation and solves the entire complex problem for densely spaced isochromats and for a densely spaced time axis. The simulation platform was developed in MATLAB, whereas the computationally demanding core services were developed in CUDA-C. The MRISIMUL simulator imaged three different computer models: a user-defined phantom, a human brain model and a human heart model. The high computational power of GPU-based simulations was compared against other computer configurations. A speedup of about 228 times was achieved when compared to serially executed C-code on the CPU, whereas a speedup of between 31 and 115 times was achieved when compared to the OpenMP parallel executed C-code on the CPU, depending on the number of threads used in multithreading (2-8 threads). The high performance of MRISIMUL allows its application in large-scale analysis and can bring the computational power of a supercomputer or a large computer cluster to a single GPU personal computer.
Work-Efficient Parallel Skyline Computation for the GPU
DEFF Research Database (Denmark)
Bøgh, Kenneth Sejdenfaden; Chester, Sean; Assent, Ira
2015-01-01
The skyline operator returns records in a dataset that provide optimal trade-offs of multiple dimensions. State-of-the-art skyline computation involves complex tree traversals, data-ordering, and conditional branching to minimize the number of point-to-point comparisons. Meanwhile, GPGPU computing ... a global, static partitioning scheme. With the partitioning, we can permit controlled branching to exploit transitive relationships and avoid most point-to-point comparisons. The result is a non-traditional GPU algorithm, SkyAlign, that prioritizes work-efficiency and respectable throughput, rather than maximal throughput, to achieve orders of magnitude faster performance.
Runtime Analysis of GPU-Based Stereo Matching
Directory of Open Access Journals (Sweden)
Christian Zentner
2015-11-01
This paper elaborates on the possibility of leveraging the highly parallel nature of GPUs to implement more efficient stereo matching algorithms. Different algorithms have been implemented and compared on the CPU and the GPU in order to show the speedup gained by moving the computation to the graphics card. The results were evaluated for accuracy using the test available on the Middlebury stereo vision website. The runtime performance was assessed by a script which examined the runtime behaviour of the individual steps of the stereo matching algorithm.
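As an illustration of the kind of per-pixel parallelism that stereo matching exposes, the following is a hedged sketch of a simple block-matching (SAD) disparity kernel; the window size, disparity range and names are illustrative assumptions and not one of the algorithms evaluated in the paper.

```cuda
#include <cuda_runtime.h>
#include <limits.h>

// For each left-image pixel, scan candidate disparities and keep the one with
// the smallest sum of absolute differences (SAD) over a small window.
#define WIN   3   // window radius -> 7x7 window
#define MAXD 64   // disparity search range

__global__ void sad_disparity(const unsigned char* left, const unsigned char* right,
                              unsigned char* disp, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < WIN + MAXD || x >= width - WIN || y < WIN || y >= height - WIN) return;

    int bestD = 0, bestCost = INT_MAX;
    for (int d = 0; d < MAXD; ++d) {
        int cost = 0;
        for (int dy = -WIN; dy <= WIN; ++dy)
            for (int dx = -WIN; dx <= WIN; ++dx) {
                int l = left [(y + dy) * width + (x + dx)];
                int r = right[(y + dy) * width + (x + dx - d)];
                int diff = l - r;
                cost += diff < 0 ? -diff : diff;     // absolute difference
            }
        if (cost < bestCost) { bestCost = cost; bestD = d; }
    }
    disp[y * width + x] = (unsigned char)bestD;
}
```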
Finding Convex Hulls Using Quickhull on the GPU
Tzeng, Stanley
2012-01-01
We present a convex hull algorithm that is accelerated on commodity graphics hardware. We analyze and identify the hurdles of writing a recursive divide-and-conquer algorithm on the GPU and devise a framework for representing this class of problems. Our framework transforms the recursive splitting step into a permutation step that is well-suited for graphics hardware. Our convex hull algorithm of choice is Quickhull. Our parallel Quickhull implementation (for both 2D and 3D cases) achieves an order of magnitude speedup over standard computational geometry libraries.
A Modular Framework for Deformation and Fracture using GPU Shaders
Morris, D J; Anderson, Eike F.; Peters, C.
2012-01-01
Advances in the graphical realism of modern video games have been achieved mainly through the development of the GPU (Graphics Processing Unit), providing a dedicated graphics co-processor and framebuffer. The most recent GPUs are extremely capable and so flexible that it is now possible to implement a wide range of algorithms on graphics hardware that were previously confined to the computer's CPU (Central Processing Unit). We present a modular framework for real-time...
High performance GPU processing for inversion using uniform grid searches
Venetis, Ioannis E.; Saltogianni, Vasso; Stiros, Stathis; Gallopoulos, Efstratios
2017-04-01
Many geophysical problems are described by redundant, highly non-linear systems of equations with constant terms deriving from measurements and hence representing stochastic variables. Solution (inversion) of such problems relies on numerical optimization methods, based either on Monte Carlo sampling or on exhaustive searches in cases of two or even three "free" unknown variables. Recently the TOPological INVersion (TOPINV) algorithm, a grid search-based technique in the R^n space, has been proposed. TOPINV is not based on the minimization of a certain cost function and involves only forward computations, hence avoiding computational errors. The basic concept is to transform observation equations into inequalities on the basis of an optimization parameter k and of their standard errors, and through repeated "scans" of n-dimensional search grids for decreasing values of k to identify the optimal clusters of gridpoints which satisfy the observation inequalities and by definition contain the "true" solution. Stochastic optimal solutions and their variance-covariance matrices are then computed as first and second statistical moments. Such exhaustive uniform searches produce an excessive computational load and are extremely time consuming on common CPU-based computers. An alternative is to use a computing platform based on a GPU, which nowadays is affordable to the research community and provides much higher computing performance. Using the CUDA programming language to implement TOPINV allows the investigation of the attained speedup in execution time on such a high performance platform. Based on synthetic data, we compared the execution time required for two typical geophysical problems, modeling magma sources and seismic faults, described with up to 18 unknown variables, on both CPU/FORTRAN and GPU/CUDA platforms. The same problems, for several different sizes of search grids (up to 10^12 gridpoints) and numbers of unknown variables, were solved on
A GPU code for analytic continuation through a sampling method
Nordström, Johan; Schött, Johan; Locht, Inka L. M.; Di Marco, Igor
We here present a code for performing analytic continuation of fermionic Green's functions and self-energies as well as bosonic susceptibilities on a graphics processing unit (GPU). The code is based on the sampling method introduced by Mishchenko et al. (2000), and is written for the widely used CUDA platform from NVidia. Detailed scaling tests are presented, for two different GPUs, in order to highlight the advantages of this code with respect to standard CPU computations. Finally, as an example of possible applications, we provide the analytic continuation of model Gaussian functions, as well as more realistic test cases from many-body physics.
An evaluation of GPU acceleration for sparse reconstruction
Braun, Thomas R.
2010-04-01
Image processing applications typically parallelize well. This gives a developer interested in data throughput several different implementation options, including multiprocessor machines, general purpose computation on the graphics processor, and custom gate-array designs. Herein, we will investigate these first two options for dictionary learning and sparse reconstruction, specifically focusing on the K-SVD algorithm for dictionary learning and the Batch Orthogonal Matching Pursuit for sparse reconstruction. These methods have been shown to provide state of the art results for image denoising, classification, and object recognition. We'll explore the GPU implementation and show that GPUs are not significantly better or worse than CPUs for this application.
Parallelized Local Volatility Estimation Using GP-GPU Hardware Acceleration
Douglas, Craig C.
2010-01-01
We introduce an inverse problem for the local volatility model in option pricing. We solve the problem using the Levenberg-Marquardt algorithm and use the notion of the Fréchet derivative when calculating the Jacobian matrix. We analyze the existence of the Fréchet derivative and its numerical computation. To reduce the computational time of the inverse problem, a GP-GPU environment is considered for parallel computation. Numerical results confirm the validity and efficiency of the proposed method. ©2010 IEEE.
GPU-Based Techniques for Global Illumination Effects
Szirmay-Kalos, László; Sbert, Mateu
2008-01-01
This book presents techniques to render photo-realistic images by programming the Graphics Processing Unit (GPU). We discuss effects such as mirror reflections, refractions, caustics, diffuse or glossy indirect illumination, radiosity, single or multiple scattering in participating media, tone reproduction, glow, and depth of field. This book targets game developers, graphics programmers, and also students with some basic understanding of computer graphics algorithms, rendering APIs like Direct3D or OpenGL, and shader programming. In order to make this book self-contained, the most important c
Institute of Scientific and Technical Information of China (English)
文志诚; 李长云; 满君丰
2012-01-01
In an open environment, diagnosing software faults at run time is essential in practical applications. A good approach is to build a hidden Markov model (HMM) relating external features of the running software to all possible faults. By collecting run-time values of these external features, we can diagnose software faults that are not easily found by general methods. In the HMM-building process, this paper proposes a so-called "3σ principle" for discretizing continuous random variables, which has two unique advantages: in discretizing the variables and in determining the prior parameters of the hidden Markov model. The 3σ principle reflects the actual behavior of running software, is easy to apply, and has a rigorous theoretical foundation. The paper also presents a practical application case in an open network environment. The experiments show that the proposed method has unique advantages over other methods in diagnosing run-time software faults.
Ji, Zhe; Xu, Fei; Takahashi, Akiyuki; Sun, Yu
2016-12-01
In this paper, a Weakly Compressible Smoothed Particle Hydrodynamics (WCSPH) framework is presented utilizing the parallel architecture of single- and multi-GPU (Graphics Processing Unit) platforms. The program is developed for water entry simulations, where an efficient potential-based contact force is introduced to handle the interaction between fluid and solid particles. The single-GPU SPH scheme is implemented with a series of optimizations to achieve high performance. To go beyond the memory limitation of a single GPU, the scheme is further extended to multi-GPU platforms based on an improved 3D domain decomposition and inter-node data communication strategy. A typical benchmark test of wedge entry is investigated in varied dimensions and scales to validate the accuracy and efficiency of the program. The results of 2D and 3D benchmark tests show great consistency with the experiment and better accuracy than other numerical models. The performance of the single-GPU code is assessed by comparison with serial and parallel CPU codes. The improvement of the domain decomposition strategy is verified, and a study on the scalability and efficiency of the multi-GPU code is carried out by simulating tests of varied scales on different numbers of GPUs. Lastly, the single- and multi-GPU codes are further compared with existing state-of-the-art SPH parallel frameworks for a comprehensive assessment.
GPU accelerated simulations of bluff body flows using vortex particle methods
Rossinelli, Diego; Bergdorf, Michael; Cottet, Georges-Henri; Koumoutsakos, Petros
2010-05-01
We present a GPU accelerated solver for simulations of bluff body flows in 2D using a remeshed vortex particle method and the vorticity formulation of the Brinkman penalization technique to enforce boundary conditions. The efficiency of the method relies on fast and accurate particle-grid interpolations on GPUs for the remeshing of the particles and the computation of the field operators. The GPU implementation uses OpenGL so as to perform efficient particle-grid operations and a CUFFT-based solver for the Poisson equation with unbounded boundary conditions. The accuracy and performance of the GPU simulations and their relative advantages/drawbacks over CPU based computations are reported in simulations of flows past an impulsively started circular cylinder from Reynolds numbers between 40 and 9500. The results indicate up to two orders of magnitude speed up of the GPU implementation over the respective CPU implementations. The accuracy of the GPU computations depends on the Re number of the flow. For Re up to 1000 there is little difference between GPU and CPU calculations but this agreement deteriorates (albeit remaining to within 5% in drag calculations) for higher Re numbers as the single precision of the GPU adversely affects the accuracy of the simulations.
Accelerating Spaceborne SAR Imaging Using Multiple CPU/GPU Deep Collaborative Computing.
Zhang, Fan; Li, Guojun; Li, Wei; Hu, Wei; Hu, Yuxin
2016-04-07
With the development of synthetic aperture radar (SAR) technologies in recent years, the huge amount of remote sensing data brings challenges for real-time imaging processing. Therefore, high performance computing (HPC) methods have been presented to accelerate SAR imaging, especially the GPU based methods. In the classical GPU based imaging algorithm, GPU is employed to accelerate image processing by massive parallel computing, and CPU is only used to perform the auxiliary work such as data input/output (IO). However, the computing capability of CPU is ignored and underestimated. In this work, a new deep collaborative SAR imaging method based on multiple CPU/GPU is proposed to achieve real-time SAR imaging. Through the proposed tasks partitioning and scheduling strategy, the whole image can be generated with deep collaborative multiple CPU/GPU computing. In the part of CPU parallel imaging, the advanced vector extension (AVX) method is firstly introduced into the multi-core CPU parallel method for higher efficiency. As for the GPU parallel imaging, not only the bottlenecks of memory limitation and frequent data transferring are broken, but also kinds of optimized strategies are applied, such as streaming, parallel pipeline and so on. Experimental results demonstrate that the deep CPU/GPU collaborative imaging method enhances the efficiency of SAR imaging on single-core CPU by 270 times and realizes the real-time imaging in that the imaging rate outperforms the raw data generation rate.
Accelerating Spaceborne SAR Imaging Using Multiple CPU/GPU Deep Collaborative Computing
Directory of Open Access Journals (Sweden)
Fan Zhang
2016-04-01
Full Text Available With the development of synthetic aperture radar (SAR technologies in recent years, the huge amount of remote sensing data brings challenges for real-time imaging processing. Therefore, high performance computing (HPC methods have been presented to accelerate SAR imaging, especially the GPU based methods. In the classical GPU based imaging algorithm, GPU is employed to accelerate image processing by massive parallel computing, and CPU is only used to perform the auxiliary work such as data input/output (IO. However, the computing capability of CPU is ignored and underestimated. In this work, a new deep collaborative SAR imaging method based on multiple CPU/GPU is proposed to achieve real-time SAR imaging. Through the proposed tasks partitioning and scheduling strategy, the whole image can be generated with deep collaborative multiple CPU/GPU computing. In the part of CPU parallel imaging, the advanced vector extension (AVX method is firstly introduced into the multi-core CPU parallel method for higher efficiency. As for the GPU parallel imaging, not only the bottlenecks of memory limitation and frequent data transferring are broken, but also kinds of optimized strategies are applied, such as streaming, parallel pipeline and so on. Experimental results demonstrate that the deep CPU/GPU collaborative imaging method enhances the efficiency of SAR imaging on single-core CPU by 270 times and realizes the real-time imaging in that the imaging rate outperforms the raw data generation rate.
CPU-GPU hybrid accelerating the Zuker algorithm for RNA secondary structure prediction applications.
Lei, Guoqing; Dou, Yong; Wan, Wen; Xia, Fei; Li, Rongchun; Ma, Meng; Zou, Dan
2012-01-01
Prediction of ribonucleic acid (RNA) secondary structure remains one of the most important research areas in bioinformatics. The Zuker algorithm is one of the most popular methods of free energy minimization for RNA secondary structure prediction. Thus far, few studies have been reported on the acceleration of the Zuker algorithm on general-purpose processors or on extra accelerators such as Field Programmable Gate-Arrays (FPGA) and Graphics Processing Units (GPU). To the best of our knowledge, no implementation combines both CPU and extra accelerators, such as GPUs, to accelerate Zuker algorithm applications. In this paper, a CPU-GPU hybrid computing system that accelerates Zuker algorithm applications for RNA secondary structure prediction is proposed. The computing tasks are allocated between CPU and GPU for parallel cooperative execution. Performance differences between the CPU and the GPU in the task-allocation scheme are considered to obtain workload balance. To improve the hybrid system performance, the Zuker algorithm is optimally implemented with methods specific to the CPU and GPU architectures. A speedup of 15.93× over an optimized multi-core SIMD CPU implementation and a performance advantage of 16% over an optimized GPU implementation are shown in the experimental results. More than 14% of the sequences are executed on the CPU in the hybrid system. The system combining CPU and GPU to accelerate the Zuker algorithm is proven to be promising and can be applied to other bioinformatics applications.
GPU accelerated cell-based adaptive mesh refinement on unstructured quadrilateral grid
Luo, Xisheng; Wang, Luying; Ran, Wei; Qin, Fenghua
2016-10-01
A GPU accelerated inviscid flow solver is developed on an unstructured quadrilateral grid in the present work. For the first time, the cell-based adaptive mesh refinement (AMR) is fully implemented on GPU for the unstructured quadrilateral grid, which greatly reduces the frequency of data exchange between GPU and CPU. Specifically, the AMR is processed with atomic operations to parallelize list operations, and null memory recycling is realized to improve the efficiency of memory utilization. It is found that results obtained by GPUs agree very well with the exact or experimental results in literature. An acceleration ratio of 4 is obtained between the parallel code running on the old GPU GT9800 and the serial code running on E3-1230 V2. With the optimization of configuring a larger L1 cache and adopting Shared Memory based atomic operations on the newer GPU C2050, an acceleration ratio of 20 is achieved. The parallelized cell-based AMR processes have achieved 2x speedup on GT9800 and 18x on Tesla C2050, which demonstrates that parallel running of the cell-based AMR method on GPU is feasible and efficient. Our results also indicate that the new development of GPU architecture benefits the fluid dynamics computing significantly.
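To illustrate the kind of atomic list operation mentioned above, the following is a hedged sketch of building, in parallel, a compact list of cells flagged for refinement with an atomic counter; the refinement criterion, arrays and names are illustrative assumptions, and the real cell-based AMR machinery (connectivity update, memory recycling) is considerably more involved.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Each thread tests the refinement criterion for one cell and, if it fires,
// atomically appends the cell index to a compact refinement list.
__global__ void flag_cells(const float* density, const float* gradDensity,
                           int nCells, float threshold,
                           int* refineList, int* refineCount)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= nCells) return;

    // refine where the density gradient is large relative to the local density
    if (gradDensity[c] > threshold * fabsf(density[c])) {
        int slot = atomicAdd(refineCount, 1);   // reserve a slot in the list
        refineList[slot] = c;
    }
}
```

The resulting list can then be processed by a second kernel that subdivides the flagged cells, which is the parallelized list-operation pattern the abstract refers to.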
GPU-accelerated Monte Carlo simulation of particle coagulation based on the inverse method
Wei, J.; Kruis, F. E.
2013-09-01
Simulating particle coagulation using Monte Carlo methods is in general a challenging computational task due to its numerical complexity and computing cost. Currently, the lowest computing costs are obtained when applying a graphics processing unit (GPU), originally developed for speeding up graphics processing in the consumer market. In this article we present an implementation accelerating a Monte Carlo method based on the Inverse scheme for simulating particle coagulation on the GPU. The abundant data parallelism embedded within the Monte Carlo method is explained, as it allows an efficient parallelization of the MC code on the GPU. Furthermore, the computational accuracy of the MC on GPU was validated against a benchmark, a CPU-based discrete-sectional method. To evaluate the performance gains of using the GPU, the computing time on the GPU was compared against its sequential counterpart on the CPU. The measured speedups show that the GPU can accelerate the execution of the MC code by a factor of 10-100, depending on the chosen number of simulation particles. The algorithm shows a linear dependence of computing time on the number of simulation particles, which is a remarkable result in view of the n^2 dependence of coagulation.
Application of GPU processing for Brownian particle simulation
Cheng, Way Lee; Sheharyar, Ali; Sadr, Reza; Bouhali, Othmane
2015-01-01
Reports on the anomalous thermal-fluid properties of nanofluids (dilute suspension of nano-particles in a base fluid) have been the subject of attention for 15 years. The underlying physics that govern nanofluid behavior, however, is not fully understood and is a subject of much dispute. The interactions between the suspended particles and the base fluid have been cited as a major contributor to the improvement in heat transfer reported in the literature. Numerical simulations are instrumental in studying the behavior of nanofluids. However, such simulations can be computationally intensive due to the small dimensions and complexity of these problems. In this study, a simplified computational approach for isothermal nanofluid simulations was applied, and simulations were conducted using both conventional CPU and parallel GPU implementations. The GPU implementations significantly improved the computational performance, in terms of the simulation time, by a factor of 1000-2500. The results of this investigation show that, as the computational load increases, the simulation efficiency approaches a constant. At a very high computational load, the amount of improvement may even decrease due to limited system memory.
GPU Accelerated Spectral Element Methods: 3D Euler equations
Abdi, D. S.; Wilcox, L.; Giraldo, F.; Warburton, T.
2015-12-01
A GPU accelerated nodal discontinuous Galerkin method for the solution of the three-dimensional Euler equations is presented. The Euler equations are nonlinear hyperbolic equations that are widely used in Numerical Weather Prediction (NWP). Therefore, acceleration of the method plays an important practical role, not only in getting daily forecasts faster but also in obtaining more accurate (high resolution) results. The equation sets used in our atmospheric model NUMA (non-hydrostatic unified model of the atmosphere) take into consideration non-hydrostatic effects that become more important at high resolution. We use algorithms suitable for the single instruction multiple thread (SIMT) architecture of GPUs to accelerate the solution by an order of magnitude (20x) relative to the CPU implementation. For portability to heterogeneous computing environments, we use a new programming language, OCCA, which can be cross-compiled to either OpenCL, CUDA or OpenMP at runtime. Finally, the accuracy and performance of our GPU implementations are verified using several benchmark problems representative of different scales of atmospheric dynamics.
Electromagnetic metamaterial simulations using a GPU-accelerated FDTD method
Seok, Myung-Su; Lee, Min-Gon; Yoo, SeokJae; Park, Q.-Han
2015-12-01
Metamaterials composed of artificial subwavelength structures exhibit extraordinary properties that cannot be found in nature. Designing artificial structures having exceptional properties plays a pivotal role in current metamaterial research. We present a new numerical simulation scheme for metamaterial research. The scheme is based on a graphic processing unit (GPU)-accelerated finite-difference time-domain (FDTD) method. The FDTD computation can be significantly accelerated when GPUs are used instead of only central processing units (CPUs). We explain how the fast FDTD simulation of large-scale metamaterials can be achieved through communication optimization in a heterogeneous CPU/GPU-based computer cluster. Our method also includes various advanced FDTD techniques: the non-uniform grid technique, the total-field/scattered-field (TFSF) technique, the auxiliary field technique for dispersive materials, the running discrete Fourier transform, and the complex structure setting. We demonstrate the power of our new FDTD simulation scheme by simulating the negative refraction of light in a coaxial waveguide metamaterial.
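For readers unfamiliar with how an FDTD time step maps onto a GPU, a minimal one-dimensional sketch is given below. It is only an illustration of the general update pattern, assuming normalized update coefficients, and is not the authors' cluster code (which additionally uses non-uniform grids, TFSF sources, dispersive-material auxiliary fields, and multi-GPU communication); each thread owns one grid point per update.

  // One leapfrog step of a 1-D Yee scheme: H is updated from E, then E from H.
  __global__ void updateH(float *Hy, const float *Ez, int n, float ch)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n - 1)
          Hy[i] += ch * (Ez[i + 1] - Ez[i]);
  }

  __global__ void updateE(float *Ez, const float *Hy, int n, float ce)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i > 0 && i < n)
          Ez[i] += ce * (Hy[i] - Hy[i - 1]);
  }

Because each thread reads only its own point and one neighbour, the kernels are memory-bandwidth bound, which is why large FDTD grids benefit so strongly from the GPU's memory throughput.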
CUDAICA: GPU Optimization of Infomax-ICA EEG Analysis
Directory of Open Access Journals (Sweden)
Federico Raimondo
2012-01-01
Full Text Available In recent years, Independent Component Analysis (ICA) has become a standard to identify relevant dimensions of the data in neuroscience. ICA is a very reliable method to analyze data, but it is computationally very costly, which makes its use for online data analysis, as needed in brain-computer interfaces, almost completely prohibitive. We show an increase in the speed of ICA of about 25 fold at almost no cost (a fast video card). The EEG data, which is a repetition of many independent signals in multiple channels, is very suitable for processing using the vector processors included in graphics units. We profiled the implementation of this algorithm and detected the two main types of operations responsible for the processing bottleneck, taking almost 80% of the computing time: vector-matrix and matrix-matrix multiplications. Simply replacing calls to basic linear algebra functions with the standard CUBLAS routines provided by GPU manufacturers does not increase performance, due to CUDA kernel launch overhead. Instead, we developed a GPU-based solution that, compared with the original BLAS and CUBLAS versions, obtains a 25x increase in performance for the ICA calculation.
Implementation of meso-scale radioactive dispersion model for GPU
Energy Technology Data Exchange (ETDEWEB)
Sunarko [National Nuclear Energy Agency of Indonesia (BATAN), Jakarta (Indonesia). Nuclear Energy Assessment Center; Suud, Zaki [Bandung Institute of Technology (ITB), Bandung (Indonesia). Physics Dept.
2017-05-15
Lagrangian Particle Dispersion Method (LPDM) is applied to model atmospheric dispersion of radioactive material on a meso-scale of a few tens of kilometers for site study purposes. Empirical relationships are used to determine the dispersion coefficient for various atmospheric stabilities. A diagnostic 3-D wind field is solved based on data from one meteorological station using the mass-conservation principle. Particles representing the radioactive pollutant are released into the wind field from a point source. The time-integrated air concentration is calculated using a kernel density estimator (KDE) in the lowest layer of the atmosphere. Parallel code is developed for a GTX-660Ti GPU with a total of 1 344 scalar processors using CUDA. A test of a 1-hour release shows that linear speedup is achieved starting at 28 800 particles per hour (pph), reaching about 20x at 144 000 pph. Another test simulating a 6-hour release with 36 000 pph resulted in a speedup of about 60x. Statistical analysis reveals that the resulting grid doses are nearly identical in the CPU and GPU versions of the code.
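As a rough illustration of the kernel density estimator step described above, each particle can deposit a kernel-weighted contribution onto the concentration grid with atomic additions. The sketch below is a generic one-dimensional example under assumed names (xp, conc, bandwidth h), not the BATAN code, and uses a Gaussian kernel truncated at three standard deviations.

  __global__ void kdeDeposit(const float *xp, int nPart, float *conc,
                             int nx, float dx, float h)
  {
      int p = blockIdx.x * blockDim.x + threadIdx.x;
      if (p >= nPart) return;
      int centre = (int)(xp[p] / dx);
      int radius = (int)(3.0f * h / dx) + 1;              // truncate the kernel at 3 sigma
      for (int i = centre - radius; i <= centre + radius; ++i) {
          if (i < 0 || i >= nx) continue;
          float u = (i * dx - xp[p]) / h;
          float w = expf(-0.5f * u * u) / (h * 2.5066283f);   // Gaussian kernel, 1/sqrt(2*pi)
          atomicAdd(&conc[i], w);                         // concurrent particles may hit the same cell
      }
  }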
Porting AMG2013 to Heterogeneous CPU+GPU Nodes
Energy Technology Data Exchange (ETDEWEB)
Samfass, Philipp [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
2017-01-26
LLNL's future advanced technology system SIERRA will feature heterogeneous compute nodes that consist of IBM POWER9 CPUs and NVIDIA Volta GPUs. Conceptually, the motivation for such an architecture is quite straightforward: while GPUs are optimized for throughput on massively parallel workloads, CPUs strive to minimize latency for rather sequential operations. Yet, making optimal use of heterogeneous architectures raises new challenges for the development of scalable parallel software, e.g., with respect to work distribution. Porting LLNL's parallel numerical libraries to upcoming heterogeneous CPU+GPU architectures is therefore a critical factor for ensuring LLNL's future success in fulfilling its national mission. One of these libraries, called HYPRE, provides parallel solvers and preconditioners for large, sparse linear systems of equations. In the context of this internship project, I consider AMG2013, which is a proxy application for major parts of HYPRE that implements a benchmark for setting up and solving different systems of linear equations. In the following, I describe in detail how I ported multiple parts of AMG2013 to the GPU (Section 2) and present results for different experiments that demonstrate a successful parallel implementation on the heterogeneous machines surface and ray (Section 3). In Section 4, I give guidelines on how my code should be used. Finally, I conclude and give an outlook for future work (Section 5).
GPU-accelerated visualization of protein dynamics in ribbon mode
Wahle, Manuel; Birmanns, Stefan
2011-01-01
Proteins are biomolecules present in living organisms and essential for carrying out vital functions. Inherent to their functioning is folding into different spatial conformations, and to understand these processes it is crucial to visually explore the structural changes. In recent years, significant advancements in experimental techniques and novel algorithms for post-processing of protein data have routinely revealed static and dynamic structures of increasing size. In turn, interactive visualization of the systems and their transitions has become more challenging. Much research on the efficient display of protein dynamics has therefore been done, with the focus on space-filling models; for the important class of abstract ribbon or cartoon representations, however, only a few methods for efficient rendering exist. Yet these models are of high interest to scientists, as they provide a compact and concise description of the structure elements along the protein main chain. In this work, a method was developed to speed up ribbon and cartoon visualizations. Separating the calculation of geometry into two phases allows computational work to be offloaded from the CPU to the GPU. The first phase consists of computing a smooth curve along the protein's main chain on the CPU. In the second phase, conducted independently by the GPU, vertices along that curve are moved to set up the final geometrical representation of the molecule.
Large Data Visualization on Distributed Memory Multi-GPU Clusters
Energy Technology Data Exchange (ETDEWEB)
Childs, Henry R.
2010-03-01
Data sets of immense size are regularly generated on large scale computing resources. Even among more traditional methods for acquisition of volume data, such as MRI and CT scanners, data which is too large to be effectively visualized on standard workstations is now commonplace. One solution to this problem is to employ a 'visualization cluster,' a small to medium scale cluster dedicated to performing visualization and analysis of massive data sets generated on larger scale supercomputers. These clusters are designed to fit a different need than traditional supercomputers, and therefore their design mandates different hardware choices, such as increased memory, and more recently, graphics processing units (GPUs). While there has been much previous work on distributed memory visualization as well as GPU visualization, there is a relative dearth of algorithms which effectively use GPUs at a large scale in a distributed memory environment. In this work, we study a common visualization technique in a GPU-accelerated, distributed memory setting, and present performance characteristics when scaling to extremely large data sets.
CUDAICA: GPU optimization of Infomax-ICA EEG analysis.
Raimondo, Federico; Kamienkowski, Juan E; Sigman, Mariano; Fernandez Slezak, Diego
2012-01-01
In recent years, Independent Component Analysis (ICA) has become a standard to identify relevant dimensions of the data in neuroscience. ICA is a very reliable method to analyze data, but it is computationally very costly, which makes its use for online data analysis, as needed in brain-computer interfaces, almost completely prohibitive. We show an increase in the speed of ICA of about 25 fold at almost no cost (a fast video card). The EEG data, which is a repetition of many independent signals in multiple channels, is very suitable for processing using the vector processors included in graphics units. We profiled the implementation of this algorithm and detected the two main types of operations responsible for the processing bottleneck, taking almost 80% of the computing time: vector-matrix and matrix-matrix multiplications. Simply replacing calls to basic linear algebra functions with the standard CUBLAS routines provided by GPU manufacturers does not increase performance, due to CUDA kernel launch overhead. Instead, we developed a GPU-based solution that, compared with the original BLAS and CUBLAS versions, obtains a 25x increase in performance for the ICA calculation.
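To make the baseline concrete, the sketch below shows how the matrix-matrix products identified above would typically be handed to cuBLAS, the straightforward approach whose per-call kernel launch overhead the authors found limiting for their problem sizes. The matrix names and dimensions are illustrative; the data are assumed to be column-major and already resident in device memory, and the handle is assumed to have been created with cublasCreate.

  #include <cublas_v2.h>

  // C = A * B, with A (m x k), B (k x n), C (m x n) stored column-major on the device.
  void gemmOnGpu(cublasHandle_t handle, const float *dA, const float *dB, float *dC,
                 int m, int n, int k)
  {
      const float alpha = 1.0f, beta = 0.0f;
      cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                  m, n, k, &alpha, dA, m, dB, k, &beta, dC, m);
  }

For the small, repeated multiplications that dominate Infomax-ICA, each such call pays a fixed launch cost, which is the overhead that a fused, purpose-built kernel avoids.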
Meso-Scale Radioactive Dispersion Modelling using GPU
Sunarko; Suud, Zaki
2017-01-01
Lagrangian Particle Dispersion Method (LPDM) is applied to model atmospheric dispersion of radioactive material on a meso-scale of a few tens of kilometers for site study purposes. Empirical relationships are used to determine the dispersion coefficient for various atmospheric stabilities. A diagnostic 3-D wind field is created based on data from a meteorological station using the mass-conservation principle. Particles imitating the radioactive pollutant are released into the wind field from a point source. The time-integrated air concentration is calculated using a kernel density estimator (KDE) in the lowest layer of the atmosphere. Parallel code is developed for a GTX-660Ti GPU with a total of 1344 scalar processors using CUDA. A significant speedup of about 20 times is achieved compared to the serial version of the code, while accuracy is kept at a reasonable level. Only small differences in particle positions and grid doses are observed when using the same sets of random numbers and meteorological data in the CPU and GPU versions of the code.
GPU implementation of the simplex identification via split augmented Lagrangian
Sevilla, Jorge; Nascimento, José M. P.
2015-10-01
Hyperspectral imaging can be used for object detection and for discriminating between different objects based on their spectral characteristics. One of the main problems of hyperspectral data analysis is the presence of mixed pixels, due to the low spatial resolution of such images. This means that several spectrally pure signatures (endmembers) are combined into the same mixed pixel. Linear spectral unmixing follows an unsupervised approach which aims at inferring pure spectral signatures and their material fractions at each pixel of the scene. The huge data volumes acquired by such sensors put stringent requirements on processing and unmixing methods. This paper proposes an efficient implementation of an unsupervised linear unmixing method on GPUs using CUDA. The method finds the smallest simplex by solving a sequence of nonsmooth convex subproblems, using variable splitting to obtain a constrained formulation and then applying an augmented Lagrangian technique. The parallel implementation of SISAL presented in this work exploits the GPU architecture at a low level, using shared memory and coalesced accesses to memory. The results presented herein indicate that the GPU implementation can significantly accelerate the method's execution over big datasets while maintaining the method's accuracy.
Molecular dynamics simulations through GPU video games technologies
Loukatou, Styliani; Papageorgiou, Louis; Fakourelis, Paraskevas; Filntisi, Arianna; Polychronidou, Eleftheria; Bassis, Ioannis; Megalooikonomou, Vasileios; Makałowski, Wojciech; Vlachakis, Dimitrios; Kossida, Sophia
2016-01-01
Bioinformatics is the scientific field that focuses on the application of computer technology to the management of biological information. Over the years, bioinformatics applications have been used to store, process and integrate biological and genetic information, using a wide range of methodologies. One of the principal techniques used to understand the physical movements of atoms and molecules is molecular dynamics (MD). MD is an in silico method to simulate the physical motions of atoms and molecules under certain conditions. It has become a strategically important technique and now plays a key role in many areas of the exact sciences, such as chemistry, biology, physics and medicine. Due to their complexity, MD calculations can require enormous amounts of computer memory and time, and their execution has therefore been a major problem. Despite the huge computational cost, molecular dynamics simulations have traditionally been run on conventional computers built around the central processing unit (CPU). Graphics processing unit (GPU) computing technology was first designed with the goal of improving video games, by rapidly creating and displaying images in a frame buffer for output to a screen. The hybrid GPU-CPU implementation, combined with parallel computing, is a novel technology to perform a wide range of calculations. GPUs have been proposed and used to accelerate many scientific computations, including MD simulations. Herein, we describe the methodologies developed initially for video games and how they are now applied in MD simulations. PMID:27525251
Scalable multi-GPU implementation of the MAGFLOW simulator
Directory of Open Access Journals (Sweden)
Giovanni Gallo
2011-12-01
Full Text Available We have developed a robust and scalable multi-GPU (Graphics Processing Unit) version of the cellular-automaton-based MAGFLOW lava simulator. The cellular automaton is partitioned into strips that are assigned to different GPUs, with minimal overlapping. For each GPU, a host thread is launched to manage allocation, deallocation, data transfer and kernel launches; the main host thread coordinates all of the GPUs, to ensure temporal coherence and data integrity. The overlapping borders and maximum temporal step need to be exchanged among the GPUs at the beginning of every evolution of the cellular automaton; data transfers are asynchronous with respect to the computations, to cover the introduced overhead. It is not required to have GPUs of the same speed or capacity; the system runs flawlessly on homogeneous and heterogeneous hardware. The speed-up factor differs from the ideal one (#GPUs×) only by a constant overhead loss of about 4E−2 · T · #GPUs, with T as the total simulation time.
Radial basis function networks GPU-based implementation.
Brandstetter, Andreas; Artusi, Alessandro
2008-12-01
Neural networks (NNs) have been used in several areas, showing their potential but also their limitations. One of the main limitations is the long time required for the training process; this is a drawback when a fast training process is needed to respond to changes in the application domain. A possible way to accelerate the learning process of an NN is to implement it in hardware, but due to the high cost and reduced flexibility compared with the original central processing unit (CPU) implementation, this solution is often not chosen. Recently, the power of the graphics processing unit (GPU) on the market has increased and it has started to be used in many applications. In particular, a kind of NN named the radial basis function network (RBFN) has been used extensively, proving its power. However, its limited time performance reduces its application in many areas. In this brief paper, we describe a GPU implementation of the entire learning process of an RBFN, showing the ability to reduce the computational cost by about two orders of magnitude with respect to its CPU implementation.
True 4D Image Denoising on the GPU
Eklund, Anders; Andersson, Mats; Knutsson, Hans
2011-01-01
The use of image denoising techniques is an important part of many medical imaging applications. One common application is to improve the image quality of low-dose (noisy) computed tomography (CT) data. While 3D image denoising has previously been applied to several volumes independently, not much work has been done on true 4D image denoising, where the algorithm considers several volumes at the same time. The problem with 4D image denoising, compared to 2D and 3D denoising, is that the computational complexity increases exponentially. In this paper we describe a novel algorithm for true 4D image denoising, based on local adaptive filtering, and how to implement it on the graphics processing unit (GPU). The algorithm was applied to a 4D CT heart dataset with a resolution of 512 × 512 × 445 × 20. The result is that the GPU can complete the denoising in about 25 minutes if spatial filtering is used and in about 8 minutes if FFT-based filtering is used. The CPU implementation requires several days of processing time for spatial filtering and about 50 minutes for FFT-based filtering. The short processing time increases the clinical value of true 4D image denoising significantly. PMID:21977020
Optimizing Linpack Benchmark on GPU-Accelerated Petascale Supercomputer
Institute of Scientific and Technical Information of China (English)
Feng Wang; Can-Qun Yang; Yun-Fei Du; Juan Chen; Hui-Zhan Yi; Wei-Xia Xu
2011-01-01
In this paper we present the programming of the Linpack benchmark on the TianHe-1 system, the first petascale supercomputer system of China and the largest GPU-accelerated heterogeneous system attempted at the time. A hybrid programming model consisting of MPI, OpenMP and streaming computing is described to exploit the task-level, thread-level and data-level parallelism of Linpack. We explain how we optimized the load distribution across the CPUs and GPUs using a two-level adaptive method and describe the implementation in detail. To overcome the low bandwidth of CPU-GPU communication, we present a software pipelining technique to hide the communication overhead. Combined with other traditional optimizations, the Linpack we developed achieved 196.7 GFLOPS on a single compute element of TianHe-1. This result is 70.1% of the peak compute capability and 3.3 times faster than the result obtained using the vendor's library. On the full configuration of TianHe-1 our optimizations resulted in a Linpack performance of 0.563 PFLOPS, which made TianHe-1 the 5th fastest supercomputer on the Top500 list in November 2009.
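The software pipelining idea mentioned above, hiding CPU-GPU transfers behind kernel execution, is commonly expressed with CUDA streams. The sketch below is a generic illustration of the pattern rather than the TianHe-1 Linpack code; process() is a placeholder kernel, and the host buffer is assumed to be pinned (allocated with cudaHostAlloc) so that the copies are truly asynchronous.

  __global__ void process(float *d, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) d[i] *= 2.0f;                        // stand-in for the real computation
  }

  // Double-buffered pipeline: while chunk i is processed on one stream,
  // chunk i+1 is copied to the device on the other stream.
  void pipeline(const float *h_in, float *d_buf[2], int nChunks, int chunkElems)
  {
      cudaStream_t s[2];
      cudaStreamCreate(&s[0]);
      cudaStreamCreate(&s[1]);
      for (int i = 0; i < nChunks; ++i) {
          int b = i & 1;
          cudaMemcpyAsync(d_buf[b], h_in + (size_t)i * chunkElems,
                          chunkElems * sizeof(float), cudaMemcpyHostToDevice, s[b]);
          process<<<(chunkElems + 255) / 256, 256, 0, s[b]>>>(d_buf[b], chunkElems);
      }
      cudaStreamSynchronize(s[0]);
      cudaStreamSynchronize(s[1]);
      cudaStreamDestroy(s[0]);
      cudaStreamDestroy(s[1]);
  }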
Commodity CPU-GPU System for Low-Cost , High-Performance Computing
Wang, S.; Zhang, S.; Weiss, R. M.; Barnett, G. A.; Yuen, D. A.
2009-12-01
We have put together a desktop computer system for under 2.5 K dollars from commodity components that consist of one quad-core CPU (Intel Core 2 Quad Q6600 Kentsfield 2.4GHz) and two high end GPUs (nVidia's GeForce GTX 295 and Tesla C1060). A 1200 watt power supply is required. On this commodity system, we have constructed an easy-to-use hybrid computing environment, in which the Message Passing Interface (MPI) is used for managing the workloads, for transferring the data among different GPU devices, and for minimizing the need for CPU memory. Test runs using the MAGMA (Matrix Algebra on GPU and Multicore Architectures) library show that the speed-ups for double precision calculations can be greater than 10 (GPU vs. CPU) and are even larger (> 20) for single precision calculations. In addition we have enabled the combination of Matlab with CUDA for interactive visualization through MPI, i.e., two GPU devices are used for simulation and one GPU device is used for visualizing the computing results as the simulation proceeds. Our experience with this commodity system has shown that running multiple applications on one GPU device or running one application across multiple GPU devices can be done as conveniently as on CPUs. With NVIDIA CEO Jen-Hsun Huang's claim that over the next 6 years GPU processing power will increase by 570x compared to the 3x for CPUs, future low-cost commodity computers such as ours may be a remedy for the long wait queues of the world's supercomputers, especially for small- and mid-scale computation. Our goal here is to explore the limits and capabilities of this emerging technology and to get ourselves ready to run large-scale simulations on the next generation of computing environments, which we believe will hybridize CPU and GPU architectures.
A SURVEY PAPER ON SOLVING TSP USING ANT COLONY OPTIMIZATION ON GPU
Directory of Open Access Journals (Sweden)
Khushbu Khatri
2015-10-01
Full Text Available Ant Colony Optimization (ACO) is a meta-heuristic algorithm inspired by nature to solve many combinatorial optimization problems, such as the Travelling Salesman Problem (TSP). There are many versions of ACO used to solve TSP, such as the Ant System, Elitist Ant System, Max-Min Ant System, and Rank-based Ant System algorithms. For improved performance, these methods can be implemented on parallel architectures such as the GPU with the CUDA architecture. The Graphics Processing Unit (GPU) provides a highly parallel and fully programmable platform. GPUs, which have many processing units with an off-chip global memory, can be used for general purpose parallel computation. This paper presents a survey of different approaches to solving TSP using ACO on the GPU.
GPU computing for 2-d spin systems: CUDA vs OpenGL
Anselmi, V; Di Renzo, F
2008-01-01
In recent years the increasingly powerful GPUs available on the PC market have attracted attention as a cost effective solution for parallel (SIMD) computing. CUDA is solid evidence of the attention that the major companies are devoting to the field. CUDA is a hardware and software architecture developed by Nvidia for computing on the GPU. It qualifies as a friendly alternative to the approach to GPU computing that was pioneered in the OpenGL environment. We discuss the application of both the CUDA and the OpenGL approach to the simulation of 2-d spin systems (XY model).
Institute of Scientific and Technical Information of China (English)
Shenghan Ren; Xueli Chen; Xu Cao; Shouping Zhu; Jimin Liang
2016-01-01
Simplified spherical harmonics approximation (SPN) equations are widely used in modeling light propagation in biological tissues. However, as the order N increases, the computational burden grows severely. We propose a graphics processing unit (GPU) accelerated framework for the SPN equations. Compared with the conventional central processing unit implementation, increased performance of the GPU framework is obtained with increasing mesh size, with the best speed-up ratio of 25 among the studied cases. The influence of thread distribution on the performance of the GPU framework is also investigated.
General-purpose molecular dynamics simulations on GPU-based clusters
Trott, Christian R.; Winterfeld, Lars; Crozier, Paul S.
2010-01-01
We present a GPU implementation of LAMMPS, a widely-used parallel molecular dynamics (MD) software package, and show 5x to 13x single node speedups versus the CPU-only version of LAMMPS. This new CUDA package for LAMMPS also enables multi-GPU simulation on hybrid heterogeneous clusters, using MPI for inter-node communication, CUDA kernels on the GPU for all methods working with particle data, and standard LAMMPS C++ code for CPU execution. Cell and neighbor list approaches are compared for be...
A Survey Paper on Solving TSP using Ant Colony Optimization on GPU
Directory of Open Access Journals (Sweden)
Khushbu Khatri
2014-12-01
Full Text Available Ant Colony Optimization (ACO) is a meta-heuristic algorithm inspired by nature to solve many combinatorial optimization problems, such as the Travelling Salesman Problem (TSP). There are many versions of ACO used to solve TSP, such as the Ant System, Elitist Ant System, Max-Min Ant System, and Rank-based Ant System algorithms. For improved performance, these methods can be implemented on parallel architectures such as the GPU with the CUDA architecture. The Graphics Processing Unit (GPU) provides a highly parallel and fully programmable platform. GPUs, which have many processing units with an off-chip global memory, can be used for general purpose parallel computation. This paper presents a survey of different approaches to solving TSP using ACO on the GPU.
GPU-Based Tracking Algorithms for the ATLAS High-Level Trigger
Emeliyanov, D; The ATLAS collaboration
2012-01-01
Results on the performance and viability of data-parallel algorithms on Graphics Processing Units (GPUs) in the ATLAS Level 2 trigger system are presented. We describe the existing trigger data preparation and track reconstruction algorithms, motivation for their optimization, GPU-parallelized versions of these algorithms, and a "client-server" solution for hybrid CPU/GPU event processing used for integration of the GPU-oriented algorithms into existing ATLAS trigger software. The resulting speed-up of event processing times obtained with high-luminosity simulated data is presented and discussed.
GPU accelerated multiplication in GF(2m): research on an improved GPU-based parallel field multiplication algorithm
Institute of Scientific and Technical Information of China (English)
曾庆怡; 张明武; 张金霜
2012-01-01
This paper describes the general algorithm for multiplication in GF(2m) implemented in the CUDA programming language for the GPU (Graphics Processing Unit), and presents a new parallel algorithm called NPU-MUL. Compared with the former, the new algorithm greatly reduces the number of atomic operations on GPU global memory. The experimental results on an NVIDIA GeForce GTS 250 graphics card show that the running time of NPU-MUL is one fifth that of the general algorithm.
Next-Generation Sequencing Simulator for Metagenomics Using GPU Computing
Institute of Scientific and Technical Information of China (English)
宣黎明; 韦朝春; 李亦学
2012-01-01
A customizable metagenome simulation system, NeSSM (Next-Generation Sequencing Simulator for Metagenomics), is presented here. By combining all currently available genomes with some initial metagenome sequencing information, the system can create a simulated metagenome closer than ever to reality. To reduce the system running time, we use highly parallel computation on the GPU. Currently, NeSSM supports the 454 and Illumina sequencing platforms.
Su, Lin; Du, Xining; Liu, Tianyu; Xu, X. George
2014-06-01
An electron-photon coupled Monte Carlo code, ARCHER (Accelerated Radiation-transport Computations in Heterogeneous EnviRonments), is being developed at Rensselaer Polytechnic Institute as a software testbed for emerging heterogeneous high performance computers that utilize accelerators such as GPUs. This paper presents the preliminary code development and the testing involving radiation dose related problems. In particular, the paper discusses electron transport simulations using the class-II condensed history method. The considered electron energy ranges from a few hundred keV to 30 MeV. For the photon part, the photoelectric effect, Compton scattering and pair production were modeled. Voxelized geometry was supported. A serial CPU code was first written in C++. The code was then ported to the GPU using the CUDA C 5.0 standard. The hardware involved a desktop PC with an Intel Xeon X5660 CPU and six NVIDIA Tesla M2090 GPUs. The code was tested for a case of a 20 MeV electron beam incident perpendicularly on a water-aluminum-water phantom. The depth and lateral dose profiles were found to agree with results obtained from well-tested MC codes. Using six GPU cards, 6×10^6 electron histories were simulated within 2 seconds. In comparison, the same case running the EGSnrc and MCNPX codes required 1645 seconds and 9213 seconds, respectively. On-going work continues to test the code for different medical applications such as radiotherapy and brachytherapy.
Implementation of Multipattern String Matching Accelerated with GPU for Intrusion Detection System
Nehemia, Rangga; Lim, Charles; Galinium, Maulahikmah; Rinaldi Widianto, Ahmad
2017-04-01
As Internet-related security threats continue to increase in volume and sophistication, existing Intrusion Detection Systems are also being challenged to cope with current Internet developments. A multi-pattern string matching algorithm accelerated with a Graphics Processing Unit (GPU) is utilized to improve the packet scanning performance of the IDS. This paper implements a multi-pattern string matching algorithm, called Parallel Failureless Aho-Corasick, accelerated with a GPU to improve the performance of the IDS. The OpenCL library is used to allow the IDS to support various GPUs, including the popular NVIDIA and AMD GPUs used in our research. The experiment results show that the application of multi-pattern string matching on a GPU accelerated platform provides a speedup of up to 141% in terms of throughput compared to the previous research.
Interior Point Methods on GPU with application to Model Predictive Control
DEFF Research Database (Denmark)
Gade-Nielsen, Nicolai Fog
The goal of this thesis is to investigate the application of interior point methods to solve dynamical optimization problems, using a graphics processing unit (GPU), with a focus on problems arising in Model Predictive Control (MPC). Multi-core processors have been available for over ten years now...... equations of the Hessian matrix. The use of a GPU has been shown to be very efficient in the factorization of dense matrices, and several numeric libraries, which utilize the GPU, have become available during the course of this thesis. We have developed a direct interior point method, which utilizes the GPU...... of different optimization algorithms are available for solving optimization problems. Some of the most common methods are the simplex method and interior point methods. We focus on interior point methods in this thesis, due to their polynomial complexity, and since the use of the simplex method with GPUs have...
An Investigation of the Performance of the Colored Gauss-Seidel Solver on CPU and GPU
Energy Technology Data Exchange (ETDEWEB)
Yoon, Jong Seon; Choi, Hyoung Gwon [Seoul Nat’l Univ. of Science and Technology, Seoul (Korea, Republic of); Jeon, Byoung Jin [Yonsei Univ., Seoul (Korea, Republic of)
2017-02-15
The performance of the colored Gauss–Seidel solver on the CPU and the GPU was investigated for two- and three-dimensional heat conduction problems using different mesh sizes. The heat conduction equation was discretized by the finite difference method and the finite element method. The CPU yielded good performance for small problems, but performance deteriorated for large problems, when the total memory required for the computation exceeded the cache memory. In contrast, the GPU performed better as the mesh size increased because of the latency hiding technique. Further, GPU computation with the colored Gauss–Seidel solver was approximately 7 times faster than that with a single CPU. Furthermore, on the GPU the colored Gauss–Seidel solver was found to be approximately twice as fast as the Jacobi solver.
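A minimal sketch of how a colored (red-black) Gauss–Seidel sweep for a 2-D heat conduction problem maps onto the GPU is shown below. It is an illustration of the general technique rather than the code evaluated in the paper, and assumes a standard 5-point finite difference stencil for a Poisson-type equation on a uniform grid; the kernel is launched twice per iteration, once for each colour, so all updates within a launch are independent.

  __global__ void gsColoredSweep(float *u, const float *f, int nx, int ny,
                                 float h2, int colour)     // colour: 0 = red, 1 = black
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      int j = blockIdx.y * blockDim.y + threadIdx.y;
      if (i <= 0 || j <= 0 || i >= nx - 1 || j >= ny - 1) return;
      if (((i + j) & 1) != colour) return;                 // only touch points of this colour
      int id = j * nx + i;
      u[id] = 0.25f * (u[id - 1] + u[id + 1] + u[id - nx] + u[id + nx] - h2 * f[id]);
  }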
Papadopoulos, Agathoklis; Kostoglou, Kyriaki; Mitsis, Georgios D; Theocharides, Theocharis
2015-01-01
The use of the GPGPU programming paradigm (running CUDA-enabled algorithms on GPU cards) in biomedical engineering and biology-related applications has shown promising results. GPU acceleration can be used to speed up computation-intensive models, such as the mathematical modeling of biological systems, which often requires the use of nonlinear modeling approaches with a large number of free parameters. In this context, we developed a CUDA-enabled version of a model which implements a nonlinear identification approach that combines basis expansions and polynomial-type networks, termed Laguerre-Volterra networks, and can be used in diverse biological applications. The proposed software implementation uses the GPGPU programming paradigm to take advantage of the inherent parallel characteristics of the aforementioned modeling approach and execute the calculations on the GPU card of the host computer system. The initial results of the GPU-based model presented in this work show performance improvements over the original MATLAB model.
GPU TECHNOLOGIES EMBODIED IN PARALLEL SOLVERS OF LINEAR ALGEBRAIC EQUATION SYSTEMS
Directory of Open Access Journals (Sweden)
Sidorov Alexander Vladimirovich
2012-10-01
Full Text Available The author reviews existing freely distributed solvers that run on graphics processing devices. The purpose of this review is to explore the opportunities and limitations of these parallel solvers when applied to the linear algebraic problems that arise at the Research and Educational Centre of Computer Modeling at MSUCE and the Research and Engineering Centre STADYO. The author has explored new applications of the GPU in the PETSc suite and compared them with results generated without the GPU. The research is performed with the CUSP library, developed to solve problems of linear algebra on the GPU. The author has also reviewed the new MAGMA project, which is analogous to LAPACK for the GPU.
Wu, Xin; Koslowski, Axel; Thiel, Walter
2012-07-10
In this work, we demonstrate that semiempirical quantum chemical calculations can be accelerated significantly by leveraging the graphics processing unit (GPU) as a coprocessor on a hybrid multicore CPU-GPU computing platform. Semiempirical calculations using the MNDO, AM1, PM3, OM1, OM2, and OM3 model Hamiltonians were systematically profiled for three types of test systems (fullerenes, water clusters, and solvated crambin) to identify the most time-consuming sections of the code. The corresponding routines were ported to the GPU and optimized employing both existing library functions and a GPU kernel that carries out a sequence of noniterative Jacobi transformations during pseudodiagonalization. The overall computation times for single-point energy calculations and geometry optimizations of large molecules were reduced by one order of magnitude for all methods, as compared to runs on a single CPU core.
Directory of Open Access Journals (Sweden)
G Boroni
2017-03-01
Full Text Available The Lattice Boltzmann Method (LBM) has shown great potential in fluid simulations, but performance issues and difficulties in managing complex boundary conditions have hindered a wider application. The advent of Graphics Processing Unit (GPU) computing offered a possible solution for the performance issue, and methods like the Immersed Boundary (IB) algorithm proved to be a flexible way of handling boundaries. Unfortunately, the implicit IB algorithm makes the LBM implementation on the GPU a non-trivial task. This work presents a fully parallel GPU implementation of LBM in combination with IB. The fluid-boundary interaction is implemented via GPU kernels, using execution configurations and data structures specifically designed to accelerate each code execution. Simulations were validated against experimental and analytical data, showing good agreement and improving the computational time. Substantial reductions in calculation time were achieved, lowering the time required to execute the same model on a CPU by about two orders of magnitude.
SkyAlign: a portable, work-efficient skyline algorithm for multicore and GPU architectures
DEFF Research Database (Denmark)
Bøgh, Kenneth Sejdenfaden; Chester, Sean; Assent, Ira
2016-01-01
The skyline operator determines points in a multidimensional dataset that offer some optimal trade-off. State-of-the-art CPU skyline algorithms exploit quad-tree partitioning with complex branching to minimise the number of point-to-point comparisons. Branch-phobic GPU skyline algorithms rely...... on compute throughput rather than partitioning, but fail to match the performance of sequential algorithms. In this paper, we introduce a new skyline algorithm, SkyAlign, that is designed for the GPU, and a GPU-friendly, grid-based tree structure upon which the algorithm relies. The search tree allows us...... to dramatically reduce the amount of work done by the GPU algorithm by avoiding most point-to-point comparisons at the cost of some compute throughput. This trade-off allows SkyAlign to achieve orders of magnitude faster performance than its predecessors. Moreover, a NUMA-oblivious port of SkyAlign outperforms...
Multi-GPU adaptation of a simulator of heart electric activity
Directory of Open Access Journals (Sweden)
Víctor M. García
2013-12-01
Full Text Available The simulation of the electrical activity of the heart is performed by solving a large system of ordinary differential equations; this takes an enormous amount of computation time. In recent years graphics processing units (GPUs) have been introduced in the field of high performance computing. These powerful computing devices have attracted research groups that need to simulate the electrical activity of the heart. The research group authoring this paper has developed a simulator of cardiac electrical activity that runs on a single GPU. This article describes the adaptation and modification of the simulator to run on multiple GPUs. The results confirm that the technique significantly reduces the execution time compared to that obtained with a single GPU, and allows the solution of larger problems.
GPU technology as a platform for accelerating local complexity analysis of protein sequences.
Papadopoulos, Agathoklis; Kirmitzoglou, Ioannis; Promponas, Vasilis J; Theocharides, Theocharis
2013-01-01
The use of the GPGPU programming paradigm (running CUDA-enabled algorithms on GPU cards) in bioinformatics has shown promising results [1]. As such, a similar approach can be used to speed up other algorithms, such as CAST, a popular tool used for masking low-complexity regions (LCRs) in protein sequences [2], with increased sensitivity. We developed and implemented a CUDA-enabled version (GPU_CAST) of the multi-threaded version of the CAST software first presented in [3] and optimized in [4]. The proposed software implementation uses the nVIDIA CUDA libraries and the GPGPU programming paradigm to take advantage of the inherent parallel characteristics of the CAST algorithm and execute the calculations on the GPU card of the host computer system. The GPU-based implementation presented in this work was compared against the multi-threaded, multi-core optimized version of CAST [4] and yielded speedups of 5x-10x for large protein sequence datasets.
Multi-GPU accelerated three-dimensional FDTD method for electromagnetic simulation.
Nagaoka, Tomoaki; Watanabe, Soichi
2011-01-01
Numerical simulation with a numerical human model using the finite-difference time-domain (FDTD) method has recently been performed in a number of fields in biomedical engineering. To improve the method's calculation speed and realize large-scale computing with the numerical human model, we adapt three-dimensional FDTD code to a multi-GPU environment using the Compute Unified Device Architecture (CUDA). In this study, we used NVIDIA Tesla C2070 boards as the GPGPU devices. The performance of multi-GPU computation is evaluated in comparison with that of a single GPU and a vector supercomputer. The calculation speed with four GPUs was approximately 3.5 times faster than with a single GPU, and was slightly (approx. 1.3 times) slower than with the supercomputer. The calculation speed of the three-dimensional FDTD method using GPUs can improve significantly as the number of GPUs increases.
Multi GPU Performance of Conjugate Gradient Solver with Staggered Fermions in Mixed Precision
Jang, Yong-Chull; Lee, Weonjong
2011-01-01
The GPU has significantly higher performance in single-precision computing than in double precision. Hence, it is important to take maximal advantage of single precision in the CG inverter, using the mixed precision method. We have implemented a mixed precision algorithm in our multi-GPU conjugate gradient solver. The single precision calculation uses half of the memory used by the double precision calculation, which allows twice as fast data transfer in memory I/O. In addition, floating point calculations are 8 times faster in single precision than in double precision. The overall performance of our CUDA code for CG is 145 gigaflops per GPU (GTX480), which does not include the InfiniBand network communication. If we include the InfiniBand communication, the overall performance is 36 gigaflops per GPU (GTX480).
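As a rough sketch of the mixed-precision idea (a defect-correction scheme: run the fast inner CG iterations in single precision while keeping the solution and residual in double precision), the two conversion kernels below are the basic building blocks. This is a generic illustration and not the authors' lattice QCD solver.

  // Convert the double-precision residual into the single-precision right-hand side
  // of the inner CG solve...
  __global__ void toFloat(const double *src, float *dst, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) dst[i] = (float)src[i];
  }

  // ...and accumulate the single-precision correction back into the
  // double-precision solution vector.
  __global__ void accumulate(double *x, const float *dx, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) x[i] += (double)dx[i];
  }

The outer loop recomputes the true residual in double precision after each inner solve, which restores full accuracy while most of the arithmetic and memory traffic stays in single precision.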
Singular value decomposition for collaborative filtering on a GPU
Kato, Kimikazu; Hosino, Tikara
2010-06-01
Collaborative filtering predicts customers' unknown preferences from known preferences. In collaborative filtering computations, a singular value decomposition (SVD) is needed to reduce the size of a large-scale matrix so that the burden of the next computation phase is decreased. In this application, SVD means a roughly approximated factorization of a given matrix into smaller matrices. Webb (a.k.a. Simon Funk) showed an effective algorithm to compute such an SVD in the context of the open competition called the "Netflix Prize". The algorithm uses an iterative method so that the error of the approximation improves in each step of the iteration. We give a GPU version of Webb's algorithm. Our algorithm is implemented in CUDA and shown to be efficient by experiment.
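For orientation, this family of approximate SVDs is typically trained by gradient descent on the known ratings only. The kernel below is a hedged sketch of one such update pass on the GPU, with one thread per rating and unsynchronized (Hogwild-style) updates to the shared factor rows; it is not the authors' code, and the array names, the layout and the lock-free updates are assumptions made for the example.

  // Ratings are stored as (userIdx, itemIdx, rating) triples; P and Q hold the
  // rank-K user and item factor rows in row-major order.
  __global__ void funkSvdStep(const int *userIdx, const int *itemIdx, const float *rating,
                              int nRatings, float *P, float *Q, int K,
                              float lr, float reg)
  {
      int r = blockIdx.x * blockDim.x + threadIdx.x;
      if (r >= nRatings) return;
      float *p = &P[userIdx[r] * K];
      float *q = &Q[itemIdx[r] * K];
      float pred = 0.0f;
      for (int k = 0; k < K; ++k) pred += p[k] * q[k];
      float err = rating[r] - pred;
      for (int k = 0; k < K; ++k) {
          float pk = p[k], qk = q[k];
          p[k] = pk + lr * (err * qk - reg * pk);   // concurrent ratings may race on these rows,
          q[k] = qk + lr * (err * pk - reg * qk);   // which SGD-style training usually tolerates
      }
  }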
GPU-powered Simulation Methodologies for Biological Systems
Directory of Open Access Journals (Sweden)
Dario Pescini
2013-09-01
Full Text Available The study of biological systems has witnessed a pervasive cross-fertilization between experimental investigation and computational methods. This gave rise to the development of new methodologies able to tackle the complexity of biological systems in a quantitative manner. Computer algorithms allow one to faithfully reproduce the dynamics of the corresponding biological system and, at the price of a large number of simulations, it is possible to extensively investigate the system functioning across a wide spectrum of natural conditions. To enable multiple analyses in parallel using cheap, widespread and highly efficient many-core devices, we developed GPU-powered simulation algorithms for stochastic, deterministic and hybrid modeling approaches, so that users with no knowledge of GPU hardware and programming can also easily access the computing power of graphics engines.
MATCHED FILTER COMPUTATION ON FPGA, CELL, AND GPU
Energy Technology Data Exchange (ETDEWEB)
BAKER, ZACHARY K. [Los Alamos National Laboratory; GOKHALE, MAYA B. [Los Alamos National Laboratory; TRIPP, JUSTIN L. [Los Alamos National Laboratory
2007-01-08
The matched filter is an important kernel in the processing of hyperspectral data. The filter enables researchers to sift useful data from instruments that span large frequency bands. In this work, they evaluate the performance of a matched filter algorithm implementation on an FPGA-accelerated co-processor (XD1000), the IBM Cell microprocessor, and the NVIDIA GeForce 6900 GTX GPU graphics card. They provide extensive discussion of the challenges and opportunities afforded by each platform. In particular, they explore the problem of partitioning the filter most efficiently between the host CPU and the co-processor. Using their results, they derive several performance metrics that provide the optimal solution for a variety of application situations.
GPU-based ultra fast IMRT plan optimization
Men, Chunhua; Choi, Dongju; Majumdar, Amitava; Zheng, Ziyi; Mueller, Klaus; Jiang, Steve B
2009-01-01
The widespread adoption of on-board volumetric imaging in cancer radiotherapy has stimulated research efforts to develop online adaptive radiotherapy techniques to handle the inter-fraction variation of the patient's geometry. Such efforts face major technical challenges to perform treatment planning in real-time. To overcome this challenge, we are developing a supercomputing online re-planning environment (SCORE) at the University of California San Diego (UCSD). As part of the SCORE project, this paper presents our work on the implementation of an intensity modulated radiation therapy (IMRT) optimization algorithm on graphics processing units (GPUs). We adopt a penalty-based quadratic optimization model, which is solved by using a gradient projection method with Armijo's line search rule. Our optimization algorithm has been implemented in CUDA for parallel GPU computing as well as in C for serial CPU computing for comparison purpose. A prostate IMRT case with various beamlet and voxel sizes was used to evalu...
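As a small illustration of the gradient projection step at the core of the optimization described above (take a gradient step on the beamlet intensities, then project back onto the non-negative orthant), the generic kernel below updates one beamlet weight per thread. It is a sketch of the standard technique, not the SCORE implementation, and the step length would come from Armijo's line search on the host.

  // One projected gradient step for non-negative beamlet intensities:
  // x <- max(0, x - step * grad), applied element-wise.
  __global__ void projectedGradientStep(float *x, const float *grad, float step, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) x[i] = fmaxf(0.0f, x[i] - step * grad[i]);
  }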
Explicit Integration with GPU Acceleration for Large Kinetic Networks
Brock, Benjamin; Billings, Jay Jay; Guidry, Mike
2014-01-01
We demonstrate the first implementation of recently-developed fast explicit kinetic integration algorithms on modern graphics processing unit (GPU) accelerators. Taking as a generic test case a Type Ia supernova explosion with an extremely stiff thermonuclear network having 150 isotopic species and 1604 reactions coupled to hydrodynamics using operator splitting, we demonstrate the capability to solve of order 100 realistic kinetic networks in parallel in the same time that standard implicit methods can solve a single such network on a CPU. This orders-of-magnitude decrease in compute time for solving systems of realistic kinetic networks implies that important coupled, multiphysics problems in various scientific and technical fields that were intractable, or could be simulated only with highly schematic kinetic networks, are now computationally feasible.
GPU Accelerated Real-Time Collision Handling in Virtual Disassembly
Institute of Scientific and Technical Information of China (English)
Peng Du; Jie-Yi Zhao; Wan-Bin Pan; Yi-Gang Wang
2015-01-01
Previous collision detection methods for virtual disassembly mainly detect collisions at discrete time intervals and use oriented bounding boxes to speed up the process. However, these discrete methods cannot guarantee that no penetration occurs when the components move. Meanwhile, because some of the components are embedded into each other, these components cannot be separated in the subsequent process. To solve these problems, we propose an approach for real-time collision handling that utilizes the computational power of modern GPUs. First we present a novel GPU-based collision handling framework for virtual disassembly. Second we use a collision-streams based continuous collision detection to guarantee that no collision is missed. Finally we introduce a triangle intersection detection algorithm to solve the problem that collisions cannot be detected when the components are embedded into each other in the initial configuration. The experimental results show that our method improves the overall performance of collision detection and achieves real-time simulation.
Accelerating separable footprint (SF) forward and back projection on GPU
Xie, Xiaobin; McGaffin, Madison G.; Long, Yong; Fessler, Jeffrey A.; Wen, Minhua; Lin, James
2017-03-01
Statistical image reconstruction (SIR) methods for X-ray CT can improve image quality and reduce radiation dosages over conventional reconstruction methods, such as filtered back projection (FBP). However, SIR methods require much longer computation time. The separable footprint (SF) forward and back projection technique simplifies the calculation of intersecting volumes of image voxels and finite-size beams in a way that is both accurate and efficient for parallel implementation. We propose a new method to accelerate the SF forward and back projection on the GPU with NVIDIA's CUDA environment. For the forward projection, we parallelize over all detector cells. For the back projection, we parallelize over all 3D image voxels. The simulation results show that the proposed method is faster than the acceleration method of the SF projectors proposed by Wu and Fessler [13]. We further accelerate the proposed method using multiple GPUs. The results show that the computation time is reduced approximately in proportion to the number of GPUs.
Random number generators for massively parallel simulations on GPU
Manssen, Markus; Hartmann, Alexander K
2012-01-01
High-performance streams of (pseudo) random numbers are crucial for the efficient implementation of countless stochastic algorithms, most importantly Monte Carlo simulations and molecular dynamics simulations with stochastic thermostats. A number of implementations of random number generators have been discussed for GPU platforms before, and some generators are even included in the CUDA support libraries. Nevertheless, not all of these generators are well suited for highly parallel applications where each thread requires its own generator instance. For this specific situation, encountered for instance in simulations of lattice models, most of the high-quality generators with large states, such as the Mersenne twister, cannot be used efficiently without substantial changes. We provide a broad review of existing CUDA variants of random-number generators and present the CUDA implementation of a new massively parallel, high-quality, high-performance generator with a small memory load overhead.
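For context, the per-thread generator pattern discussed above is also what the cuRAND device API exposes. The hedged sketch below is a generic illustration of that pattern (not the generator proposed in the paper): each thread receives its own state, placed on a distinct subsequence so the streams do not overlap.

  #include <curand_kernel.h>

  // Each thread initialises its own generator state on a separate subsequence...
  __global__ void initStates(curandState *states, unsigned long long seed, int n)
  {
      int tid = blockIdx.x * blockDim.x + threadIdx.x;
      if (tid < n) curand_init(seed, tid, 0, &states[tid]);
  }

  // ...and then draws uniform numbers independently of all other threads.
  __global__ void draw(curandState *states, float *out, int n)
  {
      int tid = blockIdx.x * blockDim.x + threadIdx.x;
      if (tid < n) out[tid] = curand_uniform(&states[tid]);
  }

Keeping the per-thread state small is exactly the constraint the abstract points to: generators with very large states, such as the Mersenne twister, do not fit this one-state-per-thread model without substantial changes.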
Heterogeneous Highly Parallel Implementation of Matrix Exponentiation Using GPU
Raja, Chittampally Vasanth; Raghavendra, Prakash S; 10.5121/ijdps.2012.3209
2012-01-01
The vision of a supercomputer on every desk can be realized by powerful and highly parallel CPUs, GPUs or APUs. Graphics processors, once specialized for graphics applications only, are now used for highly computation-intensive general purpose applications. Very expensive GFLOPS and TFLOPS performance has become very cheap with GPGPUs. The current work focuses mainly on a highly parallel implementation of matrix exponentiation. Matrix exponentiation is widely used in many areas of the scientific community, ranging from highly critical flight and CAD simulations to financial and statistical applications. The proposed solution for matrix exponentiation uses OpenCL to exploit the hyper-parallelism offered by many-core GPGPUs. It employs many general GPU optimizations and architecture-specific optimizations. This experimentation covers the optimizations targeted specifically at scientific graphics cards (Tesla C2050). The heterogeneous highly parallel matrix exponentiation method has been tested for matrices of ...
Directory of Open Access Journals (Sweden)
Christley Scott
2010-08-01
Full Text Available Background: Simulation of sophisticated biological models requires considerable computational power. These models typically integrate together numerous biological phenomena such as spatially-explicit heterogeneous cells, cell-cell interactions, cell-environment interactions and intracellular gene networks. The recent advent of programming for graphics processing units (GPUs) opens up the possibility of developing more integrative, detailed and predictive biological models while at the same time decreasing the computational cost of simulating those models. Results: We construct a 3D model of epidermal development and provide a set of GPU algorithms that executes significantly faster than sequential central processing unit (CPU) code. We provide a parallel implementation of the subcellular element method for individual cells residing in a lattice-free spatial environment. Each cell in our epidermal model includes an internal gene network, which integrates cellular interaction of Notch signaling together with environmental interaction of basement membrane adhesion, to specify cellular state and behaviors such as growth and division. We take a pedagogical approach to describing how modeling methods are efficiently implemented on the GPU, including the memory layout of data structures and functional decomposition. We discuss various programmatic issues and provide a set of design guidelines for GPU programming that are instructive for avoiding common pitfalls as well as for extracting performance from the GPU architecture. Conclusions: We demonstrate that GPU algorithms represent a significant technological advance for the simulation of complex biological models. We further demonstrate with our epidermal model that the integration of multiple complex modeling methods for heterogeneous multicellular biological processes is both feasible and computationally tractable using this new technology. We hope that the provided algorithms and source code will be a
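One of the design guidelines such GPU ports typically stress is the memory layout of data structures: a structure-of-arrays layout lets neighbouring threads read neighbouring addresses, so loads are coalesced. The sketch below illustrates this general point with a hypothetical cell-position update; it is not taken from the epidermal model's source code, and all names are invented for the example.

  // Array-of-structures: thread i touches scattered addresses (poorly coalesced).
  struct CellAoS { float x, y, z, radius; };

  // Structure-of-arrays: thread i reads x[i], so a warp reads one contiguous block.
  struct CellsSoA { float *x, *y, *z, *radius; };

  __global__ void moveCells(CellsSoA cells, const float *dx, const float *dy,
                            const float *dz, int nCells)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i >= nCells) return;
      cells.x[i] += dx[i];
      cells.y[i] += dy[i];
      cells.z[i] += dz[i];
  }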
GPU-Acceleration of Parallel Unconditionally Stable Group Explicit Finite Difference Method
Parand, K.; Zafarvahedian, Saeed; Hossayni, Sayyed A.
2013-01-01
Graphics Processing Units (GPUs) are high performance co-processors originally intended to improve the use and quality of computer graphics applications. Once researchers and practitioners noticed the potential of using GPUs for general purposes, GPU applications were extended from graphics applications to other fields. The main objective of this paper is to evaluate the impact of using the GPU in the solution of the transient diffusion-type equation by the parallel and stable group explicit finite...
Efficient simulation of diffusion-based choice RT models on CPU and GPU.
Verdonck, Stijn; Meers, Kristof; Tuerlinckx, Francis
2016-03-01
In this paper, we present software for the efficient simulation of a broad class of linear and nonlinear diffusion models for choice RT, using either CPU or graphical processing unit (GPU) technology. The software is readily accessible from the popular scripting languages MATLAB and R (both 64-bit). The speed obtained on a single high-end GPU is comparable to that of a small CPU cluster, bringing standard statistical inference of complex diffusion models to the desktop platform.
Cloutier, B; Rigge, P
2012-01-01
A comparison of PGI OpenACC, FORTRAN CUDA, and Nvidia CUDA pseudospectral methods on a single GPU and GCC FORTRAN on single and multiple CPU cores is reported. The GPU implementations use CuFFT and the CPU implementations use FFTW. Porting pre-existing FORTRAN codes to utilize a GPU is efficient and easy to implement with OpenACC and CUDA FORTRAN. Example programs are provided.
Techniques for Mapping Synthetic Aperture Radar Processing Algorithms to Multi-GPU Clusters
2012-12-01
Eric Hayden; Mark Schmalz; William Chapman; Sanjay ...
This paper presents a design for parallel processing of synthetic aperture radar (SAR) data using multiple Graphics Processing Units (GPUs). Portions of the computation that are suited for threaded (parallel) execution are labeled as kernels using syntax specified by the GPU programming language (e.g., CUDA for an ...). Our ...
High energy electromagnetic particle transportation on the GPU
Canal, P.; Elvira, D.; Jun, S. Y.; Kowalkowski, J.; Paterno, M.; Apostolakis, J.
2014-06-01
We present massively parallel high energy electromagnetic particle transportation through a finely segmented detector on a Graphics Processing Unit (GPU). Simulating events of energetic particle decay in a general-purpose high energy physics (HEP) detector requires intensive computing resources, due to the complexity of the geometry as well as the physics processes applied to particles copiously produced by primary collisions and secondary interactions. The recent advent of many-core and accelerated processor architectures provides a variety of concurrent programming models applicable not only to high performance parallel computing, but also to conventional compute-intensive applications such as HEP detector simulation. The components of our prototype are a transportation process under a non-uniform magnetic field, geometry navigation with a set of solid shapes and materials, electromagnetic physics processes for electrons and photons, and an interface to a framework that dispatches bundles of tracks in a highly vectorized manner, optimizing for spatial locality and throughput. Core algorithms and methods are excerpted from the Geant4 toolkit, and are modified and optimized for the GPU application. Program kernels written in C/C++ are designed to be compatible with CUDA and OpenCL and to be generic enough for easy porting to future programming models and hardware architectures. To improve throughput by overlapping data transfers with kernel execution, multiple CUDA streams are used. Issues with floating point accuracy, random number generation, data structures, kernel divergence and register spills are also considered. Performance evaluation of the relative speedup compared to the corresponding sequential execution on a CPU is presented as well.
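The overlap of data transfers with kernel execution mentioned in this record is the standard multi-stream pattern in CUDA. The following minimal sketch is not taken from the paper; the bundle size and the toy per-track kernel are assumptions. It shows how pinned host buffers, per-stream asynchronous copies, and per-stream kernel launches let copies in one stream proceed while kernels from another stream run.

```cuda
// overlap_streams.cu -- illustrative sketch of copy/compute overlap with CUDA streams.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Toy "transport" kernel: stands in for the per-track physics step.
__global__ void stepTracks(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 0.5f + 1.0f;   // placeholder arithmetic
}

int main() {
    const int nStreams = 4;                     // one bundle of tracks per stream (assumption)
    const int bundle   = 1 << 20;               // tracks per bundle (assumption)
    const int total    = nStreams * bundle;

    float *h_in, *h_out, *d_in, *d_out;
    cudaMallocHost(&h_in,  total * sizeof(float));   // pinned memory enables truly async copies
    cudaMallocHost(&h_out, total * sizeof(float));
    cudaMalloc(&d_in,  total * sizeof(float));
    cudaMalloc(&d_out, total * sizeof(float));
    for (int i = 0; i < total; ++i) h_in[i] = 1.0f;

    std::vector<cudaStream_t> streams(nStreams);
    for (auto& s : streams) cudaStreamCreate(&s);

    // Each stream copies its bundle in, runs the kernel, and copies results back.
    // Copies in one stream overlap with kernels in another.
    for (int s = 0; s < nStreams; ++s) {
        int off = s * bundle;
        cudaMemcpyAsync(d_in + off, h_in + off, bundle * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        stepTracks<<<(bundle + 255) / 256, 256, 0, streams[s]>>>(d_in + off, d_out + off, bundle);
        cudaMemcpyAsync(h_out + off, d_out + off, bundle * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();
    printf("h_out[0] = %f\n", h_out[0]);

    for (auto& s : streams) cudaStreamDestroy(s);
    cudaFree(d_in); cudaFree(d_out);
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    return 0;
}
```

The pinned (page-locked) host allocations are what allow the asynchronous copies to overlap with kernel execution; with pageable memory the copies would serialize.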
GPU-based parallel clustered differential pulse code modulation
Wu, Jiaji; Li, Wenze; Kong, Wanqiu
2015-10-01
Hyperspectral remote sensing technology is widely used in marine remote sensing, geological exploration, and atmospheric and environmental remote sensing. Owing to its rapid development, the resolution of hyperspectral images has increased greatly, and their data size is growing accordingly. To reduce storage and transmission costs, lossless compression of hyperspectral images has become an important research topic. In recent years, many algorithms have been proposed to reduce the redundancy between different spectra; among them, the most classical and extensible is the Clustered Differential Pulse Code Modulation (C-DPCM) algorithm. The algorithm has three parts: it first clusters all spectral lines and trains linear predictors for each band, then uses these predictors to predict pixels and obtains the residual image by subtracting the predicted image from the original image, and finally encodes the residual image. However, calculating the predictors is time-consuming. To improve the processing speed, we propose a parallel C-DPCM based on CUDA (Compute Unified Device Architecture) on the GPU. General-purpose computing on GPUs has developed rapidly in recent years; GPU capacity improves by increasing the number of processing units and storage control units. CUDA is a parallel computing platform and programming model created by NVIDIA that gives developers direct access to the virtual instruction set and memory of the parallel computational elements in GPUs. Our core idea is to perform the calculation of the predictors in parallel. By respectively adopting global memory, shared memory, and register memory, we finally obtain a decent speedup.
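As a rough illustration of the parallel prediction step described above, the sketch below assigns one GPU thread per pixel of the current band and applies a per-cluster linear predictor to the previous bands to form the residual. The array layout, band count, cluster count, and use of managed memory are assumptions; the clustering, predictor training, and entropy-coding stages of C-DPCM are omitted.

```cuda
// cdpcm_residual.cu -- illustrative sketch of a parallel prediction/residual step.
#include <cuda_runtime.h>
#include <cstdio>

// One thread per pixel of the current band. Each pixel belongs to a cluster (label[]),
// and each cluster has P predictor coefficients; the prediction is a linear combination
// of the P previous bands, and the residual is original minus prediction.
__global__ void predictResidual(const float* bands,   // [P+1][npix], band P is the current one
                                const int*   label,   // [npix] cluster index per pixel
                                const float* coeff,   // [nclusters][P] predictor coefficients
                                float*       residual,
                                int npix, int P) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= npix) return;
    const float* c = coeff + label[i] * P;
    float pred = 0.0f;
    for (int b = 0; b < P; ++b)
        pred += c[b] * bands[b * npix + i];
    residual[i] = bands[P * npix + i] - pred;
}

int main() {
    const int npix = 512 * 512, P = 4, nclusters = 8;   // sizes are assumptions
    float *bands, *coeff, *residual; int *label;
    cudaMallocManaged(&bands,    (P + 1) * npix * sizeof(float));
    cudaMallocManaged(&coeff,    nclusters * P * sizeof(float));
    cudaMallocManaged(&residual, npix * sizeof(float));
    cudaMallocManaged(&label,    npix * sizeof(int));
    for (int i = 0; i < (P + 1) * npix; ++i) bands[i] = 1.0f;
    for (int i = 0; i < nclusters * P; ++i)  coeff[i] = 0.25f;
    for (int i = 0; i < npix; ++i)           label[i] = i % nclusters;

    predictResidual<<<(npix + 255) / 256, 256>>>(bands, label, coeff, residual, npix, P);
    cudaDeviceSynchronize();
    printf("residual[0] = %f\n", residual[0]);   // 1 - 4*0.25*1 = 0 for this test data
    cudaFree(bands); cudaFree(coeff); cudaFree(residual); cudaFree(label);
    return 0;
}
```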
GPU accelerated dynamic functional connectivity analysis for functional MRI data.
Akgün, Devrim; Sakoğlu, Ünal; Esquivel, Johnny; Adinoff, Bryon; Mete, Mutlu
2015-07-01
Recent advances in multi-core processors and graphics card based computational technologies have paved the way for an improved and dynamic utilization of parallel computing techniques. Numerous applications have been implemented for the acceleration of computationally-intensive problems in various computational science fields including bioinformatics, in which big data problems are prevalent. In neuroimaging, dynamic functional connectivity (DFC) analysis is a computationally demanding method used to investigate dynamic functional interactions among different brain regions or networks identified with functional magnetic resonance imaging (fMRI) data. In this study, we implemented and analyzed a parallel DFC algorithm based on thread-based and block-based approaches. The thread-based approach was designed to parallelize DFC computations and was implemented in both Open Multi-Processing (OpenMP) and Compute Unified Device Architecture (CUDA) programming platforms. Another approach developed in this study to better utilize the CUDA architecture is the block-based approach, where parallelization involves smaller parts of fMRI time-courses obtained by sliding windows. Experimental results showed that the proposed parallel design solutions enabled by the GPUs significantly reduce the computation time for DFC analysis. The multi-core implementation using OpenMP on an 8-core processor provides up to a 7.7× speed-up. The GPU implementation using CUDA yielded substantial accelerations ranging from 18.5× to 157× speed-up once thread-based and block-based approaches were combined in the analysis. The proposed parallel programming solutions showed that multi-core processor and CUDA-supported GPU implementations accelerated the DFC analyses significantly. The developed algorithms make DFC analyses more practical for multi-subject studies with more dynamic analyses.
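The core sliding-window computation in a DFC analysis can be sketched as below: one GPU thread evaluates the Pearson correlation of two fMRI time courses over one window position. The window length, time-series length, and single region pair are assumptions made for illustration; a full analysis correlates many region pairs across many windows and subjects.

```cuda
// dfc_sliding_corr.cu -- minimal sketch of thread-parallel sliding-window correlation.
#include <cuda_runtime.h>
#include <cstdio>
#include <cmath>

// One thread per sliding window: Pearson correlation of x and y over [w, w+W).
__global__ void slidingCorr(const float* x, const float* y, float* r, int T, int W) {
    int w = blockIdx.x * blockDim.x + threadIdx.x;
    if (w > T - W) return;
    float sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (int t = w; t < w + W; ++t) {
        sx += x[t]; sy += y[t];
        sxx += x[t] * x[t]; syy += y[t] * y[t]; sxy += x[t] * y[t];
    }
    float n = (float)W;
    float cov = sxy - sx * sy / n;
    float vx  = sxx - sx * sx / n;
    float vy  = syy - sy * sy / n;
    r[w] = cov / sqrtf(vx * vy + 1e-12f);   // small epsilon guards against zero variance
}

int main() {
    const int T = 600, W = 30;               // time points and window length (assumptions)
    float *x, *y, *r;
    cudaMallocManaged(&x, T * sizeof(float));
    cudaMallocManaged(&y, T * sizeof(float));
    cudaMallocManaged(&r, (T - W + 1) * sizeof(float));
    for (int t = 0; t < T; ++t) { x[t] = sinf(0.1f * t); y[t] = sinf(0.1f * t + 0.3f); }

    slidingCorr<<<(T - W + 1 + 127) / 128, 128>>>(x, y, r, T, W);
    cudaDeviceSynchronize();
    printf("r[0] = %f\n", r[0]);
    cudaFree(x); cudaFree(y); cudaFree(r);
    return 0;
}
```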
Zhang, Bo; Yang, Xiang; Yang, Fei; Yang, Xin; Qin, Chenghu; Han, Dong; Ma, Xibo; Liu, Kai; Tian, Jie
2010-09-13
In molecular imaging (MI), especially optical molecular imaging, bioluminescence tomography (BLT) emerges as an effective imaging modality for small animal imaging. The finite element methods (FEMs), especially the adaptive finite element (AFE) framework, play an important role in BLT. The processing speed of the FEMs and the AFE framework still needs to be improved, although multi-thread CPU and multi-CPU technologies have already been applied. In this paper, we introduce for the first time a new kind of acceleration technology for the AFE framework in BLT, using the graphics processing unit (GPU). Besides the processing speed, GPU technology offers a good balance between cost and performance. CUBLAS and CULA are two important and powerful libraries for programming on NVIDIA GPUs. With the help of CUBLAS and CULA, it is easy to code for NVIDIA GPUs without worrying about the details of the hardware environment of a specific GPU. Numerical experiments are designed to show the necessity, effect and application of the proposed CUBLAS- and CULA-based GPU acceleration. From the results of the experiments, we conclude that the proposed CUBLAS- and CULA-based GPU acceleration method can considerably improve the processing speed of the AFE framework while maintaining a balance between cost and performance.
GPU-based prompt gamma ray imaging from boron neutron capture therapy
Energy Technology Data Exchange (ETDEWEB)
Yoon, Do-Kun; Jung, Joo-Young; Suk Suh, Tae, E-mail: suhsanta@catholic.ac.kr [Department of Biomedical Engineering and Research Institute of Biomedical Engineering, College of Medicine, Catholic University of Korea, Seoul 505 137-701 (Korea, Republic of); Jo Hong, Key [Molecular Imaging Program at Stanford (MIPS), Department of Radiology, Stanford University, 300 Pasteur Drive, Stanford, California 94305 (United States); Sil Lee, Keum [Department of Radiation Oncology, Stanford University School of Medicine, 875 Blake Wilbur Drive, Stanford, California 94305-5847 (United States)
2015-01-15
Purpose: The purpose of this research is to perform the fast reconstruction of a prompt gamma ray image using a graphics processing unit (GPU) computation from boron neutron capture therapy (BNCT) simulations. Methods: To evaluate the accuracy of the reconstructed image, a phantom including four boron uptake regions (BURs) was used in the simulation. After the Monte Carlo simulation of the BNCT, the modified ordered subset expectation maximization reconstruction algorithm using the GPU computation was used to reconstruct the images with fewer projections. The computation times for image reconstruction were compared between the GPU and the central processing unit (CPU). Also, the accuracy of the reconstructed image was evaluated by a receiver operating characteristic (ROC) curve analysis. Results: The image reconstruction time using the GPU was 196 times faster than the conventional reconstruction time using the CPU. For the four BURs, the area under curve values from the ROC curve were 0.6726 (A-region), 0.6890 (B-region), 0.7384 (C-region), and 0.8009 (D-region). Conclusions: The tomographic image using the prompt gamma ray event from the BNCT simulation was acquired using the GPU computation in order to perform a fast reconstruction during treatment. The authors verified the feasibility of the prompt gamma ray image reconstruction using the GPU computation for BNCT simulations.
GPU-based parallel algorithm for blind image restoration using midfrequency-based methods
Xie, Lang; Luo, Yi-han; Bao, Qi-liang
2013-08-01
GPU-based general-purpose computing is a new branch of modern parallel computing, so the study of parallel algorithms specially designed for GPU hardware architecture is of great significance. In order to solve the problem of high computational complexity and poor real-time performance in blind image restoration, the midfrequency-based algorithm for blind image restoration was analyzed and improved in this paper. Furthermore, a midfrequency-based filtering method is also used to restore the image with hardly any recursion or iteration. By combining the algorithm's data intensiveness and data-parallel nature with the GPU's single-instruction, multiple-thread execution model, a new parallel midfrequency-based algorithm for blind image restoration is proposed in this paper, which is suitable for stream computing on the GPU. In this algorithm, the GPU is utilized to accelerate the estimation of class-G point spread functions and the midfrequency-based filtering. To better manage the GPU threads, the threads in a grid are scheduled according to the decomposition of the filtering data in the frequency domain, after optimization of data access and of the communication between the host and the device. The kernel parallelism structure is determined by the decomposition of the filtering data so as to sustain the transmission rate and work around the memory bandwidth limitation. The results show that, with the new algorithm, the operational speed is significantly increased and the real-time performance of image restoration is effectively improved, especially for high-resolution images.
GPU-accelerated molecular dynamics simulation for study of liquid crystalline flows
Sunarso, Alfeus; Tsuji, Tomohiro; Chono, Shigeomi
2010-08-01
We have developed a GPU-based molecular dynamics simulation for the study of flows of fluids with anisotropic molecules such as liquid crystals. An application of the simulation to the study of macroscopic flow (backflow) generation by molecular reorientation in a nematic liquid crystal under the application of an electric field is presented. The computations of intermolecular force and torque are parallelized on the GPU using the cell-list method, and an efficient algorithm to update the cell lists was proposed. Some important issues in the implementation of computations that involve a large number of arithmetic operations and data on the GPU that has limited high-speed memory resources are addressed extensively. Despite the relatively low GPU occupancy in the calculation of intermolecular force and torque, the computation on a recent GPU is about 50 times faster than that on a single core of a recent CPU, thus simulations involving a large number of molecules using a personal computer are possible. The GPU-based simulation should allow an extensive investigation of the molecular-level mechanisms underlying various macroscopic flow phenomena in fluids with anisotropic molecules.
Fast distributed large-pixel-count hologram computation using a GPU cluster.
Pan, Yuechao; Xu, Xuewu; Liang, Xinan
2013-09-10
Large-pixel-count holograms are one essential part for big size holographic three-dimensional (3D) display, but the generation of such holograms is computationally demanding. In order to address this issue, we have built a graphics processing unit (GPU) cluster with 32.5 Tflop/s computing power and implemented distributed hologram computation on it with speed improvement techniques, such as shared memory on GPU, GPU level adaptive load balancing, and node level load distribution. Using these speed improvement techniques on the GPU cluster, we have achieved 71.4 times computation speed increase for 186M-pixel holograms. Furthermore, we have used the approaches of diffraction limits and subdivision of holograms to overcome the GPU memory limit in computing large-pixel-count holograms. 745M-pixel and 1.80G-pixel holograms were computed in 343 and 3326 s, respectively, for more than 2 million object points with RGB colors. Color 3D objects with 1.02M points were successfully reconstructed from 186M-pixel hologram computed in 8.82 s with all the above three speed improvement techniques. It is shown that distributed hologram computation using a GPU cluster is a promising approach to increase the computation speed of large-pixel-count holograms for large size holographic display.
Exploring Graphics Processing Unit (GPU) Resource Sharing Efficiency for High Performance Computing
Directory of Open Access Journals (Sweden)
Teng Li
2013-11-01
Full Text Available The increasing incorporation of Graphics Processing Units (GPUs as accelerators has been one of the forefront High Performance Computing (HPC trends and provides unprecedented performance; however, the prevalent adoption of the Single-Program Multiple-Data (SPMD programming model brings with it challenges of resource underutilization. In other words, under SPMD, every CPU needs GPU capability available to it. However, since CPUs generally outnumber GPUs, the asymmetric resource distribution gives rise to overall computing resource underutilization. In this paper, we propose to efficiently share the GPU under SPMD and formally define a series of GPU sharing scenarios. We provide performance-modeling analysis for each sharing scenario with accurate experimentation validation. With the modeling basis, we further conduct experimental studies to explore potential GPU sharing efficiency improvements from multiple perspectives. Both further theoretical and experimental GPU sharing performance analysis and results are presented. Our results not only demonstrate the significant performance gain for SPMD programs with the proposed efficient GPU sharing, but also the further improved sharing efficiency with the optimization techniques based on our accurate modeling.
GMC: a GPU implementation of a Monte Carlo dose calculation based on Geant4.
Jahnke, Lennart; Fleckenstein, Jens; Wenz, Frederik; Hesser, Jürgen
2012-03-07
We present a GPU implementation called GMC (GPU Monte Carlo) of the low-energy electromagnetic part of Geant4, built on the CUDA programming interface. The classes for electron and photon interactions as well as a new parallel particle transport engine were implemented. The way a particle is processed is not in a history-by-history manner but rather by an interaction-by-interaction method. Every history is divided into steps that are then calculated in parallel by different kernels. The geometry package is currently limited to voxelized geometries. A modified parallel Mersenne twister was used to generate random numbers, and a random number repetition method on the GPU was introduced. All phantom results showed a very good agreement between GPU and CPU simulation, with gamma indices of >97.5% for a 2%/2 mm gamma criterion. The mean acceleration on one GTX 580 for all cases compared to Geant4 on one CPU core was 4860. The mean number of histories per millisecond on the GPU for all cases was 658, leading to a total simulation time for one intensity-modulated radiation therapy dose distribution of 349 s. In conclusion, Geant4-based Monte Carlo dose calculations were significantly accelerated on the GPU.
Real-time GPU surface curvature estimation on deforming meshes and volumetric data sets.
Griffin, Wesley; Wang, Yu; Berrios, David; Olano, Marc
2012-10-01
Surface curvature is used in a number of areas in computer graphics, including texture synthesis and shape representation, mesh simplification, surface modeling, and nonphotorealistic line drawing. Most real-time applications must estimate curvature on a triangular mesh. This estimation has been limited to CPU algorithms, forcing object geometry to reside in main memory. However, as more computational work is done directly on the GPU, it is increasingly common for object geometry to exist only in GPU memory. Examples include vertex skinned animations and isosurfaces from GPU-based surface reconstruction algorithms. For static models, curvature can be precomputed and CPU algorithms are a reasonable choice. For deforming models where the geometry only resides on the GPU, transferring the deformed mesh back to the CPU limits performance. We introduce a GPU algorithm for estimating curvature in real time on arbitrary triangular meshes. We demonstrate our algorithm with curvature-based NPR feature lines and a curvature-based approximation for an ambient occlusion. We show curvature computation on volumetric data sets with a GPU isosurface extraction algorithm and vertex-skinned animations. We present a graphics pipeline and CUDA implementation. Our curvature estimation is up to ~18x faster than a multithreaded CPU benchmark.
Calculation of HELAS amplitudes for QCD processes using graphics processing unit (GPU)
Hagiwara, K; Okamura, N; Rainwater, D L; Stelzer, T
2009-01-01
We use a graphics processing unit (GPU) for fast calculations of helicity amplitudes of quark and gluon scattering processes in massless QCD. New HEGET (HELAS Evaluation with GPU Enhanced Technology) codes for gluon self-interactions are introduced, and a C++ program to convert the MadGraph generated FORTRAN codes into HEGET codes in CUDA (a C-platform for general purpose computing on GPU) is created. Because of the proliferation of the number of Feynman diagrams and the number of independent color amplitudes, the maximum number of final state jets we can evaluate on a GPU is limited to 4 for pure gluon processes ($gg\to 4g$), or 5 for processes with one or more quark lines such as $q\bar{q}\to 5g$ and $qq\to qq+3g$. Compared with the usual CPU-based programs, we obtain 60-100 times better performance on the GPU, except for 5-jet production processes and the $gg\to 4g$ processes for which the GPU gain over the CPU is about 20.
A new embedded solution of hyperspectral data processing platform: the embedded GPU computer
Zhang, Lei; Gao, Jiao Bo; Hu, Yu; Sun, Ke Feng; Wang, Ying Hui; Cheng, Juan; Sun, Dan Dan; Li, Yu
2016-10-01
In research on hyper-spectral imaging spectrometers, processing the huge amount of image data, on the order of several hundred megabytes per second, is a difficult problem for all researchers. Traditional embedded hyper-spectral data processing platforms such as DSPs and FPGAs have their own drawbacks. With the development of GPUs, parallel computing on the GPU is increasingly applied to large-scale data processing. In this paper, we propose a new embedded solution for a hyper-spectral data processing platform which is based on an embedded GPU computer. We also give a detailed discussion of how to acquire and process hyper-spectral data on the embedded GPU computer. We use C++ AMP technology to control the GPU and schedule the parallel computation. Experimental results show that hyper-spectral data processing on the embedded GPU computer is considerably faster than on an ordinary computer. Our research is significant for the engineering application of hyper-spectral imaging spectrometers.
Fast Clustering of Radar Reflectivity Data Using GPU-CPU Pipeline Scheme
Institute of Scientific and Technical Information of China (English)
周伟; 施宁; 王健; 汪群山
2012-01-01
In our meteorological application, a clustering algorithm is used to analyze and process radar reflectivity data. Faced with a large dataset and a high-dimensional feature vector, the clustering algorithm is too time-consuming to satisfy the real-time constraints of our application. This paper proposes a parallelized clustering algorithm using a GPU-CPU pipeline scheme to solve this problem. Our method exploits asynchronous execution between the GPU and CPU and organizes the clustering steps into a pipeline, which exposes much more of the parallelism in the algorithm. Experimental results show that our GPU-CPU pipelined clustering algorithm outperforms a conventional CUDA parallelization without the GPU-CPU pipeline by 38%. Compared with the serial code on the CPU, our approach achieves a 47x speedup, which satisfies the real-time requirements of meteorological analysis applications.
Multi-GPU implementation of a VMAT treatment plan optimization algorithm
Energy Technology Data Exchange (ETDEWEB)
Tian, Zhen, E-mail: Zhen.Tian@UTSouthwestern.edu, E-mail: Xun.Jia@UTSouthwestern.edu, E-mail: Steve.Jiang@UTSouthwestern.edu; Folkerts, Michael; Tan, Jun; Jia, Xun, E-mail: Zhen.Tian@UTSouthwestern.edu, E-mail: Xun.Jia@UTSouthwestern.edu, E-mail: Steve.Jiang@UTSouthwestern.edu; Jiang, Steve B., E-mail: Zhen.Tian@UTSouthwestern.edu, E-mail: Xun.Jia@UTSouthwestern.edu, E-mail: Steve.Jiang@UTSouthwestern.edu [Department of Radiation Oncology, University of Texas Southwestern Medical Center, Dallas, Texas 75390 (United States); Peng, Fei [Computer Science Department, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213 (United States)
2015-06-15
Purpose: Volumetric modulated arc therapy (VMAT) optimization is a computationally challenging problem due to its large data size, high degrees of freedom, and many hardware constraints. High-performance graphics processing units (GPUs) have been used to speed up the computations. However, a GPU’s relatively small memory size cannot handle cases with a large dose-deposition coefficient (DDC) matrix, such as those with a large target size, multiple targets, multiple arcs, and/or a small beamlet size. The main purpose of this paper is to report an implementation of a column-generation-based VMAT algorithm, previously developed in the authors’ group, on a multi-GPU platform to solve the memory limitation problem. While the column-generation-based VMAT algorithm has been previously developed, the GPU implementation details have not been reported. Hence, another purpose is to present detailed techniques employed for the GPU implementation. The authors also would like to utilize this particular problem as an example problem to study the feasibility of using a multi-GPU platform to solve large-scale problems in medical physics. Methods: The column-generation approach generates VMAT apertures sequentially by solving a pricing problem (PP) and a master problem (MP) iteratively. In the authors’ method, the sparse DDC matrix is first stored on a CPU in coordinate list (COO) format. On the GPU side, this matrix is split into four submatrices according to beam angles, which are stored on four GPUs in compressed sparse row format. Computation of the beamlet price, the first step in the PP, is accomplished using multiple GPUs. A fast inter-GPU data transfer scheme is accomplished using peer-to-peer access. The remaining steps of the PP and MP are implemented on a CPU or a single GPU due to their modest problem scale and computational loads. The Barzilai-Borwein algorithm with a subspace step scheme is adopted here to solve the MP. A head and neck (H and N) cancer case is
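The inter-GPU transfer scheme mentioned in this record relies on peer-to-peer access. The sketch below is not the authors' code; the device count, buffer size, and gather direction are assumptions. It only shows the typical calls: checking and enabling peer access for a device pair, then copying a partial result directly between device memories without staging through the host.

```cuda
// p2p_sketch.cu -- illustrative multi-GPU peer-to-peer transfer of partial results.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int nDev = 0;
    cudaGetDeviceCount(&nDev);
    if (nDev < 2) { printf("need at least 2 GPUs for this sketch\n"); return 0; }

    const int n = 1 << 20;
    float* d_buf[2];

    // Allocate one partial-result buffer per GPU and enable mutual peer access,
    // so each device can read/write the other's memory over NVLink/PCIe.
    for (int d = 0; d < 2; ++d) {
        cudaSetDevice(d);
        cudaMalloc(&d_buf[d], n * sizeof(float));
        cudaMemset(d_buf[d], 0, n * sizeof(float));
        int canAccess = 0;
        cudaDeviceCanAccessPeer(&canAccess, d, 1 - d);
        if (canAccess) cudaDeviceEnablePeerAccess(1 - d, 0);
    }

    // ... each GPU would compute its partial quantity (e.g., beamlet prices) here ...

    // Gather device 1's partial buffer onto device 0 directly (this sketch simply
    // overwrites; a real gather would copy into a distinct region or combine).
    cudaSetDevice(0);
    cudaMemcpyPeer(d_buf[0], 0, d_buf[1], 1, n * sizeof(float));
    cudaDeviceSynchronize();
    printf("peer copy done\n");

    for (int d = 0; d < 2; ++d) { cudaSetDevice(d); cudaFree(d_buf[d]); }
    return 0;
}
```

cudaMemcpyPeer falls back to staging through host memory if peer access is unavailable, so the call is safe either way; enabling peer access is what makes it a direct device-to-device transfer.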
Lossless data compression for improving the performance of a GPU-based beamformer.
Lok, U-Wai; Fan, Gang-Wei; Li, Pai-Chi
2015-04-01
The powerful parallel computation ability of a graphics processing unit (GPU) makes it feasible to perform dynamic receive beamforming. However, a real-time GPU-based beamformer requires a high data rate to transfer radio-frequency (RF) data from hardware to software memory, as well as from central processing unit (CPU) to GPU memory. There are data compression methods (e.g. Joint Photographic Experts Group (JPEG)) available for the hardware front end to reduce data size, alleviating the data transfer requirement of the hardware interface. Nevertheless, the required decoding time may even be larger than the transmission time of the original data, in turn degrading the overall performance of the GPU-based beamformer. This article proposes and implements a lossless compression-decompression algorithm, which enables compression and decompression of data in parallel. By this means, the data transfer requirement of the hardware interface and the transmission time of CPU to GPU data transfers are reduced, without sacrificing image quality. In simulation results, the compression ratio reached around 1.7. The encoder design of our lossless compression approach requires low hardware resources and reasonable latency in a field programmable gate array. In addition, the transmission time of transferring data from CPU to GPU with the parallel decoding process improved by threefold, as compared with transferring original uncompressed data. These results show that our proposed lossless compression plus parallel decoder approach not only mitigates the transmission bandwidth requirement to transfer data from the hardware front end to the software system but also reduces the transmission time for CPU to GPU data transfer. © The Author(s) 2014.
GPU accelerated generation of digitally reconstructed radiographs for 2-D/3-D image registration.
Dorgham, Osama M; Laycock, Stephen D; Fisher, Mark H
2012-09-01
Recent advances in programming languages for graphics processing units (GPUs) provide developers with a convenient way of implementing applications which can be executed on the CPU and GPU interchangeably. GPUs are becoming relatively cheap, powerful, and widely available hardware components, which can be used to perform intensive calculations. The last decade of hardware performance developments shows that GPU-based computation is progressing significantly faster than CPU-based computation, particularly if one considers the execution of highly parallelisable algorithms. Future predictions illustrate that this trend is likely to continue. In this paper, we introduce a way of accelerating 2-D/3-D image registration by developing a hybrid system which executes on the CPU and utilizes the GPU for parallelizing the generation of digitally reconstructed radiographs (DRRs). Based on the advancements of the GPU over the CPU, it is timely to exploit the benefits of many-core GPU technology by developing algorithms for DRR generation. Although some previous work has investigated the rendering of DRRs using the GPU, this paper investigates approximations which reduce the computational overhead while still maintaining a quality consistent with that needed for 2-D/3-D registration with sufficient accuracy to be clinically acceptable in certain applications of radiation oncology. Furthermore, by comparing implementations of 2-D/3-D registration on the CPU and GPU, we investigate current performance and propose an optimal framework for PC implementations addressing the rigid registration problem. Using this framework, we are able to render DRR images from a 256×256×133 CT volume in ~24 ms using an NVidia GeForce 8800 GTX and in ~2 ms using NVidia GeForce GTX 580. In addition to applications requiring fast automatic patient setup, these levels of performance suggest image-guided radiation therapy at video frame rates is technically feasible using relatively low cost PC
Revisiting Molecular Dynamics on a CPU/GPU system: Water Kernel and SHAKE Parallelization.
Ruymgaart, A Peter; Elber, Ron
2012-11-13
We report Graphics Processing Unit (GPU) and Open-MP parallel implementations of water-specific force calculations and of bond constraints for use in Molecular Dynamics simulations. We focus on a typical laboratory computing environment in which a CPU with a few cores is attached to a GPU. We discuss in detail the design of the code and we illustrate performance comparable to highly optimized codes such as GROMACS. Besides speed, our code shows excellent energy conservation. Utilization of water-specific lists allows the efficient calculation of non-bonded interactions that include water molecules and results in a speed-up factor of more than 40 on the GPU compared to code optimized on a single CPU core for systems larger than 20,000 atoms. This is up four-fold from a factor of 10 reported in our initial GPU implementation that did not include a water-specific code. Another optimization is the implementation of constrained dynamics entirely on the GPU. The routine, which enforces constraints of all bonds, runs in parallel on multiple Open-MP cores or entirely on the GPU. It is based on Conjugate Gradient solution of the Lagrange multipliers (CG SHAKE). The GPU implementation is partially in double precision and requires no communication with the CPU during the execution of the SHAKE algorithm. The (parallel) implementation of SHAKE allows an increase of the time step to 2.0 fs while maintaining excellent energy conservation. Interestingly, CG SHAKE is faster than the usual bond relaxation algorithm even on a single core if high accuracy is expected. The significant speedup of the optimized components transfers the computational bottleneck of the MD calculation to the reciprocal part of Particle Mesh Ewald (PME).
Adaptive multi-GPU Exchange Monte Carlo for the 3D Random Field Ising Model
Navarro, Cristóbal A.; Huang, Wei; Deng, Youjin
2016-08-01
This work presents an adaptive multi-GPU Exchange Monte Carlo approach for the simulation of the 3D Random Field Ising Model (RFIM). The design is based on a two-level parallelization. The first level, spin-level parallelism, maps the parallel computation as optimal 3D thread-blocks that simulate blocks of spins in shared memory with minimal halo surface, assuming a constant block volume. The second level, replica-level parallelism, uses multi-GPU computation to handle the simulation of an ensemble of replicas. CUDA's concurrent kernel execution feature is used in order to fill the occupancy of each GPU with many replicas, providing a performance boost that is more noticeable at the smallest values of L. In addition to the two-level parallel design, the work proposes an adaptive multi-GPU approach that dynamically builds a proper temperature set free of exchange bottlenecks. The strategy is based on mid-point insertions at the temperature gaps where the exchange rate is most compromised. The extra work generated by the insertions is balanced across the GPUs independently of where the mid-point insertions were performed. Performance results show that spin-level performance is approximately two orders of magnitude faster than a single-core CPU version and one order of magnitude faster than a parallel multi-core CPU version running on 16 cores. Multi-GPU performance is highly favorable under a weak scaling setting, reaching up to 99% efficiency as long as the number of GPUs and L increase together. The combination of the adaptive approach with the parallel multi-GPU design has extended our possibilities of simulation to sizes of L = 32, 64 for a workstation with two GPUs. Sizes beyond L = 64 can eventually be studied using larger multi-GPU systems.
Richmond, Paul; Buesing, Lars; Giugliano, Michele; Vasilaki, Eleni
2011-05-04
High performance computing on the Graphics Processing Unit (GPU) is an emerging field driven by the promise of high computational power at a low cost. However, GPU programming is a non-trivial task and moreover architectural limitations raise the question of whether investing effort in this direction may be worthwhile. In this work, we use GPU programming to simulate a two-layer network of Integrate-and-Fire neurons with varying degrees of recurrent connectivity and investigate its ability to learn a simplified navigation task using a policy-gradient learning rule stemming from Reinforcement Learning. The purpose of this paper is twofold. First, we want to support the use of GPUs in the field of Computational Neuroscience. Second, using GPU computing power, we investigate the conditions under which the said architecture and learning rule demonstrate best performance. Our work indicates that networks featuring strong Mexican-Hat-shaped recurrent connections in the top layer, where decision making is governed by the formation of a stable activity bump in the neural population (a "non-democratic" mechanism), achieve mediocre learning results at best. In absence of recurrent connections, where all neurons "vote" independently ("democratic") for a decision via population vector readout, the task is generally learned better and more robustly. Our study would have been extremely difficult on a desktop computer without the use of GPU programming. We present the routines developed for this purpose and show that a speed improvement of 5x up to 42x is provided versus optimised Python code. The higher speed is achieved when we exploit the parallelism of the GPU in the search of learning parameters. This suggests that efficient GPU programming can significantly reduce the time needed for simulating networks of spiking neurons, particularly when multiple parameter configurations are investigated.
Energy Technology Data Exchange (ETDEWEB)
Arafat, Humayun [Department of Computer Science and Engineering, The Ohio State University, Columbus OH USA; Dinan, James [Mathematics and Computer Science Division, Argonne National Laboratory, Lemont IL USA; Krishnamoorthy, Sriram [Computer Science and Mathematics Division, Pacific Northwest National Laboratory, Richland WA USA; Balaji, Pavan [Mathematics and Computer Science Division, Argonne National Laboratory, Lemont IL USA; Sadayappan, P. [Department of Computer Science and Engineering, The Ohio State University, Columbus OH USA
2016-01-06
Task parallelism is an attractive approach to automatically load balance the computation in a parallel system and adapt to dynamism exhibited by parallel systems. Exploiting task parallelism through work stealing has been extensively studied in shared and distributed-memory contexts. In this paper, we study the design of a system that uses work stealing for dynamic load balancing of task-parallel programs executed on hybrid distributed-memory CPU-graphics processing unit (GPU) systems in a global-address space framework. We take into account the unique nature of the accelerator model employed by GPUs, the significant performance difference between GPU and CPU execution as a function of problem size, and the distinct CPU and GPU memory domains. We consider various alternatives in designing a distributed work stealing algorithm for CPU-GPU systems, while taking into account the impact of task distribution and data movement overheads. These strategies are evaluated using microbenchmarks that capture various execution configurations as well as the state-of-the-art CCSD(T) application module from the computational chemistry domain.
GPU implementation issues for fast unmixing of hyperspectral images
Legendre, Maxime; Capriotti, Luca; Schmidt, Frédéric; Moussaoui, Saïd; Schmidt, Albrecht
2013-04-01
Space missions usually use hyperspectral imaging techniques to analyse the composition of planetary surfaces. Missions such as ESA's Mars Express and Venus Express generate extensive datasets whose processing demands so far have exceeded the resources available to many researchers. To overcome this limitation, the challenge is to develop numerical methods that exploit the potential of modern computing hardware. The processing of a hyperspectral image consists of the identification of the observed surface components and eventually the assessment of their fractional abundances inside each pixel area. In this latter case, the problem is referred to as spectral unmixing. This work focuses on a supervised unmixing approach where the relevant component spectra are supposed to be part of an available spectral library. Therefore, the question addressed here is reduced to the estimation of the fractional abundances, or abundance maps. It requires the solution of a large-scale optimization problem subject to linear constraints: positivity of the abundances and their partial/full additivity (sum less than or equal to one). Conventional approaches to such a problem usually suffer from a high computational overhead. Recently, an interior-point optimization using a primal-dual approach has been proven an efficient method to solve this spectral unmixing problem at reduced computational cost. This is achieved with a parallel implementation based on Graphics Processing Units (GPUs). Several issues are discussed, such as the data organization in memory and the strategy used to efficiently compute one global quantity from a large dataset in parallel. Every step of the algorithm is optimized to be GPU-efficient. Finally, the main steps of the global system for the processing of a large number of hyperspectral images are discussed. The advantage of using a GPU is demonstrated by unmixing a large dataset consisting of 1300 hyperspectral images from Mars Express' OMEGA instrument
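Computing one global quantity from a large dataset in parallel, as mentioned in this record, is normally done with a tree reduction. The following minimal sketch (array size, block size, and the final host-side accumulation are assumptions, not the authors' implementation) reduces per-block partial sums in shared memory and finishes the sum on the CPU.

```cuda
// reduce_sketch.cu -- minimal two-stage parallel sum: the pattern behind computing
// a single global quantity (e.g., a dual objective) from a large dataset.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void blockSum(const float* x, float* partial, int n) {
    extern __shared__ float s[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x * 2 + tid;
    float v = 0.0f;
    if (i < n) v = x[i];
    if (i + blockDim.x < n) v += x[i + blockDim.x];   // each thread loads two elements
    s[tid] = v;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {   // tree reduction in shared memory
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = s[0];
}

int main() {
    const int n = 1 << 22;
    const int threads = 256;
    const int blocks = (n + threads * 2 - 1) / (threads * 2);
    float *x, *partial;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&partial, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f;

    blockSum<<<blocks, threads, threads * sizeof(float)>>>(x, partial, n);
    cudaDeviceSynchronize();

    double total = 0.0;                       // the few partial sums are accumulated on the host
    for (int b = 0; b < blocks; ++b) total += partial[b];
    printf("sum = %.0f (expected %d)\n", total, n);
    cudaFree(x); cudaFree(partial);
    return 0;
}
```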
Katz, B. G.; Eppert, S.; Lohmann, D.; Li, S.; Goteti, G.; Kaheil, Y. H.
2011-12-01
At 4,400 meters, Mount Rainier has been the point of origin for several major lahar events. The largest event, termed the "Osceola Mudflow," occurred 5,500 years ago and covered an area of approximately 550 km2 with a total volume of deposited material from 2 to 4 km3. Particularly deadly, large lahars are estimated to have maximum flow velocities of 100 km/h with a density often described as "flowing concrete." While rare, these events typically cause total destruction within a lahar inundation zone. It is estimated that approximately 150,000 people live on top of previous deposits left by lahars, which can be triggered by anything from earthquakes to glacial and chemical erosion of volcanic bedrock over time to liquefaction caused by extreme rainfall events. A novel methodology utilizing a 2-dimensional hydraulic model has been implemented, allowing high resolution (30 m) lahar inundation maps to be generated. The utility of this model, above or in addition to other methodologies such as that of Iverson (1998), lies in its portability to other lahar zones as well as its ability to model any total volume specified by the user. The process for generating lahar flood plains requires few inputs, including a Digital Terrain Map (DTM) of any resolution, a mask defining the locations for lahar genesis, a raster of friction coefficients, and a time series depicting uniform material accumulation over the genesis mask which is allowed to flow down-slope. Finally, a significant improvement in speed has been made for solving the two-dimensional model by utilizing the latest in graphics processing unit (GPU) technology, which has resulted in a greater than 200 times speed-up in model run time over previous CPU-based methods. The model runs for the Osceola Mudflow compare favorably with USGS inundation regions derived using field measurements and GIS-based approaches such as the LAHARZ program suite. The overall gradation of low to high risk matches well; however, the new
GPU accelerated simulations of 3D deterministic particle transport using discrete ordinates method
Gong, Chunye; Liu, Jie; Chi, Lihua; Huang, Haowei; Fang, Jingyue; Gong, Zhenghu
2011-07-01
The Graphics Processing Unit (GPU), originally developed for real-time, high-definition 3D graphics in computer games, now provides great capability for scientific applications. The basis of particle transport simulation is the time-dependent, multi-group, inhomogeneous Boltzmann transport equation. The numerical solution to the Boltzmann equation involves the discrete ordinates (Sn) method and the procedure of source iteration. In this paper, we present a GPU accelerated simulation of one-energy-group, time-independent, deterministic discrete ordinates particle transport in 3D Cartesian geometry (Sweep3D). The performance of the GPU simulations is reported for simulations with a vacuum boundary condition. The relative advantages and disadvantages of the GPU implementation, the simulation on multiple GPUs, the programming effort and code portability are also discussed. The results show that the overall performance speedup of one NVIDIA Tesla M2050 GPU ranges from 2.56 compared with one Intel Xeon X5670 chip to 8.14 compared with one Intel Core Q6600 chip for no flux fixup. The simulation with flux fixup on one M2050 is 1.23 times faster than on one X5670.
Toward Performance Portability of the FV3 Weather Model on CPU, GPU and MIC Processors
Govett, Mark; Rosinski, James; Middlecoff, Jacques; Schramm, Julie; Stringer, Lynd; Yu, Yonggang; Harrop, Chris
2017-04-01
The U.S. National Weather Service has selected the FV3 (Finite Volume cubed) dynamical core to become part of its next global operational weather prediction model. While the NWS is preparing to run FV3 operationally in late 2017, NOAA's Earth System Research Laboratory is adapting the model to be capable of running on next-generation GPU and MIC processors. The FV3 model was designed in the 1990s, and while it has been extensively optimized for traditional CPU chips, some code refactoring has been required to expose sufficient parallelism needed to run on fine-grain GPU processors. The code transformations must demonstrate bit-wise reproducible results with the original CPU code, and between CPU, GPU and MIC processors. We will describe the parallelization and performance while attempting to maintain performance portability between CPU, GPU and MIC with the Fortran source code. Performance results will be shown using NOAA's new Pascal based fine-grain GPU system (800 GPUs), and for the Knights Landing processor on the National Science Foundation (NSF) Stampede-2 system.
A Hybrid CPU/GPU Pattern-Matching Algorithm for Deep Packet Inspection.
Directory of Open Access Journals (Sweden)
Chun-Liang Lee
Full Text Available The large quantities of data now being transferred via high-speed networks have made deep packet inspection indispensable for security purposes. Scalable and low-cost signature-based network intrusion detection systems have been developed for deep packet inspection on various software platforms. Traditional approaches that only involve central processing units (CPUs) are now considered inadequate in terms of inspection speed. Graphics processing units (GPUs) have superior parallel processing power, but transmission bottlenecks can reduce optimal GPU efficiency. In this paper we describe our proposal for a hybrid CPU/GPU pattern-matching algorithm (HPMA) that divides and distributes the packet-inspecting workload between a CPU and GPU. All packets are initially inspected by the CPU and filtered using a simple pre-filtering algorithm, and packets that might contain malicious content are sent to the GPU for further inspection. Test results indicate that in terms of random payload traffic, the matching speed of our proposed algorithm was 3.4 times and 2.7 times faster than those of the AC-CPU and AC-GPU algorithms, respectively. Further, HPMA achieved higher energy efficiency than the other tested algorithms.
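A much-simplified sketch of this hybrid idea, offered only as an illustration rather than HPMA itself, is shown below: the CPU applies a cheap pre-filter (here, merely checking for the first byte of a single fixed pattern) and only the surviving packets are batched to a naive GPU matching kernel. The packet length, pattern, and batching scheme are assumptions.

```cuda
// hybrid_match_sketch.cu -- CPU pre-filter followed by naive GPU pattern matching.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstring>
#include <vector>

#define PKT_LEN 256
__constant__ char d_pattern[16];

// One thread per suspicious packet: naive search for the full pattern.
__global__ void matchPackets(const char* pkts, int nPkts, int patLen, int* hits) {
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= nPkts) return;
    const char* pkt = pkts + p * PKT_LEN;
    int found = 0;
    for (int i = 0; i + patLen <= PKT_LEN && !found; ++i) {
        int ok = 1;
        for (int j = 0; j < patLen; ++j)
            if (pkt[i + j] != d_pattern[j]) { ok = 0; break; }
        found = ok;
    }
    hits[p] = found;
}

int main() {
    const char pattern[] = "attack";                         // hypothetical signature
    const int patLen = 6, nTotal = 1000;
    std::vector<char> packets(nTotal * PKT_LEN, 'x');
    memcpy(&packets[7 * PKT_LEN + 100], pattern, patLen);    // plant one match

    // CPU pre-filter: forward only packets containing the pattern's first byte.
    std::vector<char> suspicious;
    for (int p = 0; p < nTotal; ++p) {
        const char* pkt = &packets[p * PKT_LEN];
        if (memchr(pkt, pattern[0], PKT_LEN))
            suspicious.insert(suspicious.end(), pkt, pkt + PKT_LEN);
    }
    int nSusp = (int)(suspicious.size() / PKT_LEN);
    printf("%d of %d packets forwarded to the GPU\n", nSusp, nTotal);
    if (nSusp == 0) return 0;

    char* d_pkts; int* d_hits;
    cudaMalloc(&d_pkts, suspicious.size());
    cudaMalloc(&d_hits, nSusp * sizeof(int));
    cudaMemcpyToSymbol(d_pattern, pattern, patLen);
    cudaMemcpy(d_pkts, suspicious.data(), suspicious.size(), cudaMemcpyHostToDevice);

    matchPackets<<<(nSusp + 127) / 128, 128>>>(d_pkts, nSusp, patLen, d_hits);

    std::vector<int> hits(nSusp);
    cudaMemcpy(hits.data(), d_hits, nSusp * sizeof(int), cudaMemcpyDeviceToHost);
    int total = 0; for (int h : hits) total += h;
    printf("GPU confirmed %d malicious packet(s)\n", total);
    cudaFree(d_pkts); cudaFree(d_hits);
    return 0;
}
```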
Lokavarapu, H. V.; Matsui, H.
2015-12-01
The convection and magnetic field of the Earth's outer core are expected to span a vast range of length scales, and resolving these flows requires high performance computing. In geodynamo simulations based on the spherical harmonic transform (SHT), a significant portion of the execution time is spent on the Legendre transform. Calypso is a geodynamo code designed to model magnetohydrodynamics of a Boussinesq fluid in a rotating spherical shell, such as the outer core of the Earth. The code has been shown to scale well on computer clusters on the order of 10⁵ cores using Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) parallelization for CPUs. To further optimize, we investigate three different algorithms for the SHT using GPUs. One is to preemptively compute the Legendre polynomials on the CPU before executing the SHT on the GPU within the time integration loop. In the second approach, both the Legendre polynomials and the SHT are computed on the GPU. In the third approach, we initially partition the radial grid for the forward transform and the harmonic order for the backward transform between the CPU and GPU. Thereafter, the partitioned work is computed simultaneously in the time integration loop. We examine the trade-offs between space and time, memory bandwidth and GPU computations on Maverick, a Texas Advanced Computing Center (TACC) supercomputer. We have observed improved performance using a GPU-enabled Legendre transform. Furthermore, we will compare and contrast the different algorithms in the context of GPUs.
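A toy version of the first strategy, precomputing the Legendre polynomials on the CPU and performing the transform sums on the GPU, is sketched below. It uses ordinary Legendre polynomials and a simple midpoint quadrature rather than Calypso's associated-Legendre and Gauss-point setup, so it only illustrates the division of work between host and device, not the code discussed in this record.

```cuda
// legendre_sketch.cu -- CPU tabulates P_l(x_j); GPU evaluates the transform sums.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// One thread per degree l: coefficient c_l = (2l+1)/2 * sum_j f(x_j) P_l(x_j) w_j.
__global__ void forwardTransform(const float* P, const float* f, const float* w,
                                 float* c, int L, int nGrid) {
    int l = blockIdx.x * blockDim.x + threadIdx.x;
    if (l > L) return;
    float s = 0.0f;
    for (int j = 0; j < nGrid; ++j)
        s += f[j] * P[l * nGrid + j] * w[j];
    c[l] = 0.5f * (2 * l + 1) * s;
}

int main() {
    const int L = 15, nGrid = 2048;                          // sizes are assumptions
    std::vector<float> P((L + 1) * nGrid), f(nGrid), w(nGrid);

    // Host side: tabulate P_l(x_j) with the three-term recurrence before the time loop.
    for (int j = 0; j < nGrid; ++j) {
        float x = -1.0f + (j + 0.5f) * (2.0f / nGrid);       // midpoint rule on [-1,1]
        w[j] = 2.0f / nGrid;
        f[j] = 3.0f * x * x;                                 // test function: P_0(x) + 2*P_2(x)
        P[0 * nGrid + j] = 1.0f;
        P[1 * nGrid + j] = x;
        for (int l = 2; l <= L; ++l)
            P[l * nGrid + j] = ((2 * l - 1) * x * P[(l - 1) * nGrid + j]
                                - (l - 1) * P[(l - 2) * nGrid + j]) / l;
    }

    float *dP, *df, *dw, *dc;
    cudaMalloc(&dP, P.size() * sizeof(float));
    cudaMalloc(&df, nGrid * sizeof(float));
    cudaMalloc(&dw, nGrid * sizeof(float));
    cudaMalloc(&dc, (L + 1) * sizeof(float));
    cudaMemcpy(dP, P.data(), P.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(df, f.data(), nGrid * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dw, w.data(), nGrid * sizeof(float), cudaMemcpyHostToDevice);

    forwardTransform<<<1, L + 1>>>(dP, df, dw, dc, L, nGrid);
    std::vector<float> c(L + 1);
    cudaMemcpy(c.data(), dc, (L + 1) * sizeof(float), cudaMemcpyDeviceToHost);
    printf("c0=%.3f c1=%.3f c2=%.3f (expect ~1, 0, 2)\n", c[0], c[1], c[2]);
    cudaFree(dP); cudaFree(df); cudaFree(dw); cudaFree(dc);
    return 0;
}
```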
Geant4-based Monte Carlo simulations on GPU for medical applications.
Bert, Julien; Perez-Ponce, Hector; El Bitar, Ziad; Jan, Sébastien; Boursier, Yannick; Vintache, Damien; Bonissent, Alain; Morel, Christian; Brasse, David; Visvikis, Dimitris
2013-08-21
Monte Carlo simulation (MCS) plays a key role in medical applications, especially for emission tomography and radiotherapy. However, MCS is also associated with long calculation times that prevent its use in routine clinical practice. Recently, graphics processing units (GPUs) have become, in many domains, a low cost alternative for acquiring high computational power. The objective of this work was to develop an efficient framework for the implementation of MCS on GPU architectures. Geant4 was chosen as the MCS engine given the large variety of physics processes available for targeting different medical imaging and radiotherapy applications. In addition, Geant4 is the MCS engine behind GATE, which is currently the most popular simulation platform for medical applications. We propose the definition of a global strategy and associated structures for such a GPU based simulation implementation. Different photon and electron physics effects are resolved on the fly directly on the GPU without any approximations with respect to Geant4. Validations have shown equivalence in the underlying photon and electron physics processes between the Geant4 and the GPU codes, with a speedup factor of 80-90. More clinically realistic simulations in emission and transmission imaging led to acceleration factors of 400-800, respectively, compared to corresponding GATE simulations.
Application of Photon Transport Monte Carlo Module with GPU-based Parallel System
Energy Technology Data Exchange (ETDEWEB)
Park, Chang Je [Sejong University, Seoul (Korea, Republic of); Shon, Heejeong [Golden Eng. Co. LTD, Seoul (Korea, Republic of); Lee, Donghak [CoCo Link Inc., Seoul (Korea, Republic of)
2015-05-15
In general, it takes a lot of computing time to get reliable results in Monte Carlo simulations, especially in deep penetration problems with a thick shielding medium. To mitigate this weakness of Monte Carlo methods, many variance reduction algorithms have been proposed, including geometry splitting and Russian roulette, weight windows, exponential transform, and forced collision. Simultaneously, advanced computing hardware such as GPU (Graphics Processing Unit)-based parallel machines is used to get better performance from the Monte Carlo simulation. A GPU is much easier to access and to manage than a CPU cluster system, and it has also become less expensive due to advances in computer technology. Therefore, many engineering areas have adopted GPU-based massively parallel computation techniques. This work applies such an approach to a GPU-based photon transport Monte Carlo module. It provides almost a 30-fold speedup without any optimization, and almost a 200-fold speedup is expected with a fully supported GPU system. It is expected that a GPU system with an advanced parallelization algorithm will contribute successfully to the development of Monte Carlo modules which require quick and accurate simulations.
Airborne SAR Real-time Imaging Algorithm Design and Implementation with CUDA on NVIDIA GPU
Directory of Open Access Journals (Sweden)
Meng Da-di
2013-12-01
Full Text Available Synthetic Aperture Radar (SAR) image processing requires a huge amount of computation. Traditionally, this task runs on CPU-based workstations or servers and is rather time-consuming, so real-time processing of SAR data is impossible. Based on Compute Unified Device Architecture (CUDA) technology, a new design for the SAR imaging algorithm running on an NVIDIA Graphics Processing Unit (GPU) is proposed. The new proposal makes it possible for the data processing procedure and CPU/GPU data exchange to execute concurrently, especially when the SAR data size exceeds the total GPU global memory size. Multiple GPUs are also supported, and all computational resources are fully exploited. Experiments on an NVIDIA K20C and an Intel E5645 show that the proposed solution accelerates SAR data processing by tens of times. Consequently, the GPU based SAR processing system with the proposed solution embedded is much more power-efficient and portable, which makes it suitable as a real-time SAR data processing system. Experiments show that 36 mega-points of SAR data can be processed per second by the K20C with the new solution, meeting the real-time requirement.
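When the dataset exceeds GPU global memory, concurrent processing and CPU/GPU data exchange of the kind described above is typically realized with double-buffered chunking. The sketch below is only an illustration of that pattern, not the system in this record; the chunk size, the stand-in kernel, and the two ping-pong device buffers are assumptions. The copy of one chunk overlaps the kernel running on the previous chunk.

```cuda
// sar_chunks_sketch.cu -- double-buffered, out-of-core chunk processing with streams.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

__global__ void processChunk(float* d, int n) {               // stand-in for the imaging step
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f;
}

int main() {
    const int chunk = 1 << 20;                                 // samples per chunk (assumption)
    const int nChunks = 8;                                     // dataset = nChunks * chunk samples
    std::vector<float> host(nChunks * (size_t)chunk, 1.0f);

    float* d_buf[2];
    cudaStream_t stream[2];
    for (int b = 0; b < 2; ++b) {
        cudaMalloc(&d_buf[b], chunk * sizeof(float));
        cudaStreamCreate(&stream[b]);
    }
    // Pin the host buffer so the asynchronous copies can overlap with kernels.
    cudaHostRegister(host.data(), host.size() * sizeof(float), cudaHostRegisterDefault);

    for (int c = 0; c < nChunks; ++c) {
        int b = c & 1;                                         // ping-pong between the two device buffers
        cudaMemcpyAsync(d_buf[b], host.data() + (size_t)c * chunk,
                        chunk * sizeof(float), cudaMemcpyHostToDevice, stream[b]);
        processChunk<<<(chunk + 255) / 256, 256, 0, stream[b]>>>(d_buf[b], chunk);
        cudaMemcpyAsync(host.data() + (size_t)c * chunk, d_buf[b],
                        chunk * sizeof(float), cudaMemcpyDeviceToHost, stream[b]);
    }
    cudaDeviceSynchronize();
    printf("host[0] = %f\n", host[0]);                         // 2.0 after processing

    cudaHostUnregister(host.data());
    for (int b = 0; b < 2; ++b) { cudaFree(d_buf[b]); cudaStreamDestroy(stream[b]); }
    return 0;
}
```

Because operations within one stream are serialized, reusing a buffer two chunks later is safe: the next copy into it cannot start until the previous device-to-host copy in the same stream has finished.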
The development of GPU-based parallel PRNG for Monte Carlo applications in CUDA Fortran
Directory of Open Access Journals (Sweden)
Hamed Kargaran
2016-04-01
Full Text Available The implementation of Monte Carlo simulation in CUDA Fortran requires fast random number generation with good statistical properties on the GPU. In this study, a GPU-based parallel pseudo random number generator (GPPRNG) has been proposed for use in high performance computing systems. According to the type of GPU memory usage, the GPU scheme is divided into two work modes, GLOBAL_MODE and SHARED_MODE. To generate parallel random numbers based on the independent sequence method, a combination of the middle-square method and a chaotic map along with the Xorshift PRNG has been employed. Implementation of the developed GPPRNG on a single GPU showed speedups of 150x and 470x (with respect to the speed of the PRNG on a single CPU core) for GLOBAL_MODE and SHARED_MODE, respectively. To evaluate the accuracy of the developed GPPRNG, its performance was compared to that of some other commercially available PRNGs, such as those of MATLAB and FORTRAN and the Miller-Park algorithm, using specific standard tests. The results of this comparison showed that the GPPRNG developed in this study can be used as a fast and accurate tool for computational science applications.
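For reference, a minimal per-thread Xorshift generator on the GPU looks like the sketch below. The seeding here is a simple per-thread offset, not the middle-square/chaotic-map seeding of the GPPRNG described above, and the Monte Carlo pi estimate is only a throwaway consumer of the random stream.

```cuda
// xorshift_sketch.cu -- per-thread Xorshift stream with a toy Monte Carlo consumer.
#include <cuda_runtime.h>
#include <cstdio>

__device__ unsigned int xorshift32(unsigned int& s) {     // classic 32-bit Xorshift step
    s ^= s << 13;
    s ^= s >> 17;
    s ^= s << 5;
    return s;
}

// Each thread carries its own state and estimates pi by sampling the unit square.
__global__ void mcPi(unsigned long long* hits, int samplesPerThread, unsigned int seed) {
    unsigned int s = seed ^ (blockIdx.x * blockDim.x + threadIdx.x + 1u);
    if (s == 0) s = 1u;                                    // Xorshift must not start at zero
    unsigned long long local = 0;
    for (int i = 0; i < samplesPerThread; ++i) {
        float x = xorshift32(s) * (1.0f / 4294967296.0f);
        float y = xorshift32(s) * (1.0f / 4294967296.0f);
        if (x * x + y * y < 1.0f) ++local;
    }
    atomicAdd(hits, local);
}

int main() {
    const int blocks = 64, threads = 256, samples = 4096;
    unsigned long long* hits;
    cudaMallocManaged(&hits, sizeof(unsigned long long));
    *hits = 0;
    mcPi<<<blocks, threads>>>(hits, samples, 12345u);
    cudaDeviceSynchronize();
    double total = (double)blocks * threads * samples;
    printf("pi ~= %f\n", 4.0 * (*hits) / total);
    cudaFree(hits);
    return 0;
}
```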
GPU-accelerated Tersoff potentials for massively parallel Molecular Dynamics simulations
Nguyen, Trung Dac
2017-03-01
The Tersoff potential is one of the empirical many-body potentials that has been widely used in simulation studies at atomic scales. Unlike pair-wise potentials, the Tersoff potential involves three-body terms, which require many more arithmetic operations and introduce more data dependency. In this contribution, we have implemented the GPU-accelerated version of several variants of the Tersoff potential for LAMMPS, an open-source massively parallel Molecular Dynamics code. Compared to the existing MPI implementation in LAMMPS, the GPU implementation exhibits better scalability and offers a speedup of 2.2X when run on 1000 compute nodes on the Titan supercomputer. On a single node, the speedup ranges from 2.0 to 8.0 times, depending on the number of atoms per GPU and hardware configurations. The most notable features of our GPU-accelerated version include its design for MPI/accelerator heterogeneous parallelism, its compatibility with other functionalities in LAMMPS, its ability to give deterministic results and to support both NVIDIA CUDA- and OpenCL-enabled accelerators. Our implementation is now part of the GPU package in LAMMPS and accessible for public use.
Computing OpenSURF on OpenCL and General Purpose GPU
Directory of Open Access Journals (Sweden)
Wanglong Yan
2013-10-01
Full Text Available The Speeded-Up Robust Feature (SURF) algorithm is widely used for image feature detection and matching in computer vision. Open Computing Language (OpenCL) is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors. This paper introduces how to implement an open-source SURF program, namely OpenSURF, on a general-purpose GPU with OpenCL, and discusses the optimizations in terms of thread architecture and memory models in detail. Our final OpenCL implementation of OpenSURF is on average 37% and 64% faster than the OpenCV SURF v2.4.5 CUDA implementation on NVidia's GTX660 and GTX460SE GPUs, respectively. Our OpenCL program achieved real-time performance (>25 frames per second) for almost all input images with sizes ranging from 320x240 to 1024x768 on NVidia's GTX660 GPU, NVidia's GTX460SE GPU and AMD's Radeon HD 6850 GPU. Our OpenCL approach on NVidia's GTX660 GPU is on average more than 22.8 times faster than the original CPU version on Intel's Dual-Core E5400 2.7 GHz.
Fast hybrid CPU- and GPU-based CT reconstruction algorithm using air skipping technique.
Lee, Byeonghun; Lee, Ho; Shin, Yeong Gil
2010-01-01
This paper presents a fast hybrid CPU- and GPU-based CT reconstruction algorithm that reduces the amount of back-projection work by using air skipping with polygon clipping. The algorithm easily and rapidly selects air areas, which have significantly higher contrast, in each projection image by applying a K-means clustering method on the CPU, and then generates boundary tables for verifying the valid region using the segmented air areas. Based on these boundary tables for each projection image, a clipped polygon that indicates the active region for the back-projection operation on the GPU is determined on each volume slice. This polygon clipping process makes it possible to back-project a smaller number of voxels, which leads to a faster GPU-based reconstruction method. This approach has been applied to a clinical data set and Shepp-Logan phantom data sets with various ratios of air region for quantitative and qualitative comparison and analysis of our method and conventional GPU-based reconstruction methods. The algorithm has been shown to reduce computational time by half without losing any diagnostic information, compared to conventional GPU-based approaches.
A Hybrid CPU/GPU Pattern-Matching Algorithm for Deep Packet Inspection.
Lee, Chun-Liang; Lin, Yi-Shan; Chen, Yaw-Chung
2015-01-01
The large quantities of data now being transferred via high-speed networks have made deep packet inspection indispensable for security purposes. Scalable and low-cost signature-based network intrusion detection systems have been developed for deep packet inspection for various software platforms. Traditional approaches that only involve central processing units (CPUs) are now considered inadequate in terms of inspection speed. Graphic processing units (GPUs) have superior parallel processing power, but transmission bottlenecks can reduce optimal GPU efficiency. In this paper we describe our proposal for a hybrid CPU/GPU pattern-matching algorithm (HPMA) that divides and distributes the packet-inspecting workload between a CPU and GPU. All packets are initially inspected by the CPU and filtered using a simple pre-filtering algorithm, and packets that might contain malicious content are sent to the GPU for further inspection. Test results indicate that in terms of random payload traffic, the matching speed of our proposed algorithm was 3.4 times and 2.7 times faster than those of the AC-CPU and AC-GPU algorithms, respectively. Further, HPMA achieved higher energy efficiency than the other tested algorithms.
Improving GPU-accelerated adaptive IDW interpolation algorithm using fast kNN search.
Mei, Gang; Xu, Nengxiong; Xu, Liangliang
2016-01-01
This paper presents an efficient parallel Adaptive Inverse Distance Weighting (AIDW) interpolation algorithm on a modern Graphics Processing Unit (GPU). The presented algorithm improves our previous GPU-accelerated AIDW algorithm by adopting a fast k-nearest neighbors (kNN) search. In AIDW, several nearest neighboring data points must be found for each interpolated point in order to adaptively determine the power parameter; the desired prediction value of the interpolated point is then obtained by weighted interpolation using that power parameter. In this work, we develop a fast kNN search approach based on a space-partitioning data structure, the even grid, to improve the previous GPU-accelerated AIDW algorithm. The improved algorithm is composed of the stages of kNN search and weighted interpolation. To evaluate the performance of the improved algorithm, we perform five groups of experimental tests. The experimental results indicate that: (1) the improved algorithm can achieve a speedup of up to 1017 over the corresponding serial algorithm; (2) the improved algorithm is at least two times faster than our previous GPU-accelerated AIDW algorithm; and (3) the use of fast kNN search can significantly improve the computational efficiency of the entire GPU-accelerated AIDW algorithm.
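For reference, the prediction that AIDW computes from the neighbours found by the kNN search can be written in the standard IDW form (a textbook statement, not quoted from the paper), with the power parameter $\alpha$ chosen adaptively from the local point density:

$\hat{z}(\mathbf{x}_0) = \dfrac{\sum_{i=1}^{k} d_i^{-\alpha} z_i}{\sum_{i=1}^{k} d_i^{-\alpha}}$

where $z_i$ are the values of the $k$ nearest data points and $d_i$ their distances to the interpolated point $\mathbf{x}_0$.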
AVIST: A GPU-Centric Design for Visual Exploration of Large Multidimensional Datasets
Directory of Open Access Journals (Sweden)
Peng Mi
2016-10-01
Full Text Available This paper presents the Animated VISualization Tool (AVIST), an exploration-oriented data visualization tool that enables rapidly exploring and filtering large time series multidimensional datasets. AVIST highlights interactive data exploration by revealing fine data details. This is achieved through the use of animation and cross-filtering interactions. To support interactive exploration of big data, AVIST features a GPU (Graphics Processing Unit)-centric design. Two key aspects are emphasized in the GPU-centric design: (1) both data management and computation are implemented on the GPU to leverage its parallel computing capability and fast memory bandwidth; (2) a GPU-based directed acyclic graph is proposed to characterize data transformations triggered by users’ demands. Moreover, we implement AVIST based on the Model-View-Controller (MVC) architecture. In the implementation, we consider two aspects: (1) user interaction is highlighted to slice big data into small data; and (2) data transformation is based on parallel computing. Two case studies demonstrate how AVIST can help analysts identify abnormal behaviors and infer new hypotheses by exploring big datasets. Finally, we summarize lessons learned about GPU-based solutions in interactive information visualization with big data.
Fast 3D elastic micro-seismic source location using new GPU features
Xue, Qingfeng; Wang, Yibo; Chang, Xu
2016-12-01
In this paper, we describe new GPU features and their applications in passive seismic (micro-seismic) location. Locating micro-seismic events is quite important in seismic exploration, especially when searching for unconventional oil and gas resources. Unlike traditional ray-based methods, wave-equation methods, such as the one used in this paper, have a remarkable advantage in adapting to low signal-to-noise-ratio conditions and do not require manual data selection. However, because of their conspicuous computational cost, these methods are not widely used in industry. To make the method practical, we implement an imaging-like wave-equation micro-seismic location in 3D elastic media and use the GPU to accelerate our algorithm. We also introduce some new GPU features into the implementation to solve the data transfer and GPU utilization problems. Numerical and field data experiments show that our method can achieve a more than 30% performance improvement in the GPU implementation just by using these new features.
The development of GPU-based parallel PRNG for Monte Carlo applications in CUDA Fortran
Kargaran, Hamed; Minuchehr, Abdolhamid; Zolfaghari, Ahmad
2016-04-01
The implementation of Monte Carlo simulation in CUDA Fortran requires fast random number generation with good statistical properties on the GPU. In this study, a GPU-based parallel pseudo random number generator (GPPRNG) has been proposed for use in high performance computing systems. According to the type of GPU memory usage, the GPU scheme is divided into two work modes, GLOBAL_MODE and SHARED_MODE. To generate parallel random numbers based on the independent sequence method, the combination of the middle-square method and a chaotic map, along with the Xorshift PRNG, has been employed. Implementation of our developed PPRNG on a single GPU showed speedups of 150x and 470x (with respect to the speed of the PRNG on a single CPU core) for GLOBAL_MODE and SHARED_MODE, respectively. To evaluate the accuracy of our developed GPPRNG, its performance was compared to that of some other commercially available PPRNGs such as those of MATLAB, FORTRAN and the Miller-Park algorithm through specific standard tests. The results of this comparison showed that the GPPRNG developed in this study can be used as a fast and accurate tool for computational science applications.
Computing OpenSURF on OpenCL and General Purpose GPU
Directory of Open Access Journals (Sweden)
Wanglong Yan
2013-10-01
Full Text Available The Speeded-Up Robust Feature (SURF) algorithm is widely used for image feature detection and matching in computer vision. Open Computing Language (OpenCL) is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors. This paper introduces how to implement an open-source SURF program, namely OpenSURF, on a general-purpose GPU with OpenCL, and discusses the optimizations in terms of thread architecture and memory models in detail. Our final OpenCL implementation of OpenSURF is on average 37% and 64% faster than the OpenCV SURF v2.4.5 CUDA implementation on NVidia's GTX660 and GTX460SE GPUs, respectively. Our OpenCL program achieved real-time performance (>25 frames per second) for almost all input images with sizes ranging from 320x240 to 1024x768 on NVidia's GTX660 GPU, NVidia's GTX460SE GPU and AMD's Radeon HD 6850 GPU. Our OpenCL approach on NVidia's GTX660 GPU is on average more than 22.8 times faster than the original CPU version on Intel's Dual-Core E5400 2.7 GHz.
Run-time Energy Management for Mobiles
Smit, Lodewijk T.; Smit, Gerard J.M.; Havinga, Paul J.M.
2000-01-01
Due to limited energy resources, mobile computing requires an energy-efficient architecture. The dynamic nature of a mobile environment demands an architecture that allows adapting to (quickly) changing conditions. The mobile has to adapt dynamically to new circumstances in the best suitable manner
On run-time exploitation of concurrency
Hölzenspies, Philip Kaj Ferdinand
2010-01-01
The `free' speed-up stemming from ever increasing processor speed is over. Performance increase in computer systems can now only be achieved through parallelism. One of the biggest challenges in computer science is how to map applications onto parallel computers. Concurrency, seen as the set of v
ROX: Run-time optimization of XQueries
Abdel Kader, R.; Boncz, P.A.; Manegold, S.; Keulen, M. van
2009-01-01
Optimization of complex XQueries combining many XPath steps and joins is currently hindered by the absence of good cardinality estimation and cost models for XQuery. Additionally, the state-of-the-art of even relational query optimization still struggles to cope with cost model estimation errors tha
On run-time exploitation of concurrency
Holzenspies, P.K.F.
2010-01-01
The `free' speed-up stemming from ever increasing processor speed is over. Performance increase in computer systems can now only be achieved through parallelism. One of the biggest challenges in computer science is how to map applications onto parallel computers. Concurrency, seen as the set of vali
A GPU Reaction Diffusion Soil-Microbial Model
Falconer, Ruth; Houston, Alasdair; Schmidt, Sonja; Otten, Wilfred
2014-05-01
Parallelised algorithms are frequent in bioinformatics as a consequence of its close link to informatics; however, in the fields of soil science and ecology they are less prevalent. A current challenge in soil ecology is to link habitat structure to microbial dynamics. Soil science is therefore entering the 'big data' paradigm as a consequence of integrating data on the physical soil environment, obtained via imaging, with theoretical models describing the growth and development of microbial dynamics, permitting accurate analyses of the spatio-temporal properties of different soil microenvironments. The microenvironment is often captured by 3D imaging (CT tomography), which yields large datasets; when these are used in computational studies, the physical sizes of the samples that are amenable to computation are less than 1 cm³. Today's commodity graphics cards are programmable and possess a data-parallel architecture that in many cases is capable of out-performing the CPU in terms of computational rates. The programmable aspect is exposed via low-level parallel programming languages and APIs (CUDA, OpenCL and DirectX). We ported a Soil-Microbial Model onto the GPU using the DirectX Compute API. We noted a significant computational speed-up as well as an increase in the physical size that can be simulated. Some drawbacks of such an approach concerned numerical precision and the steep learning curve associated with GPGPU technologies.
GPU implementations of online track finding algorithms at PANDA
Energy Technology Data Exchange (ETDEWEB)
Herten, Andreas; Stockmanns, Tobias; Ritman, James [Institut fuer Kernphysik, Forschungszentrum Juelich GmbH (Germany); Adinetz, Andrew; Pleiter, Dirk [Juelich Supercomputing Centre, Forschungszentrum Juelich GmbH (Germany); Kraus, Jiri [NVIDIA GmbH (Germany); Collaboration: PANDA-Collaboration
2014-07-01
The PANDA experiment is a hadron physics experiment that will investigate antiproton annihilation in the charm quark mass region. The experiment is now being constructed as one of the main parts of the FAIR facility. At an event rate of $2 \times 10^{7}$/s, a data rate of 200 GB/s is expected. A reduction of three orders of magnitude is required in order to save the data for further offline analysis. Since signal and background processes at PANDA have similar signatures, no hardware-level trigger is foreseen for the experiment. Instead, a fast online event filter substitutes for this element. We investigate the possibility of using graphics processing units (GPUs) for the online tracking part of this task. The algorithms investigated are a Hough transform, a track finder involving Riemann surfaces, and the novel, PANDA-specific Triplet Finder. This talk shows selected advances in the implementations as well as performance evaluations of the GPU tracking algorithms to be used at the PANDA experiment.
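As a rough illustration of the voting structure behind a GPU Hough transform, here is a toy straight-line version in CUDA with one thread per hit; all names are hypothetical, and the actual PANDA algorithms, including the Riemann and Triplet finders, are substantially more elaborate.

__global__ void hough_vote(const float2 *hits, int nhits, unsigned int *acc,
                           int ntheta, int nrho, float rho_max) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nhits) return;
    float2 p = hits[i];
    for (int t = 0; t < ntheta; ++t) {
        float theta = t * 3.14159265f / ntheta;
        float rho = p.x * cosf(theta) + p.y * sinf(theta);         // signed distance of the candidate line
        int r = (int)((rho + rho_max) / (2.0f * rho_max) * nrho);  // rho bin index
        if (r >= 0 && r < nrho)
            atomicAdd(&acc[t * nrho + r], 1u);                     // one vote per (theta, rho) cell
    }
}

Peaks in the accumulator then correspond to track candidates, which subsequent kernels would extract and fit.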
FARGO3D: A NEW GPU-ORIENTED MHD CODE
Energy Technology Data Exchange (ETDEWEB)
Benitez-Llambay, Pablo [Instituto de Astronomía Teórica y Experimental, Observatorio Astronónomico, Universidad Nacional de Córdoba. Laprida 854, X5000BGR, Córdoba (Argentina); Masset, Frédéric S., E-mail: pbllambay@oac.unc.edu.ar, E-mail: masset@icf.unam.mx [Instituto de Ciencias Físicas, Universidad Nacional Autónoma de México (UNAM), Apdo. Postal 48-3,62251-Cuernavaca, Morelos (Mexico)
2016-03-15
We present the FARGO3D code, recently publicly released. It is a magnetohydrodynamics code developed with special emphasis on the physics of protoplanetary disks and planet–disk interactions, and parallelized with MPI. The hydrodynamics algorithms are based on finite-difference upwind, dimensionally split methods. The magnetohydrodynamics algorithms consist of the constrained transport method to preserve the divergence-free property of the magnetic field to machine accuracy, coupled to a method of characteristics for the evaluation of electromotive forces and Lorentz forces. Orbital advection is implemented, and an N-body solver is included to simulate planets or stars interacting with the gas. We present our implementation in detail and present a number of widely known tests for comparison purposes. One strength of FARGO3D is that it can run on either graphical processing units (GPUs) or central processing units (CPUs), achieving large speed-up with respect to CPU cores. We describe our implementation choices, which allow a user with no prior knowledge of GPU programming to develop new routines for CPUs, and have them translated automatically for GPUs.
GPU Acceleration of Particle-In-Cell Methods
Cowan, Benjamin; Cary, John; Sides, Scott
2016-10-01
Graphics processing units (GPUs) have become key components in many supercomputing systems, as they can provide more computations relative to their cost and power consumption than conventional processors. However, to take full advantage of this capability, they require a strict programming model which involves single-instruction multiple-data execution as well as significant constraints on memory accesses. To bring the full power of GPUs to bear on plasma physics problems, we must adapt the computational methods to this new programming model. We have developed a GPU implementation of the particle-in-cell (PIC) method, one of the mainstays of plasma physics simulation. This framework is highly general and enables advanced PIC features such as high order particles and absorbing boundary conditions. The main elements of the PIC loop, including field interpolation and particle deposition, are designed to optimize memory access. We describe the performance of these algorithms and discuss some of the methods used. Work supported by DARPA Contract No. W31P4Q-16-C-0009.
Optimizing memory-bound SYMV kernel on GPU hardware accelerators
Abdelfattah, Ahmad M.
2013-01-01
Hardware accelerators are becoming ubiquitous in high performance scientific computing. They are capable of delivering an unprecedented level of concurrent execution contexts. High-level programming language extensions (e.g., CUDA) and profiling tools (e.g., PAPI-CUDA, CUDA Profiler) are paramount for improving productivity while effectively exploiting the underlying hardware. We present an optimized numerical kernel for computing the symmetric matrix-vector product on nVidia Fermi GPUs. Due to its inherent memory-bound nature, this kernel is very critical in the tridiagonalization of a symmetric dense matrix, which is a preprocessing step to calculate the eigenpairs. Using a novel design to address the irregular memory accesses by hiding latency and increasing bandwidth, our preliminary asymptotic results show 3.5x and 2.5x speedups over the similar CUBLAS 4.0 kernel, and 7-8% and 30% improvements over the Matrix Algebra on GPU and Multicore Architectures (MAGMA) library, in single and double precision arithmetic, respectively. © 2013 Springer-Verlag.
High Speed 3D Tomography on CPU, GPU, and FPGA
Directory of Open Access Journals (Sweden)
GAC Nicolas
2008-01-01
Full Text Available Back-projection (BP) is a costly computational step in tomography image reconstruction such as positron emission tomography (PET). To reduce the computation time, this paper presents a pipelined, prefetch, and parallelized architecture for PET BP (3PA-PET). The key feature of this architecture is its original memory access strategy, masking the high latency of the external memory. Indeed, the pattern of the memory references to the data acquired hinders the processing unit. The memory access bottleneck is overcome by an efficient use of the intrinsic temporal and spatial locality of the BP algorithm. A loop reordering allows an efficient use of general purpose processor's caches, for software implementation, as well as the 3D predictive and adaptive cache (3D-AP) cache, when considering hardware implementations. Parallel hardware pipelines are also efficient thanks to a hierarchical 3D-AP cache: each pipeline performs a memory reference in about one clock cycle to reach a computational throughput close to 100%. The 3PA-PET architecture is prototyped on a system on programmable chip (SoPC) to validate the system and to measure its expected performances. Time performances are compared with a desktop PC, a workstation, and a graphic processor unit (GPU).
High Speed 3D Tomography on CPU, GPU, and FPGA
Directory of Open Access Journals (Sweden)
Dominique Houzet
2009-02-01
Full Text Available Back-projection (BP) is a costly computational step in tomography image reconstruction such as positron emission tomography (PET). To reduce the computation time, this paper presents a pipelined, prefetch, and parallelized architecture for PET BP (3PA-PET). The key feature of this architecture is its original memory access strategy, masking the high latency of the external memory. Indeed, the pattern of the memory references to the data acquired hinders the processing unit. The memory access bottleneck is overcome by an efficient use of the intrinsic temporal and spatial locality of the BP algorithm. A loop reordering allows an efficient use of general purpose processor's caches, for software implementation, as well as the 3D predictive and adaptive cache (3D-AP) cache, when considering hardware implementations. Parallel hardware pipelines are also efficient thanks to a hierarchical 3D-AP cache: each pipeline performs a memory reference in about one clock cycle to reach a computational throughput close to 100%. The 3PA-PET architecture is prototyped on a system on programmable chip (SoPC) to validate the system and to measure its expected performances. Time performances are compared with a desktop PC, a workstation, and a graphic processor unit (GPU).
Modern Methods of Bundle Adjustment on the Gpu
Hänsch, R.; Drude, I.; Hellwich, O.
2016-06-01
The task of computing 3D reconstructions from large amounts of data has become an active field of research in recent years. Based on an initial estimate provided by structure from motion, bundle adjustment seeks to find a solution that is optimal for all cameras and 3D points. The corresponding nonlinear optimization problem is usually solved by the Levenberg-Marquardt algorithm combined with conjugate gradient descent. While many adaptations and extensions to the classical bundle adjustment approach have been proposed, only a few works consider the acceleration potential of GPU systems. This paper elaborates the possibilities of time and space savings when fitting the implementation strategy to the terms and requirements of realizing a bundler on heterogeneous CPU-GPU systems. Instead of focusing on the standard approach of Levenberg-Marquardt optimization alone, nonlinear conjugate gradient descent and alternating resection-intersection are studied as two alternatives. The experiments show that in particular alternating resection-intersection reaches low error rates very fast, but converges to larger error rates than Levenberg-Marquardt. PBA, as one of the current state-of-the-art bundlers, converges slower in 50% of the test cases and needs 1.5-2 times more memory than the Levenberg-Marquardt implementation.
GPU acceleration of particle-in-cell methods
Cowan, Benjamin; Cary, John; Meiser, Dominic
2015-11-01
Graphics processing units (GPUs) have become key components in many supercomputing systems, as they can provide more computations relative to their cost and power consumption than conventional processors. However, to take full advantage of this capability, they require a strict programming model which involves single-instruction multiple-data execution as well as significant constraints on memory accesses. To bring the full power of GPUs to bear on plasma physics problems, we must adapt the computational methods to this new programming model. We have developed a GPU implementation of the particle-in-cell (PIC) method, one of the mainstays of plasma physics simulation. This framework is highly general and enables advanced PIC features such as high order particles and absorbing boundary conditions. The main elements of the PIC loop, including field interpolation and particle deposition, are designed to optimize memory access. We describe the performance of these algorithms and discuss some of the methods used. Work supported by DARPA contract W31P4Q-15-C-0061 (SBIR).
GPU MrBayes V3.1: MrBayes on Graphics Processing Units for Protein Sequence Data.
Pang, Shuai; Stones, Rebecca J; Ren, Ming-Ming; Liu, Xiao-Guang; Wang, Gang; Xia, Hong-ju; Wu, Hao-Yang; Liu, Yang; Xie, Qiang
2015-09-01
We present a modified GPU (graphics processing unit) version of MrBayes, called ta(MC)³ (GPU MrBayes V3.1), for Bayesian phylogenetic inference on protein data sets. Our main contributions are 1) utilizing 64-bit variables, thereby enabling ta(MC)³ to process larger data sets than MrBayes; and 2) using Kahan summation to improve accuracy, convergence rates, and consequently runtime. Versus the current fastest software, we achieve a speedup of up to around 2.5 (and up to around 90 vs. serial MrBayes), and more on multi-GPU hardware. GPU MrBayes V3.1 is available from http://sourceforge.net/projects/mrbayes-gpu/.
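Kahan (compensated) summation, which the abstract credits for the accuracy and convergence improvements, is the textbook algorithm sketched below; this is generic code, not an excerpt from GPU MrBayes.

__host__ __device__ inline double kahan_sum(const double *x, int n) {
    double sum = 0.0;
    double c = 0.0;              // running compensation for lost low-order bits
    for (int i = 0; i < n; ++i) {
        double y = x[i] - c;     // subtract the error carried over from the previous step
        double t = sum + y;      // low-order bits of y may be lost in this addition
        c = (t - sum) - y;       // recover exactly what was lost
        sum = t;
    }
    return sum;
}

In likelihood calculations, the same compensation would typically be applied inside each accumulation loop, which is where the reported accuracy gains arise.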
Berczik, P; Wang, L; Zhong, S; Veles, O; Zinchenko, I; Huang, S; Tsai, M; Kennedy, G; Li, S; Naso, L; Li, C
2013-01-01
We present direct astrophysical N-body simulations with up to a few million bodies using our parallel MPI/CUDA code on large GPU clusters in China, Ukraine and Germany, with different kinds of GPU hardware. These clusters are directly linked under the Chinese Academy of Sciences special GPU cluster program in cooperation with ICCS (International Center for Computational Science). We reach about half the peak Kepler K20 GPU performance for our phi-GPU code [2], in a real application scenario with individual hierarchical block time-steps, the high-order (4th, 6th and 8th) Hermite integration schemes, and a real core-halo density structure of the modeled stellar systems. The code and hardware are mainly used to simulate star clusters [23, 24] and galactic nuclei with supermassive black holes [20], in which correlations between distant particles cannot be neglected.
High performance technique for database applicationsusing a hybrid GPU/CPU platform
Zidan, Mohammed A.
2012-07-28
Many database applications, such as sequence comparison, sequence searching, and sequence matching, process large database sequences. We introduce a novel and efficient technique to improve the performance of database applications by using a hybrid GPU/CPU platform. In particular, our technique solves the problem of the low efficiency resulting from running short-length sequences in a database on a GPU. To verify our technique, we applied it to the widely used Smith-Waterman algorithm. The experimental results show that our hybrid GPU/CPU technique improves the average performance by a factor of 2.2, and improves the peak performance by a factor of 2.8 when compared to earlier implementations. Copyright © 2011 by ASME.
How to obtain efficient GPU kernels: an illustration using FMM & FGT algorithms
Cruz, Felipe A; Barba, Lorena A
2010-01-01
Computing on graphics processors is maybe one of the most important developments in computational science to happen in decades. Not since the arrival of the Beowulf cluster, which combined open source software with commodity hardware to truly democratize high-performance computing, has the community been so electrified. Like then, the opportunity comes with challenges. The formulation of scientific algorithms to take advantage of the performance offered by the new architecture requires rethinking core methods. Here, we have tackled fast summation algorithms (fast multipole method and fast Gauss transform), and applied algorithmic redesign for attaining performance on gpus. The progression of performance improvements attained illustrates the exercise of formulating algorithms for the massively parallel architecture of the gpu. The end result has been gpu kernels that run at over 500 Gigaflops on one nvidia Tesla C1060 card, thereby reaching close to practical peak. We can confidently say that gpu computing is ...
BioEM: GPU-accelerated computing of Bayesian inference of electron microscopy images
Cossio, Pilar; Baruffa, Fabio; Rampp, Markus; Lindenstruth, Volker; Hummer, Gerhard
2016-01-01
In cryo-electron microscopy (EM), molecular structures are determined from large numbers of projection images of individual particles. To harness the full power of this single-molecule information, we use the Bayesian inference of EM (BioEM) formalism. By ranking structural models using posterior probabilities calculated for individual images, BioEM in principle addresses the challenge of working with highly dynamic or heterogeneous systems not easily handled in traditional EM reconstruction. However, the calculation of these posteriors for large numbers of particles and models is computationally demanding. Here we present highly parallelized, GPU-accelerated computer software that performs this task efficiently. Our flexible formulation employs CUDA, OpenMP, and MPI parallelization combined with both CPU and GPU computing. The resulting BioEM software scales nearly ideally both on pure CPU and on CPU+GPU architectures, thus enabling Bayesian analysis of tens of thousands of images in a reasonable time. The g...
GPU-based fast Monte Carlo simulation for radiotherapy dose calculation
Jia, Xun; Graves, Yan Jiang; Folkerts, Michael; Jiang, Steve B
2011-01-01
Monte Carlo (MC) simulation is commonly considered to be the most accurate dose calculation method in radiotherapy. However, its efficiency still requires improvement for many routine clinical applications. In this paper, we present our recent progress towards the development of a GPU-based MC dose calculation package, gDPM v2.0. It utilizes the parallel computation ability of a GPU to achieve high efficiency, while maintaining the same particle transport physics as in the original DPM code and hence the same level of simulation accuracy. In GPU computing, divergence of execution paths between threads can considerably reduce the efficiency. Since photons and electrons undergo different physics and hence attain different execution paths, we use a simulation scheme where photon transport and electron transport are separated to partially relieve the thread divergence issue. A high performance random number generator and hardware linear interpolation are also utilized. We have also developed various components to hand...
A GPU based real-time software correlation system for the Murchison Widefield Array prototype
Wayth, Randall B; Briggs, Frank H
2009-01-01
Modern graphics processing units (GPUs) are inexpensive commodity hardware that offer Tflop/s theoretical computing capacity. GPUs are well suited to many compute-intensive tasks including digital signal processing. We describe the implementation and performance of a GPU-based digital correlator for radio astronomy. The correlator is implemented using the NVIDIA CUDA development environment. We evaluate three design options on two generations of NVIDIA hardware. The different designs utilize the internal registers, shared memory and multiprocessors in different ways. We find that optimal performance is achieved with the design that minimizes global memory reads on recent generations of hardware. The GPU-based correlator outperforms a single-threaded CPU equivalent by a factor of 60 for a 32 antenna array, and runs on commodity PC hardware. The extra compute capability provided by the GPU maximises the correlation capability of a PC while retaining the fast development time associated with using standard hardw...
GPU-based single-cluster algorithm for the simulation of the Ising model
Komura, Yukihiro; Okabe, Yutaka
2012-02-01
We present the GPU calculation with the common unified device architecture (CUDA) for the Wolff single-cluster algorithm of the Ising model. Proposing an algorithm for a quasi-block synchronization, we realize the Wolff single-cluster Monte Carlo simulation with CUDA. We perform parallel computations for the newly added spins in the growing cluster. As a result, the GPU calculation speed for the two-dimensional Ising model at the critical temperature with the linear size L = 4096 is 5.60 times as fast as the calculation speed on a current CPU core. For the three-dimensional Ising model with the linear size L = 256, the GPU calculation speed is 7.90 times as fast as the CPU calculation speed. The idea of quasi-block synchronization can be used not only in the cluster algorithm but also in many fields where the synchronization of all threads is required.
GPU-based single-cluster algorithm for the simulation of the Ising model
Komura, Yukihiro
2011-01-01
We present the GPU calculation with the common unified device architecture (CUDA) for the Wolff single-cluster algorithm of the Ising model. Proposing an algorithm for a quasi-block synchronization, we realize the Wolff single-cluster Monte Carlo simulation with CUDA. We perform parallel computations for the newly added spins in the growing cluster. As a result, the GPU calculation speed for the two-dimensional Ising model at the critical temperature with the linear size L = 4096 is 5.60 times as fast as the calculation speed on a current CPU core. For the three-dimensional Ising model with the linear size L = 256, the GPU calculation speed is 7.90 times as fast as the CPU calculation speed. The idea of quasi-block synchronization can be used not only in the cluster algorithm but also in many fields where the synchronization of all threads is required.
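For context, the Wolff single-cluster move referred to above grows a cluster by adding an aligned neighbouring spin with the standard bond-activation probability (a textbook result, not specific to the GPU implementation):

$p_{\mathrm{add}} = 1 - e^{-2\beta J}$

where $\beta$ is the inverse temperature and $J$ the ferromagnetic coupling; the whole cluster is then flipped at once. The quasi-block synchronization proposed in the paper addresses the thread synchronization needed while the cluster grows in parallel.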
Design Patterns for Sparse-Matrix Computations on Hybrid CPU/GPU Platforms
Directory of Open Access Journals (Sweden)
Valeria Cardellini
2014-01-01
Full Text Available We apply object-oriented software design patterns to develop code for scientific software involving sparse matrices. Design patterns arise when multiple independent developments produce similar designs which converge onto a generic solution. We demonstrate how to use design patterns to implement an interface for sparse matrix computations on NVIDIA GPUs starting from PSBLAS, an existing sparse matrix library, and from existing sets of GPU kernels for sparse matrices. We also compare the throughput of the PSBLAS sparse matrix–vector multiplication on two platforms exploiting the GPU with that obtained by a CPU-only PSBLAS implementation. Our experiments exhibit encouraging results regarding the comparison between CPU and GPU executions in double precision, obtaining a speedup of up to 35.35 on NVIDIA GTX 285 with respect to AMD Athlon 7750, and up to 10.15 on NVIDIA Tesla C2050 with respect to Intel Xeon X5650.
Energy Technology Data Exchange (ETDEWEB)
Gallarno, George [Christian Brothers University; Rogers, James H [ORNL; Maxwell, Don E [ORNL
2015-01-01
The high computational capability of graphics processing units (GPUs) is enabling and driving the scientific discovery process at large scale. The world's second fastest supercomputer for open science, Titan, has more than 18,000 GPUs that computational scientists use to perform scientific simulations and data analysis. Understanding of GPU reliability characteristics, however, is still in its nascent stage since GPUs have only recently been deployed at large scale. This paper presents a detailed study of GPU errors and their impact on system operations and applications, describing experiences with the 18,688 GPUs on the Titan supercomputer as well as lessons learned in the process of efficient operation of GPUs at scale. These experiences are helpful to HPC sites which already have large-scale GPU clusters or plan to deploy GPUs in the future.
GPU peer-to-peer techniques applied to a cluster interconnect
Ammendola, Roberto; Biagioni, Andrea; Bisson, Mauro; Fatica, Massimiliano; Frezza, Ottorino; Cicero, Francesca Lo; Lonardo, Alessandro; Mastrostefano, Enrico; Paolucci, Pier Stanislao; Rossetti, Davide; Simula, Francesco; Tosoratto, Laura; Vicini, Piero
2013-01-01
Modern GPUs support special protocols to exchange data directly across the PCI Express bus. While these protocols could be used to reduce GPU data transmission times, basically by avoiding staging to host memory, they require specific hardware features which are not available on current generation network adapters. In this paper we describe the architectural modifications required to implement peer-to-peer access to NVIDIA Fermi- and Kepler-class GPUs on an FPGA-based cluster interconnect. Besides, the current software implementation, which integrates this feature by minimally extending the RDMA programming model, is discussed, as well as some issues raised while employing it in a higher level API like MPI. Finally, the current limits of the technique are studied by analyzing the performance improvements on low-level benchmarks and on two GPU-accelerated applications, showing when and how they seem to benefit from the GPU peer-to-peer method.
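The standard CUDA runtime calls behind the peer-to-peer access discussed above are sketched below, with error handling omitted; this shows only the generic API, not the FPGA-interconnect extension developed in the paper.

#include <cuda_runtime.h>
#include <cstddef>

// Enable mutual peer access between two GPUs and perform a direct device-to-device
// copy, avoiding a staging buffer in host memory.
void p2p_copy(int dev_src, int dev_dst, const void *src, void *dst, size_t bytes) {
    int src_to_dst = 0, dst_to_src = 0;
    cudaDeviceCanAccessPeer(&src_to_dst, dev_src, dev_dst);
    cudaDeviceCanAccessPeer(&dst_to_src, dev_dst, dev_src);
    if (src_to_dst && dst_to_src) {
        cudaSetDevice(dev_src);
        cudaDeviceEnablePeerAccess(dev_dst, 0);
        cudaSetDevice(dev_dst);
        cudaDeviceEnablePeerAccess(dev_src, 0);
        cudaMemcpyPeer(dst, dev_dst, src, dev_src, bytes);  // direct GPU-to-GPU transfer over PCIe
    }
}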
Massively Parallel Computation of Soil Surface Roughness Parameters on A Fermi GPU
Li, Xiaojie; Song, Changhe
2016-06-01
Surface roughness describes the randomness or irregularity of the surface microtopography. The standard deviation of surface height and the surface correlation length describe the statistical variation of the random component of surface height relative to a reference surface. When the number of data points is large, calculation of surface roughness parameters is time-consuming. With the advent of Graphics Processing Unit (GPU) architectures, inherently parallel problems can be solved effectively using GPUs. In this paper we propose a GPU-based massively parallel computing method for 2D bare soil surface roughness estimation. This method was applied to data collected by a surface roughness tester based on the laser triangulation principle during a field experiment in April 2012. The total number of data points was 52,040. The computation took 47 seconds on a Fermi GTX 590 GPU whereas the serial CPU version took 5422 seconds, a significant 115x speedup.
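The two parameters named in the abstract are commonly defined as follows (standard definitions, added here for clarity): the RMS height

$s = \sqrt{\dfrac{1}{N-1}\sum_{i=1}^{N}\left(z_i - \bar{z}\right)^2}$

and the correlation length, usually taken as the lag at which the normalized autocorrelation of the surface heights drops to $1/e$. Both reduce to sums over all $N$ height samples, which is why they map naturally onto data-parallel GPU reductions.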
4D MR phase and magnitude segmentations with GPU parallel computing.
Bergen, Robert V; Lin, Hung-Yu; Alexander, Murray E; Bidinosti, Christopher P
2015-01-01
The increasing size and number of data sets of large four-dimensional (three spatial, one temporal) magnetic resonance (MR) cardiac images necessitate efficient segmentation algorithms. Analysis of phase-contrast MR images yields cardiac flow information which can be manipulated to produce accurate segmentations of the aorta. Phase contrast segmentation algorithms are proposed that use simple mean-based calculations and least mean squared curve fitting techniques. The initial segmentations are generated on a multi-threaded central processing unit (CPU) in 10 seconds or less, though the computational simplicity of the algorithms results in a loss of accuracy. A more complex graphics processing unit (GPU)-based algorithm fits flow data to Gaussian waveforms, and produces an initial segmentation in 0.5 seconds. Level sets are then applied to a magnitude image, where the initial conditions are given by the previous CPU and GPU algorithms. A comparison of results shows that the GPU algorithm appears to produce the most accurate segmentation.
GPU-based Scalable Volumetric Reconstruction for Multi-view Stereo
Energy Technology Data Exchange (ETDEWEB)
Kim, H; Duchaineau, M; Max, N
2011-09-21
We present a new scalable volumetric reconstruction algorithm for multi-view stereo using a graphics processing unit (GPU). It is an effectively parallelized GPU algorithm that simultaneously uses a large number of GPU threads, each of which performs voxel carving, in order to integrate depth maps with images from multiple views. Each depth map, triangulated from pair-wise semi-dense correspondences, represents a view-dependent surface of the scene. This algorithm also provides scalability for large-scale scene reconstruction in a high resolution voxel grid by utilizing streaming and parallel computation. The output is a photo-realistic 3D scene model in a volumetric or point-based representation. We demonstrate the effectiveness and the speed of our algorithm with a synthetic scene and real urban/outdoor scenes. Our method can also be integrated with existing multi-view stereo algorithms such as PMVS2 to fill holes or gaps in textureless regions.
Accelerating Large Scale Image Analyses on Parallel, CPU-GPU Equipped Systems.
Teodoro, George; Kurc, Tahsin M; Pan, Tony; Cooper, Lee A D; Kong, Jun; Widener, Patrick; Saltz, Joel H
2012-05-01
The past decade has witnessed a major paradigm shift in high performance computing with the introduction of accelerators as general purpose processors. These computing devices make available very high parallel computing power at low cost and power consumption, transforming current high performance platforms into heterogeneous CPU-GPU equipped systems. Although the theoretical performance achieved by these hybrid systems is impressive, taking practical advantage of this computing power remains a very challenging problem. Most applications are still deployed to either GPU or CPU, leaving the other resource under- or un-utilized. In this paper, we propose, implement, and evaluate a performance aware scheduling technique along with optimizations to make efficient collaborative use of CPUs and GPUs on a parallel system. In the context of feature computations in large scale image analysis applications, our evaluations show that intelligently co-scheduling CPUs and GPUs can significantly improve performance over GPU-only or multi-core CPU-only approaches.
GPU-accelerated Red Blood Cells Simulations with Transport Dissipative Particle Dynamics
Blumers, Ansel L; Li, Zhen; Li, Xuejin; Karniadakis, George E
2016-01-01
Mesoscopic numerical simulations provide a unique approach for the quantification of the chemical influences on red blood cell functionalities. The transport Dissipative Particle Dynamics (tDPD) method can lead to such effective multiscale simulations due to its ability to simultaneously capture mesoscopic advection, diffusion, and reaction. In this paper, we present a GPU-accelerated red blood cell simulation package based on a tDPD adaptation of our red blood cell model, which can correctly recover the cell membrane viscosity, elasticity, bending stiffness, and cross-membrane chemical transport. The package essentially processes all computational workloads in parallel on the GPU, and it incorporates multi-stream scheduling and non-blocking MPI communications to improve inter-node scalability. Our code is validated for accuracy and compared against the CPU counterpart for speed. Strong scaling and weak scaling are also presented to characterize scalability. We observe a speedup of 10.1 on one GPU over all 16 c...
FastGCN: a GPU accelerated tool for fast gene co-expression networks.
Directory of Open Access Journals (Sweden)
Meimei Liang
Full Text Available Gene co-expression networks comprise one type of valuable biological networks. Many methods and tools have been published to construct gene co-expression networks; however, most of these tools and methods are inconvenient and time consuming for large datasets. We have developed a user-friendly, accelerated and optimized tool for constructing gene co-expression networks that can fully harness the parallel nature of GPU (Graphic Processing Unit) architectures. Genetic entropies were exploited to filter out genes with no or small expression changes in the raw data preprocessing step. Pearson correlation coefficients were then calculated. After that, we normalized these coefficients and employed the False Discovery Rate to control the multiple tests. At last, modules identification was conducted to construct the co-expression networks. All of these calculations were implemented on a GPU. We also compressed the coefficient matrix to save space. We compared the performance of the GPU implementation with those of multi-core CPU implementations with 16 CPU threads, single-thread C/C++ implementation and single-thread R implementation. Our results show that GPU implementation largely outperforms single-thread C/C++ implementation and single-thread R implementation, and GPU implementation outperforms multi-core CPU implementation when the number of genes increases. With the test dataset containing 16,000 genes and 590 individuals, we can achieve greater than 63 times the speed using a GPU implementation compared with a single-thread R implementation when 50 percent of genes were filtered out and about 80 times the speed when no genes were filtered out.
Implementation and optimization of ultrasound signal processing algorithms on mobile GPU
Kong, Woo Kyu; Lee, Wooyoul; Kim, Kyu Cheol; Yoo, Yangmo; Song, Tai-Kyong
2014-03-01
A general-purpose graphics processing unit (GPGPU) has been used to improve computing power in medical ultrasound imaging systems. Recently, mobile GPUs have become powerful enough to handle 3D games and videos at high frame rates on Full HD or HD resolution displays. This paper proposes a method to implement ultrasound signal processing on a mobile GPU available in a high-end smartphone (Galaxy S4, Samsung Electronics, Seoul, Korea) with programmable shaders on the OpenGL ES 2.0 platform. To maximize the performance of the mobile GPU, the shader design was optimized and the load was shared between the vertex and fragment shaders. The beamformed data were captured from a tissue-mimicking phantom (Model 539 Multipurpose Phantom, ATS Laboratories, Inc., Bridgeport, CT, USA) using a commercial ultrasound imaging system equipped with a research package (Ultrasonix Touch, Ultrasonix, Richmond, BC, Canada). The real-time performance is evaluated by frame rates while varying the range of signal processing blocks. The implementation of ultrasound signal processing on OpenGL ES 2.0 was verified by analyzing the PSNR against a MATLAB gold standard with the same signal path. CNR was also analyzed to verify the method. From the evaluations, the proposed mobile GPU-based processing method shows no significant difference from the MATLAB processing (i.e., PSNR<52.51 dB). Comparable CNR results were obtained from both processing methods (i.e., 11.31). With the mobile GPU implementation, a frame rate of 57.6 Hz was achieved. The total execution time was 17.4 ms, which was shorter than the acquisition time (i.e., 34.4 ms). These results indicate that the mobile GPU-based processing method can support real-time ultrasound B-mode processing on a smartphone.
FastGCN: a GPU accelerated tool for fast gene co-expression networks.
Liang, Meimei; Zhang, Futao; Jin, Gulei; Zhu, Jun
2015-01-01
Gene co-expression networks comprise one type of valuable biological networks. Many methods and tools have been published to construct gene co-expression networks; however, most of these tools and methods are inconvenient and time consuming for large datasets. We have developed a user-friendly, accelerated and optimized tool for constructing gene co-expression networks that can fully harness the parallel nature of GPU (Graphic Processing Unit) architectures. Genetic entropies were exploited to filter out genes with no or small expression changes in the raw data preprocessing step. Pearson correlation coefficients were then calculated. After that, we normalized these coefficients and employed the False Discovery Rate to control the multiple tests. At last, modules identification was conducted to construct the co-expression networks. All of these calculations were implemented on a GPU. We also compressed the coefficient matrix to save space. We compared the performance of the GPU implementation with those of multi-core CPU implementations with 16 CPU threads, single-thread C/C++ implementation and single-thread R implementation. Our results show that GPU implementation largely outperforms single-thread C/C++ implementation and single-thread R implementation, and GPU implementation outperforms multi-core CPU implementation when the number of genes increases. With the test dataset containing 16,000 genes and 590 individuals, we can achieve greater than 63 times the speed using a GPU implementation compared with a single-thread R implementation when 50 percent of genes were filtered out and about 80 times the speed when no genes were filtered out.
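The per-gene-pair quantity computed on the GPU is the usual Pearson correlation coefficient (standard formula, not reproduced from the paper):

$r_{xy} = \dfrac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^{2}}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^{2}}}$

evaluated for every pair of genes across the $n$ individuals, which makes the workload embarrassingly parallel and a natural fit for the GPU.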
GPU-accelerated 3D neutron diffusion code based on finite difference method
Energy Technology Data Exchange (ETDEWEB)
Xu, Q.; Yu, G.; Wang, K. [Dept. of Engineering Physics, Tsinghua Univ. (China)
2012-07-01
The finite difference method, a traditional numerical solution to the neutron diffusion equation, is considered simpler and more precise than coarse-mesh nodal methods, but its wide application is hindered by the huge memory and long computation time it requires. In recent years, the concept of general-purpose computation on GPUs has provided a powerful computational engine for scientific research. In this study, a GPU-accelerated multi-group 3D neutron diffusion code based on the finite difference method was developed. First, a clean-sheet neutron diffusion code (3DFD-CPU) was written in C++ on the CPU architecture, and later ported to GPUs under NVIDIA's CUDA platform (3DFD-GPU). The IAEA 3D PWR benchmark problem was calculated in the numerical test, where three different codes, including the original CPU-based sequential code, the HYPRE (High Performance Preconditioners)-based diffusion code and CITATION, were used as reference points to test the efficiency and accuracy of the GPU-based program. The results demonstrate both high efficiency and adequate accuracy of the GPU implementation for the neutron diffusion equation. A speedup factor of about 46 was obtained using an NVIDIA GeForce GTX470 GPU card against a 2.50 GHz Intel Quad Q9300 CPU processor. Compared with the HYPRE-based code running in parallel on an 8-core tower server, a speedup of about 2 could still be observed. More encouragingly, without any mathematical acceleration technology, the GPU implementation ran about 5 times faster than CITATION, which was sped up by using the SOR method and the Chebyshev extrapolation technique. (authors)
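For orientation, the equation being discretized is the multigroup neutron diffusion equation; in its simplest one-group, one-dimensional steady-state form on a uniform mesh of spacing $h$, the central-difference approximation used by codes of this kind reads (a generic statement, not taken from the paper):

$-D\,\dfrac{\phi_{i+1} - 2\phi_i + \phi_{i-1}}{h^{2}} + \Sigma_a \phi_i = \dfrac{1}{k_{\mathrm{eff}}}\,\nu\Sigma_f\,\phi_i$

so each mesh cell couples only to its nearest neighbours, producing the large, sparse, regularly structured linear systems that map well onto GPU solvers.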
Implementation of a GPU accelerated total focusing reconstruction method within CIVA software
Rougeron, Gilles; Lambert, Jason; Iakovleva, Ekaterina; Lacassagne, Lionel; Dominguez, Nicolas
2014-02-01
This paper presents results of a TFM implementation for Full Matrix Capture acquisitions in CIVA, proposed as a post-processing tool for accurate analysis. This implementation has been made on GPU architecture with OpenCL to minimize the processing time and offer computational device flexibility (GPU/CPU). Examples on immersion configurations on isotropic 2D CAD specimen with planar extrusion are proposed to illustrate the performances. Reconstructions on 2D or 3D areas of direct echoes with mode conversion are allowed. Probe scanning can also be taken into account. Reconstruction results and a benchmark explaining the speedup are presented. Further improvements are also reviewed.
Real-time high definition H.264 video decode using the Xbox 360 GPU
Arevalo Baeza, Juan Carlos; Chen, William; Christoffersen, Eric; Dinu, Daniel; Friemel, Barry
2007-09-01
The Xbox 360 is powered by three dual pipeline 3.2 GHz IBM PowerPC processors and a 500 MHz ATI graphics processing unit. The Graphics Processing Unit (GPU) is a special-purpose device, intended to create advanced visual effects and to render realistic scenes for the latest Xbox 360 games. In this paper, we report work on using the GPU as a parallel processing unit to accelerate the decoding of H.264/AVC high-definition (1920x1080) video. We report our experiences in developing a real-time, software-only high-definition video decoder for the Xbox 360.
Adaptive Row-grouped CSR Format for Storing of Sparse Matrices on GPU
Heller, Martin
2012-01-01
We present a new adaptive format for storing sparse matrices on the GPU. We compare it with several other formats, including CUSPARSE, which is today probably the best choice for processing sparse matrices on the GPU in CUDA. Unlike CUSPARSE, which works with the common CSR format, our new format requires a conversion. However, sparse matrix-vector multiplication is significantly faster for many matrices. We demonstrate this on a set of 1,600 matrices and show for which types of matrices our format is profitable.
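As a reference point for the comparison above, the baseline CSR sparse matrix-vector product with one thread per row can be sketched in CUDA as follows; this is the generic scalar kernel, not the adaptive row-grouped format of the paper.

__global__ void spmv_csr_scalar(int nrows, const int *row_ptr, const int *col_idx,
                                const float *val, const float *x, float *y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nrows) return;
    float acc = 0.0f;
    for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
        acc += val[j] * x[col_idx[j]];  // per-thread work depends on the row length, so warps diverge
    y[row] = acc;
}

Row-grouped and other adaptive formats reorganize the storage so that threads in a warp perform more uniform work, which is where the reported speedups over plain CSR come from.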
A new GPU-accelerated hydrodynamical code for numerical simulation of interacting galaxies
Igor, Kulikov
2013-01-01
In this paper a new scalable hydrodynamic code, GPUPEGAS (GPU-accelerated PErformance Gas Astrophysic Simulation), for the simulation of interacting galaxies is proposed. The code is based on a combination of the Godunov method and an original implementation of the FlIC method, specially adapted for GPU implementation. A Fast Fourier Transform is used to solve the Poisson equation in GPUPEGAS. The software implementation of the above methods was tested on classical gas dynamics problems, the new Aksenov test and classical gravitational gas dynamics problems. A collisionless hydrodynamic approach was used for modelling stars and dark matter. The scalability of GPUPEGAS across computational accelerators is shown.
Fast 4$\\pi$ track reconstruction in nuclear emulsion detectors based on GPU technology
Ariga, A
2013-01-01
Fast 4$\\pi$ solid angle particle track recognition has been a challenge in particle physics for a long time, especially in using nuclear emulsion detectors. The recent advances in computing technology opened the way for its realization. A fast 4$\\pi$ solid angle particle track reconstruction based on GPU technology combined with a multithread programming is reported here with a detailed comparison between GPU-based and CPU-based programming. A 60 times faster processing of 3D emulsion detector data, corresponding to processing of 15 cm$^2$ emulsion surface scanned per hour, has been achieved by GPUs with an excellent tracking performance.
Optimization strategies for parallel CPU and GPU implementations of a meshfree particle method
Domínguez, Jose M; Gómez-Gesteira, Moncho
2011-01-01
Much of the current focus in high performance computing (HPC) for computational fluid dynamics (CFD) deals with grid based methods. However, parallel implementations for new meshfree particle methods such as Smoothed Particle Hydrodynamics (SPH) are less studied. In this work, we present optimizations for both central processing unit (CPU) and graphics processing unit (GPU) of a SPH method. These optimization strategies can be further applied to many other meshfree methods. The obtained performance for each architecture and a comparison between the most efficient implementations for CPU and GPU are shown.
High Performance Processing and Analysis of Geospatial Data Using CUDA on GPU
Directory of Open Access Journals (Sweden)
STOJANOVIC, N.
2014-11-01
Full Text Available In this paper, the high-performance processing of massive geospatial data on a many-core GPU (Graphic Processing Unit) is presented. We use the CUDA (Compute Unified Device Architecture) programming framework to implement parallel processing of common Geographic Information Systems (GIS) algorithms, such as viewshed analysis and map-matching. Experimental evaluation indicates the improvement in performance with respect to CPU-based solutions and shows the feasibility of using GPU and CUDA for parallel implementation of GIS algorithms over large-scale geospatial datasets.
Performance of new GPU-based scan-conversion algorithm implemented using OpenGL.
Steelman, William A; Richard, William D
2011-04-01
A new GPU-based scan-conversion algorithm implemented using OpenGL is described. The compute performance of this new algorithm running on a modern GPU is compared to the performance of three common scan-conversion algorithms (nearest-neighbor, linear interpolation and bilinear interpolation) implemented in software using a modern CPU. The quality of the images produced by the algorithm, as measured by signal-to-noise power, is also compared to the quality of the images produced using these three common scan-conversion algorithms.
New Multithreaded Hybrid CPU/GPU Approach to Hartree-Fock.
Asadchev, Andrey; Gordon, Mark S
2012-11-13
In this article, a new multithreaded Hartree-Fock CPU/GPU method is presented which utilizes automatically generated code and modern C++ techniques to achieve a significant improvement in memory usage and computer time. In particular, the newly implemented Rys Quadrature and Fock Matrix algorithms, implemented as a stand-alone C++ library with C and Fortran bindings, provide up to a 40% improvement over the traditional Fortran Rys Quadrature. The C++ GPU HF code provides approximately a factor of 17.5 improvement over the corresponding C++ CPU code.
Multi-GPU and multi-CPU accelerated FDTD scheme for vibroacoustic applications
Francés, J.; Otero, B.; Bleda, S.; Gallego, S.; Neipp, C.; Márquez, A.; Beléndez, A.
2015-06-01
The Finite-Difference Time-Domain (FDTD) method is applied to the analysis of vibroacoustic problems and to study the propagation of longitudinal and transversal waves in stratified media. The potential of the scheme and the relevance of each acceleration strategy for massive computations in FDTD are demonstrated in this work. In this paper, we propose two new specific implementations of the bi-dimensional FDTD scheme using multi-CPU and multi-GPU, respectively. In the first implementation, an open source message passing interface (OMPI) has been included in order to massively exploit the resources of a biprocessor station with two Intel Xeon processors. Moreover, regarding the CPU code version, the streaming SIMD extensions (SSE) and the advanced vector extensions (AVX) have been included, together with shared memory approaches that take advantage of multi-core platforms. On the other hand, the second implementation, called the multi-GPU code version, is based on peer-to-peer communications available in CUDA on two GPUs (NVIDIA GTX 670). Subsequently, this paper presents an accurate analysis of the influence of the different code versions, including shared memory approaches, vector instructions and multi-processors (both CPU and GPU), and compares them in order to delimit the degree of improvement of using distributed solutions based on multi-CPU and multi-GPU. The performance of both approaches was analysed, and it has been demonstrated that the addition of shared memory schemes to CPU computing substantially improves the performance of vector instructions, enlarging the simulation sizes that use the cache memory of CPUs efficiently. In this case, GPU computing is only about twice as fast as the fine-tuned CPU version for both one and two nodes. However, for massive computations, explicit vector instructions are not worthwhile, since memory bandwidth is the limiting factor and the performance tends to be the same as that of the sequential version.
GPU based cloud system for high-performance arrhythmia detection with parallel k-NN algorithm.
Tae Joon Jun; Hyun Ji Park; Hyuk Yoo; Young-Hak Kim; Daeyoung Kim
2016-08-01
In this paper, we propose a GPU-based cloud system for high-performance arrhythmia detection. The Pan-Tompkins algorithm is used for QRS detection, and we optimized the beat classification algorithm with K-Nearest Neighbor (K-NN). To support high-performance beat classification on the system, we parallelized the beat classification algorithm with CUDA to execute it on virtualized GPU devices in the cloud system. The MIT-BIH Arrhythmia database is used for validation of the algorithm. The system achieved a detection rate of about 93.5%, which is comparable to previous studies, while our algorithm shows 2.5 times faster execution time compared to the CPU-only detection algorithm.
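The classification stage parallelized in this abstract is dominated by distance evaluations between a query beat and the training beats. A hedged CUDA sketch of that stage is given below; the feature dimension, array names, and the subsequent k-selection on the host are assumptions rather than details from the paper.

// One thread per training beat: squared Euclidean distance to a single query vector.
// The k smallest distances are then selected on the host or with a GPU sort.
__global__ void knn_distances(int n_train, int dim,
                              const float *train,   // n_train x dim, row-major
                              const float *query,   // dim entries
                              float *dist)          // n_train entries
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_train) {
        float d = 0.0f;
        for (int j = 0; j < dim; ++j) {
            float diff = train[i * dim + j] - query[j];
            d += diff * diff;
        }
        dist[i] = d;   // squared distance is sufficient for ranking neighbors
    }
}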
Complex fluid flow modeling with SPH on GPU
Bilotta, Giuseppe; Hérault, Alexis; Del Negro, Ciro; Russo, Giovanni; Vicari, Annamaria
2010-05-01
Our lava flow model is based on the SPH meshless method. In comparison to other particle methods, SPH also provides additional benefits such as the automatic preservation of mass. The direct computation of most physical quantities (e.g. pressure) without resorting to large, sparse implicit systems makes SPH particularly favorable for implementation on highly parallel computational hardware such as modern video cards. The graphical processing units (GPUs) on modern video cards often surpass the computational power of the CPUs that drive them. The CUDA architecture, introduced by NVIDIA in the spring of 2007, allows generic GPU programming with an extension of the C language, making it easy to write highly parallelized code. Our lava simulation model uses the SPH method with a pure GPU implementation in CUDA to achieve high computational performance, modeling both the dynamic and thermal aspects of a lava flow. The dynamic parts of the SPH algorithms are based on those of the SPHysics simulator, enhanced to include the treatment of non-Newtonian fluids, the integration of thermal effects including temperature-dependent rheological parameters, and an optimal handling of large-scale natural topographies. For the non-Newtonian rheologies, priority is given to the power law recently brought to light by physical modeling of lava flows. For the thermal part of the model, the SPH model has been compared with classical finite elements to simulate lava lake solidification, a problem for which an analytical solution is known. The comparison shows the significantly higher accuracy of the SPH method in the proximity of the contact area of two or more solidification fronts.
Fast GPU based adaptive filtering of 4D echocardiography.
Broxvall, Mathias; Emilsson, Kent; Thunberg, Per
2012-06-01
Time-resolved three-dimensional (3D) echocardiography generates four-dimensional (3D+time) data sets that bring new possibilities to clinical practice. Image quality of four-dimensional (4D) echocardiography is, however, regarded as poorer compared to conventional echocardiography, where time-resolved 2D imaging is used. Advanced image processing filtering methods can be used to achieve image improvements, but at the cost of heavy data processing. The recent development of graphics processing units (GPUs) enables highly parallel general-purpose computations that considerably reduce the computational time of advanced image filtering methods. In this study, multidimensional adaptive filtering of 4D echocardiography was performed using GPUs. Filtering was done using multiple kernels implemented in OpenCL (open computing language) working on multiple subsets of the data. Our results show a substantial speed increase of up to 74 times, resulting in a total filtering time of less than 30 s on a common desktop. This implies that advanced adaptive image processing can be accomplished in conjunction with a clinical examination. Since the presented GPU processing method scales linearly with the number of processing elements, we expect it to continue scaling with the expected future increases in the number of processing elements. This should be contrasted with the increases in data set sizes expected in the near future, following further improvements in ultrasound probes and measuring devices. It is concluded that GPUs facilitate the use of demanding adaptive image filtering techniques that in turn enhance 4D echocardiographic data sets. The presented general methodology of implementing parallelism using GPUs is also applicable to other medical modalities that generate multidimensional data.
Mobile Devices and GPU Parallelism in Ionospheric Data Processing
Mascharka, D.; Pankratius, V.
2015-12-01
Scientific data acquisition in the field is often constrained by data transfer backchannels to analysis environments. Geoscientists are therefore facing practical bottlenecks with increasing sensor density and variety. Mobile devices, such as smartphones and tablets, offer promising solutions to key problems in scientific data acquisition, pre-processing, and validation by providing advanced capabilities in the field. This is due to affordable network connectivity options and the increasing mobile computational power. This contribution exemplifies a scenario faced by scientists in the field and presents the "Mahali TEC Processing App" developed in the context of the NSF-funded Mahali project. Aimed at atmospheric science and the study of ionospheric Total Electron Content (TEC), this app is able to gather data from various dual-frequency GPS receivers. It demonstrates parsing of full-day RINEX files on mobile devices and on-the-fly computation of vertical TEC values based on satellite ephemeris models that are obtained from NASA. Our experiments show how parallel computing on the mobile device GPU enables fast processing and visualization of up to 2 million datapoints in real-time using OpenGL. GPS receiver bias is estimated through minimum TEC approximations that can be interactively adjusted by scientists in the graphical user interface. Scientists can also perform approximate computations for "quickviews" to reduce CPU processing time and memory consumption. In the final stage of our mobile processing pipeline, scientists can upload data to the cloud for further processing. Acknowledgements: The Mahali project (http://mahali.mit.edu) is funded by the NSF INSPIRE grant no. AGS-1343967 (PI: V. Pankratius). We would like to acknowledge our collaborators at Boston College, Virginia Tech, Johns Hopkins University, Colorado State University, as well as the support of UNAVCO for loans of dual-frequency GPS receivers for use in this project, and Intel for loans of
GPU Accelerated Automated Feature Extraction From Satellite Images
Directory of Open Access Journals (Sweden)
K. Phani Tejaswi
2013-04-01
Results demonstrate that substantial enhancement in performance margin can be achieved with the best utilization of GPU resources and an efficient parallelization strategy. Performance results in comparison with the conventional computing scenario have provided a speedup of 20x on realization of this parallelizing strategy.
Kohno, R; Hotta, K; Nishioka, S; Matsubara, K; Tansho, R; Suzuki, T
2011-11-21
We implemented the simplified Monte Carlo (SMC) method on graphics processing unit (GPU) architecture under the compute unified device architecture platform developed by NVIDIA. The GPU-based SMC was clinically applied for four patients with head and neck, lung, or prostate cancer. The results were compared to those obtained by a traditional CPU-based SMC with respect to computation time and discrepancy. In the CPU- and GPU-based SMC calculations, the estimated mean statistical errors of the calculated doses in the planning target volume region were within 0.5% rms. The dose distributions calculated by the GPU- and CPU-based SMCs were similar, within statistical errors. The GPU-based SMC showed 12.30-16.00 times faster performance than the CPU-based SMC. The computation time per beam arrangement using the GPU-based SMC for the clinical cases ranged from 9 to 67 s. The results demonstrate the successful application of the GPU-based SMC to clinical proton treatment planning.
Hallock, Michael J; Stone, John E; Roberts, Elijah; Fry, Corey; Luthey-Schulten, Zaida
2014-05-01
Simulation of in vivo cellular processes with the reaction-diffusion master equation (RDME) is a computationally expensive task. Our previous software enabled simulation of inhomogeneous biochemical systems for small bacteria over long time scales using the MPD-RDME method on a single GPU. Simulations of larger eukaryotic systems exceed the on-board memory capacity of individual GPUs, and long time simulations of modest-sized cells such as yeast are impractical on a single GPU. We present a new multi-GPU parallel implementation of the MPD-RDME method based on a spatial decomposition approach that supports dynamic load balancing for workstations containing GPUs of varying performance and memory capacity. We take advantage of high-performance features of CUDA for peer-to-peer GPU memory transfers and evaluate the performance of our algorithms on state-of-the-art GPU devices. We present parallel efficiency and performance results for simulations using multiple GPUs as system size, particle counts, and number of reactions grow. We also demonstrate multi-GPU performance in simulations of the Min protein system in E. coli. Moreover, our multi-GPU decomposition and load balancing approach can be generalized to other lattice-based problems.
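The peer-to-peer GPU memory transfers mentioned above follow a standard CUDA pattern: check and enable peer access between devices, then copy exchange buffers directly between them. A minimal sketch with placeholder buffer names, not the package's actual variables, is shown below.

#include <cuda_runtime.h>

// Typically done once at start-up: enable peer access between devices 0 and 1.
void enable_peer_access()
{
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    if (can01 && can10) {
        cudaSetDevice(0); cudaDeviceEnablePeerAccess(1, 0);
        cudaSetDevice(1); cudaDeviceEnablePeerAccess(0, 0);
    }
}

// Per time step: copy a halo buffer from device 0 to device 1 without
// staging through the host (d_halo_src lives on device 0, d_halo_dst on device 1).
void exchange_halo(void *d_halo_dst, const void *d_halo_src, size_t halo_bytes)
{
    cudaMemcpyPeer(d_halo_dst, 1, d_halo_src, 0, halo_bytes);
}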
SU-E-T-493: Accelerated Monte Carlo Methods for Photon Dosimetry Using a Dual-GPU System and CUDA.
Liu, T; Ding, A; Xu, X
2012-06-01
To develop a Graphics Processing Unit (GPU) based Monte Carlo (MC) code that accelerates dose calculations on a dual-GPU system. We simulated a clinical case of prostate cancer treatment. A voxelized abdomen phantom derived from 120 CT slices was used, containing 218×126×60 voxels, and a GE LightSpeed 16-MDCT scanner was modeled. A CPU version of the MC code was first developed in C++ and tested on an Intel Xeon X5660 2.8GHz CPU; it was then translated into a GPU version using CUDA C 4.1 and run on a dual Tesla M2090 GPU system. The code featured automatic assignment of simulation tasks to multiple GPUs, as well as accurate calculation of energy- and material-dependent cross-sections. Double-precision floating point format was used for accuracy. Doses to the rectum, prostate, bladder and femoral heads were calculated. When running on a single GPU, the MC GPU code was found to be 19 times faster than the CPU code and 42 times faster than MCNPX. These speedup factors were doubled on the dual-GPU system. The dose result was benchmarked against MCNPX, and a maximum difference of 1% was observed when the relative error is kept below 0.1%. A GPU-based MC code was developed for dose calculations using detailed patient and CT scanner models. Efficiency and accuracy were both guaranteed in this code. Scalability of the code was confirmed on the dual-GPU system. © 2012 American Association of Physicists in Medicine.
Surface and Curve Skeletonization of Large 3D Models on the GPU
Jalba, Andrei C.; Kustra, Jacek; Telea, Alexandru C.
2013-01-01
We present a GPU-based framework for extracting surface and curve skeletons of 3D shapes represented as large polygonal meshes. We use an efficient parallel search strategy to compute point-cloud skeletons and their distance and feature transforms (FTs) with user-defined precision. We regularize ske
Implementation of the CA-CFAR algorithm for pulsed-doppler radar on a GPU architecture
CSIR Research Space (South Africa)
Venter, CJ
2011-12-01
The approach was to gradually explore opportunities for parallel execution and optimization by implementing the algorithm first in MATLAB (CPU), followed by native C (CPU) and finally NVIDIA CUDA (GPU) environments. Three techniques for implementing the CA-CFAR in software were...
GPU Implementation of Bayesian Neural Network Construction for Data-Intensive Applications
Perry, Michelle; Prosper, Harrison B.; Meyer-Baese, Anke
2014-06-01
We describe a graphical processing unit (GPU) implementation of the Hybrid Markov Chain Monte Carlo (HMC) method for training Bayesian Neural Networks (BNN). Our implementation uses NVIDIA's parallel computing architecture, CUDA. We briefly review BNNs and the HMC method and we describe our implementations and give preliminary results.
Development of a GPU-accelerated MIKE 21 Solver for Water Wave Dynamics
DEFF Research Database (Denmark)
Aackermann, Peter Edward; Pedersen, Peter Juhler Dinesen; Engsig-Karup, Allan Peter;
2013-01-01
With encouragement from the company DHI, the aim of this B.Sc. thesis is to investigate whether it is possible to accelerate the simulation speed of DHI's commercial product MIKE 21 HD, by formulating a parallel solution scheme and implementing it to be executed on a CUDA-enabled GPU (massive...
GPU-SD and DPD Parallelization for Gromacs tools for molecular dynamics simulations
Goga, Nicolae; Marrink, Siewert; Cioromela, Ruxandra; Moldoveanu, Florica
2012-01-01
This article presents the GPU parallelization of new SD- and DPD-type algorithms for molecular dynamics systems, developed by the Molecular Dynamics Group, University of Groningen, the Netherlands. One should note that molecular dynamics simulations are time-consuming simulations of systems, running
NBODY6++GPU: Ready for the gravitational million-body problem
Wang, Long; Aarseth, Sverre; Nitadori, Keigo; Berczik, Peter; Kouwenhoven, M B N; Naab, Thorsten
2015-01-01
Accurate direct $N$-body simulations help to obtain detailed information about the dynamical evolution of star clusters. They also enable comparisons with analytical models and Fokker-Planck or Monte-Carlo methods. NBODY6 is a well-known direct $N$-body code for star clusters, and NBODY6++ is the extended version designed for large particle number simulations by supercomputers. We present NBODY6++GPU, an optimized version of NBODY6++ with hybrid parallelization methods (MPI, GPU, OpenMP, and AVX/SSE) to accelerate large direct $N$-body simulations, and in particular to solve the million-body problem. We discuss the new features of the NBODY6++GPU code, benchmarks, as well as the first results from a simulation of a realistic globular cluster initially containing a million particles. For million-body simulations, NBODY6++GPU is $400-2000$ times faster than NBODY6 with 320 CPU cores and 32 NVIDIA K20X GPUs. With this computing cluster specification, the simulations of million-body globular clusters including $5...
Valdez-Balderas, Daniel; Rogers, Benedict D; Crespo, Alejandro J C
2012-01-01
Starting from the single graphics processing unit (GPU) version of the Smoothed Particle Hydrodynamics (SPH) code DualSPHysics, a multi-GPU SPH program is developed for free-surface flows. The approach is based on a spatial decomposition technique, whereby different portions (sub-domains) of the physical system under study are assigned to different GPUs. Communication between devices is achieved with the use of Message Passing Interface (MPI) application programming interface (API) routines. The use of the sorting algorithm radix sort for inter-GPU particle migration and sub-domain halo building (which enables interaction between SPH particles of different subdomains) is described in detail. With the resulting scheme it is possible, on the one hand, to carry out simulations that could also be performed on a single GPU, but they can now be performed even faster than on one of these devices alone. On the other hand, accelerated simulations can be performed with up to 32 million particles on the current architec...
Parallel Implementation of Color Based Image Retrieval Using CUDA on the GPU
Directory of Open Access Journals (Sweden)
Hadis Heidari
2013-12-01
Most image processing algorithms are inherently parallel, so multithreaded processors are suitable for such applications. In huge image databases, image processing takes a very long time to run on a single-core processor because the algorithms execute in a single thread. Graphics processing units (GPUs) are common in most image processing applications due to multithreaded execution of algorithms, programmability and low cost. In this paper, we implement a color-based image retrieval system in parallel using the Compute Unified Device Architecture (CUDA) programming model to run on the GPU. The main goal of this research work is to parallelize the process of color-based image retrieval through color moments; the whole process thus becomes much faster than normal. Our work makes extensive use of the highly multithreaded architecture of the multi-cored GPU. Efficient use of shared memory is needed to optimize parallel reduction in CUDA. We evaluated the retrieval of the proposed technique using Recall, Precision, and Average Precision measures. Experimental results showed that the parallel implementation led to an average speedup of 6.305× over the serial implementation when running on an NVIDIA GeForce 610M GPU. The average Precision and the average Recall of the presented method are 53.84% and 55.00%, respectively.
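The shared-memory parallel reduction mentioned in this abstract is the standard CUDA idiom for summing per-pixel values (for example when accumulating color moments). A generic block-level reduction sketch, not taken from the paper's code, is shown below.

#include <cuda_runtime.h>

// Each block reduces a slice of the input to one partial sum in shared memory;
// the per-block partial sums are then summed on the host or by a second pass.
// Assumes blockDim.x is a power of two.
__global__ void block_sum(const float *in, float *block_sums, int n)
{
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x * 2 + threadIdx.x;

    float v = 0.0f;
    if (i < n)              v += in[i];
    if (i + blockDim.x < n) v += in[i + blockDim.x];
    sdata[tid] = v;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) block_sums[blockIdx.x] = sdata[0];
}

// Launch: block_sum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_partial, n);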
Cell-based Adaptive Mesh Refinement on the GPU with Applications to Exascale Supercomputing
Trujillo, Dennis; Robey, Robert; Davis, Neal; Nicholaeff, David
2011-10-01
We present an OpenCL implementation of a cell-based adaptive mesh refinement (AMR) scheme for the shallow water equations. The challenges associated with ensuring the locality of the algorithm architecture to fully exploit the massive number of parallel threads on the GPU are discussed. This includes a proof of concept that a cell-based AMR code can be effectively implemented, even on a small scale, in the memory and threading model provided by OpenCL. Additionally, the program requires dynamic memory in order to properly implement the mesh; as this is not supported in the OpenCL 1.1 standard, a combination of CPU memory management and GPU computation effectively implements a dynamic memory allocation scheme. Load balancing is achieved through a new stencil-based implementation of a space-filling curve, eliminating the need for a complete recalculation of the indexing on the mesh. A Cartesian grid hash table scheme to allow fast parallel neighbor accesses is also discussed. Finally, the relative speedup of the GPU-enabled AMR code is compared to the original serial version. We conclude that parallelization using the GPU provides significant speedup for typical numerical applications and is feasible for scientific applications in the next generation of supercomputing.
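The space-filling curve used here for load balancing keeps spatially neighboring cells close in memory. One common realization of such an ordering is a Morton (Z-order) index obtained by interleaving coordinate bits; the helper below is a generic C++ illustration, not the paper's stencil-based OpenCL implementation.

// Interleave the low 16 bits of x and y into a 32-bit Morton (Z-order) index,
// so that cells close in space tend to be close in the resulting ordering.
unsigned int morton2d(unsigned int x, unsigned int y)
{
    auto part1by1 = [](unsigned int v) {
        v &= 0x0000ffffu;                      // keep 16 bits
        v = (v | (v << 8)) & 0x00ff00ffu;      // spread bytes
        v = (v | (v << 4)) & 0x0f0f0f0fu;      // spread nibbles
        v = (v | (v << 2)) & 0x33333333u;      // spread pairs
        v = (v | (v << 1)) & 0x55555555u;      // spread single bits
        return v;
    };
    return (part1by1(y) << 1) | part1by1(x);
}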
Fast Streaming 3D Level set Segmentation on the GPU for Smooth Multi-phase Segmentation
DEFF Research Database (Denmark)
Sharma, Ojaswa; Zhang, Qin; Anton, François
2011-01-01
Level set method based segmentation provides an efficient tool for topological and geometrical shape handling, but it is slow due to high computational burden. In this work, we provide a framework for streaming computations on large volumetric images on the GPU. A streaming computational model...
Accelerating Satellite Image Based Large-Scale Settlement Detection with GPU
Energy Technology Data Exchange (ETDEWEB)
Patlolla, Dilip Reddy [ORNL]; Cheriyadat, Anil M [ORNL]; Weaver, Jeanette E [ORNL]; Bright, Eddie A [ORNL]
2012-01-01
Computer vision algorithms for image analysis are often computationally demanding. Application of such algorithms on large image databases, such as the high-resolution satellite imagery covering the entire land surface, can easily saturate the computational capabilities of conventional CPUs. There is a great demand for vision algorithms running on high performance computing (HPC) architectures capable of processing petascale image data. We exploit the parallel processing capability of GPUs to present a GPU-friendly algorithm for robust and efficient detection of settlements from large-scale high-resolution satellite imagery. Feature descriptor generation is an expensive, but key, step in automated scene analysis. To address this challenge, we present GPU implementations for three different feature descriptors: multiscale Histogram of Oriented Gradients (HOG), Gray Level Co-Occurrence Matrix (GLCM) Contrast, and local pixel intensity statistics. We perform extensive experimental evaluations of our implementation using diverse and large image datasets. Our GPU implementation of the feature descriptor algorithms results in speedups of 220 times compared to the CPU version. We present a highly efficient settlement detection system running on a multi-GPU architecture, capable of extracting human settlement regions from city-scale sub-meter spatial resolution aerial imagery spanning roughly 1200 sq. kilometers in just 56 seconds, with detection accuracy close to 90%. This remarkable speedup gained by our vision algorithm, while maintaining high detection accuracy, clearly demonstrates that such computational advancements hold the solution for petascale image analysis challenges.
Length-Bounded Hybrid CPU/GPU Pattern Matching Algorithm for Deep Packet Inspection
Directory of Open Access Journals (Sweden)
Yi-Shan Lin
2017-01-01
Since frequent communication between applications takes place in high speed networks, deep packet inspection (DPI) plays an important role in network application awareness. The signature-based network intrusion detection system (NIDS) contains a DPI technique that examines the incoming packet payloads by employing a pattern matching algorithm that dominates the overall inspection performance. Existing studies focused on implementing efficient pattern matching algorithms by parallel programming on software platforms because of the advantages of lower cost and higher scalability. Either the central processing unit (CPU) or the graphics processing unit (GPU) was involved. Our studies focused on designing a pattern matching algorithm based on the cooperation between both the CPU and GPU. In this paper, we present an enhanced design of our previous work, a length-bounded hybrid CPU/GPU pattern matching algorithm (LHPMA). In the preliminary experiment, the performance and comparison with the previous work are displayed, and the experimental results show that the LHPMA can achieve not only effective CPU/GPU cooperation but also higher throughput than the previous method.
Directory of Open Access Journals (Sweden)
M. Huang
2015-09-01
The planetary boundary layer (PBL) is the lowest part of the atmosphere and where its character is directly affected by its contact with the underlying planetary surface. The PBL is responsible for vertical sub-grid-scale fluxes due to eddy transport in the whole atmospheric column. It determines the flux profiles within the well-mixed boundary layer and the more stable layer above. It thus provides an evolutionary model of atmospheric temperature, moisture (including clouds), and horizontal momentum in the entire atmospheric column. For such purposes, several PBL models have been proposed and employed in the weather research and forecasting (WRF) model, of which the Yonsei University (YSU) scheme is one. To expedite weather research and prediction, we have put tremendous effort into developing an accelerated implementation of the entire WRF model using graphics processing unit (GPU) massive parallel computing architecture whilst maintaining its accuracy as compared to its central processing unit (CPU)-based implementation. This paper presents our efficient GPU-based design on a WRF YSU PBL scheme. Using one NVIDIA Tesla K40 GPU, the GPU-based YSU PBL scheme achieves a speedup of 193× with respect to its CPU counterpart running on one CPU core, whereas the speedup for one CPU socket (4 cores) with respect to 1 CPU core is only 3.5×. We can even boost the speedup to 360× with respect to 1 CPU core as two K40 GPUs are applied.
A GPU-based Real-time Software Correlation System for the Murchison Widefield Array Prototype
Wayth, Randall B.; Greenhill, Lincoln J.; Briggs, Frank H.
2009-08-01
Modern graphics processing units (GPUs) are inexpensive commodity hardware that offer Tflop/s theoretical computing capacity. GPUs are well suited to many compute-intensive tasks including digital signal processing. We describe the implementation and performance of a GPU-based digital correlator for radio astronomy. The correlator is implemented using the NVIDIA CUDA development environment. We evaluate three design options on two generations of NVIDIA hardware. The different designs utilize the internal registers, shared memory, and multiprocessors in different ways. We find that optimal performance is achieved with the design that minimizes global memory reads on recent generations of hardware. The GPU-based correlator outperforms a single-threaded CPU equivalent by a factor of 60 for a 32-antenna array, and runs on commodity PC hardware. The extra compute capability provided by the GPU maximizes the correlation capability of a PC while retaining the fast development time associated with using standard hardware, networking, and programming languages. In this way, a GPU-based correlation system represents a middle ground in design space between high performance, custom-built hardware, and pure CPU-based software correlation. The correlator was deployed at the Murchison Widefield Array 32-antenna prototype system where it ran in real time for extended periods. We briefly describe the data capture, streaming, and correlation system for the prototype array.
GPU acceleration of the stochastic grid bundling method for early-exercise options
A. Leitao Rodriguez (Álvaro); C.W. Oosterlee (Cornelis)
2015-01-01
In this work, a parallel graphics processing units (GPU) version of the Monte Carlo stochastic grid bundling method (SGBM) for pricing multi-dimensional early-exercise options is presented. To extend the method’s applicability, the problem dimensions and the number of bundles will be inc
A GPU-accelerated Direct-sum Boundary Integral Poisson-Boltzmann Solver
Geng, Weihua
2013-01-01
In this paper, we present a GPU-accelerated direct-sum boundary integral method to solve the linear Poisson-Boltzmann (PB) equation. In our method, a well-posed boundary integral formulation is used to ensure the fast convergence of Krylov subspace based linear algebraic solvers such as GMRES. The molecular surfaces are discretized with flat triangles and centroid collocation. To speed up our method, we take advantage of the parallel nature of the boundary integral formulation and parallelize the schemes within the CUDA shared memory architecture on the GPU. The schemes use only $11N+6N_c$ size-of-double device memory for a biomolecule with $N$ triangular surface elements and $N_c$ partial charges. Numerical tests of these schemes show well-maintained accuracy and fast convergence. The GPU implementation using one GPU card (Nvidia Tesla M2070) achieves a 120-150X speed-up over the implementation using one CPU (Intel L5640 2.27GHz). With our approach, solving PB equations on well-discretized molecular surfaces with up ...
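The direct-sum evaluation parallelized within CUDA shared memory above follows the familiar tiled pattern: each block stages a tile of source elements in shared memory and every thread accumulates their contributions to one target point. The sketch below uses a simple 1/r kernel for illustration; the actual boundary integral kernels of the method are more involved, and the names are assumptions.

#define TILE 128   // launch with blockDim.x == TILE

// Each thread owns one target point and accumulates contributions from all
// source points, which are staged tile-by-tile in shared memory.
__global__ void direct_sum(int n, const float3 *pos, const float *charge, float *phi)
{
    __shared__ float3 s_pos[TILE];
    __shared__ float  s_q[TILE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float3 pi = (i < n) ? pos[i] : make_float3(0.f, 0.f, 0.f);
    float acc = 0.0f;

    for (int tile = 0; tile < n; tile += TILE) {
        int j = tile + threadIdx.x;
        if (j < n) { s_pos[threadIdx.x] = pos[j]; s_q[threadIdx.x] = charge[j]; }
        __syncthreads();

        int m = min(TILE, n - tile);
        for (int k = 0; k < m; ++k) {
            float dx = pi.x - s_pos[k].x, dy = pi.y - s_pos[k].y, dz = pi.z - s_pos[k].z;
            float r2 = dx*dx + dy*dy + dz*dz + 1e-12f;   // small softening avoids division by zero
            acc += s_q[k] * rsqrtf(r2);
        }
        __syncthreads();
    }
    if (i < n) phi[i] = acc;
}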
Multi-GPU implementation of a VMAT treatment plan optimization algorithm
Tian, Zhen; Folkerts, Michael; Tan, Jun; Jia, Xun; Jiang, Steve B
2015-01-01
VMAT optimization is a computationally challenging problem due to its large data size, high degrees of freedom, and many hardware constraints. High-performance graphics processing units (GPUs) have been used to speed up the computations. However, their small memory size cannot handle cases with a large dose-deposition coefficient (DDC) matrix. This paper reports an implementation of our column-generation-based VMAT algorithm on a multi-GPU platform to solve the memory limitation problem. The column-generation approach generates apertures sequentially by solving a pricing problem (PP) and a master problem (MP) iteratively. The DDC matrix is split into four sub-matrices according to beam angles, stored on four GPUs in compressed sparse row format. Computation of beamlet prices is accomplished using multiple GPUs, while the remaining steps of the PP and MP are implemented on a single GPU due to their modest computational loads. An H&N patient case was used to validate our method. We compare our multi-GPU implemen...
GPU computing with Kaczmarz's and other iterative algorithms for linear systems.
Elble, Joseph M; Sahinidis, Nikolaos V; Vouzis, Panagiotis
2010-06-01
The graphics processing unit (GPU) is used to solve large linear systems derived from partial differential equations. The differential equations studied are strongly convection-dominated, of various sizes, and common to many fields, including computational fluid dynamics, heat transfer, and structural mechanics. The paper presents comparisons between GPU and CPU implementations of several well-known iterative methods, including Kaczmarz's, Cimmino's, component averaging, conjugate gradient normal residual (CGNR), symmetric successive overrelaxation-preconditioned conjugate gradient, and conjugate-gradient-accelerated component-averaged row projections (CARP-CG). Computations are performed with dense as well as general banded systems. The results demonstrate that our GPU implementation outperforms CPU implementations of these algorithms, as well as previously studied parallel implementations on Linux clusters and shared memory systems. While the CGNR method had begun to fall out of favor for solving such problems, for the problems studied in this paper, the CGNR method implemented on the GPU performed better than the other methods, including a cluster implementation of the CARP-CG method.
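For orientation, one sweep of Kaczmarz's method projects the iterate onto the hyperplane of each row in turn. The plain host-side sketch below is for a dense row-major system and is only illustrative; the paper's GPU versions additionally handle banded storage and the other methods listed.

// One Kaczmarz sweep over all rows of a dense m x n system A x = b (row-major A).
// Each row i projects x onto its hyperplane: x += (b_i - <a_i, x>) / ||a_i||^2 * a_i.
void kaczmarz_sweep(int m, int n, const double *A, const double *b, double *x)
{
    for (int i = 0; i < m; ++i) {
        const double *ai = A + (size_t)i * n;
        double dot = 0.0, norm2 = 0.0;
        for (int j = 0; j < n; ++j) {
            dot   += ai[j] * x[j];
            norm2 += ai[j] * ai[j];
        }
        if (norm2 == 0.0) continue;              // skip zero rows
        double scale = (b[i] - dot) / norm2;
        for (int j = 0; j < n; ++j)
            x[j] += scale * ai[j];
    }
}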
Huang, M.; Mielikainen, J.; Huang, B.; Chen, H.; Huang, H.-L. A.; Goldberg, M. D.
2015-09-01
The planetary boundary layer (PBL) is the lowest part of the atmosphere and where its character is directly affected by its contact with the underlying planetary surface. The PBL is responsible for vertical sub-grid-scale fluxes due to eddy transport in the whole atmospheric column. It determines the flux profiles within the well-mixed boundary layer and the more stable layer above. It thus provides an evolutionary model of atmospheric temperature, moisture (including clouds), and horizontal momentum in the entire atmospheric column. For such purposes, several PBL models have been proposed and employed in the weather research and forecasting (WRF) model of which the Yonsei University (YSU) scheme is one. To expedite weather research and prediction, we have put tremendous effort into developing an accelerated implementation of the entire WRF model using graphics processing unit (GPU) massive parallel computing architecture whilst maintaining its accuracy as compared to its central processing unit (CPU)-based implementation. This paper presents our efficient GPU-based design on a WRF YSU PBL scheme. Using one NVIDIA Tesla K40 GPU, the GPU-based YSU PBL scheme achieves a speedup of 193× with respect to its CPU counterpart running on one CPU core, whereas the speedup for one CPU socket (4 cores) with respect to 1 CPU core is only 3.5×. We can even boost the speedup to 360× with respect to 1 CPU core as two K40 GPUs are applied.
GPU acceleration of the stochastic grid bundling method for early-exercise options
Leitao Rodriguez, A.; Oosterlee, C.W.
2015-01-01
In this work, a parallel graphics processing units (GPU) version of the Monte Carlo stochastic grid bundling method (SGBM) for pricing multi-dimensional early-exercise options is presented. To extend the method’s applicability, the problem dimensions and the number of bundles will be increased drast
Accelerating image reconstruction in dual-head PET system by GPU and symmetry properties.
Directory of Open Access Journals (Sweden)
Cheng-Ying Chou
Positron emission tomography (PET) is an important imaging modality in both clinical usage and research studies. We have developed a compact high-sensitivity PET system that consists of two large-area panel PET detector heads, which produce more than 224 million lines of response and thus impose dramatic computational demands. In this work, we employed a state-of-the-art graphics processing unit (GPU), the NVIDIA Tesla C2070, to yield an efficient reconstruction process. Our approaches ingeniously integrate the distinguishing features of the symmetry properties of the imaging system and of GPU architectures, including block/warp/thread assignments and effective memory usage, to accelerate the computations for ordered subset expectation maximization (OSEM) image reconstruction. The OSEM reconstruction algorithms were implemented employing both CPU-based and GPU-based codes, and their computational performance was quantitatively analyzed and compared. The results showed that the GPU-accelerated scheme can drastically reduce the reconstruction time and thus can largely expand the applicability of the dual-head PET system.
Graphics Processing Unit (GPU) Acceleration of the Goddard Earth Observing System Atmospheric Model
Putnam, William
2011-01-01
The Goddard Earth Observing System 5 (GEOS-5) is the atmospheric model used by the Global Modeling and Assimilation Office (GMAO) for a variety of applications, from long-term climate prediction at relatively coarse resolution, to data assimilation and numerical weather prediction, to very high-resolution cloud-resolving simulations. GEOS-5 is being ported to a graphics processing unit (GPU) cluster at the NASA Center for Climate Simulation (NCCS). By utilizing GPU co-processor technology, we expect to increase the throughput of GEOS-5 by at least an order of magnitude, and accelerate the process of scientific exploration across all scales of global modeling, including: the large-scale, high-end application of non-hydrostatic, global, cloud-resolving modeling at 10- to 1-kilometer (km) global resolutions; intermediate-resolution seasonal climate and weather prediction at 50- to 25-km on small clusters of GPUs; and long-range, coarse-resolution climate modeling, enabled on a small box of GPUs for the individual researcher. After being ported to the GPU cluster, the primary physics components and the dynamical core of GEOS-5 have demonstrated a potential speedup of 15-40 times over conventional processor cores. Performance improvements of this magnitude reduce the required scalability of 1-km, global, cloud-resolving models from an unfathomable 6 million cores to an attainable 200,000 GPU-enabled cores.
A New Approach to Reduce Memory Consumption in Lattice Boltzmann Method on GPU
Directory of Open Access Journals (Sweden)
Mojtaba Sheida
2017-01-01
Several efforts have been made to improve LBM defects related to its computational performance. In this work, a new algorithm has been introduced to reduce memory consumption. In the past, most LBM developers have not paid enough attention to retaining LBM simplicity in their modified versions, while it has been one of the main concerns in developing the present algorithm. Note that there is also a deficiency in our new algorithm: besides the memory reduction, some computational efficiency reduction occurs because of a high rate of call-backs to the main memory. To overcome this difficulty, an optimization approach has been introduced, which has recovered this efficiency to that of the original two-step, two-lattice LBM. This is accomplished by a trade-off between memory reduction and computational performance. While keeping a suitable computational efficiency, the memory reduction reaches about 33% in D2Q9 and 42% in D3Q19. In addition, this approach has been implemented on the graphical processing unit (GPU) as well. With regard to the onboard memory limitation of GPUs, the advantage of this new algorithm is enhanced even more (39% in D2Q9 and 45% in D3Q19). Note that, because of the higher memory bandwidth of GPUs, the computational performance of our new algorithm using the GPU is better than on the CPU.
Directory of Open Access Journals (Sweden)
M. Huang
2014-11-01
The planetary boundary layer (PBL) is the lowest part of the atmosphere and where its character is directly affected by its contact with the underlying planetary surface. The PBL is responsible for vertical sub-grid-scale fluxes due to eddy transport in the whole atmospheric column. It determines the flux profiles within the well-mixed boundary layer and the more stable layer above. It thus provides an evolutionary model of atmospheric temperature, moisture (including clouds), and horizontal momentum in the entire atmospheric column. For such purposes, several PBL models have been proposed and employed in the weather research and forecasting (WRF) model, of which the Yonsei University (YSU) scheme is one. To expedite weather research and prediction, we have put tremendous effort into developing an accelerated implementation of the entire WRF model using Graphics Processing Unit (GPU) massive parallel computing architecture whilst maintaining its accuracy as compared to its CPU-based implementation. This paper presents our efficient GPU-based design on WRF YSU PBL scheme. Using one NVIDIA Tesla K40 GPU, the GPU-based YSU PBL scheme achieves a speedup of 193× with respect to its Central Processing Unit (CPU) counterpart running on one CPU core, whereas the speedup for one CPU socket (4 cores) with respect to one CPU core is only 3.5×. We can even boost the speedup to 360× with respect to one CPU core as two K40 GPUs are applied.
cuBLASTP: Fine-Grained Parallelization of Protein Sequence Search on CPU+GPU.
Zhang, Jing; Wang, Hao; Feng, Wu-Chun
2017-01-01
BLAST, short for Basic Local Alignment Search Tool, is a ubiquitous tool used in the life sciences for pairwise sequence search. However, with the advent of next-generation sequencing (NGS), whether at the outset or downstream from NGS, the exponential growth of sequence databases is outstripping our ability to analyze the data. While recent studies have utilized the graphics processing unit (GPU) to speed up the BLAST algorithm for searching protein sequences (i.e., BLASTP), these studies use coarse-grained parallelism, where one sequence alignment is mapped to only one thread. Such an approach does not efficiently utilize the capabilities of a GPU, particularly due to the irregularity of BLASTP in both execution paths and memory-access patterns. To address the above shortcomings, we present a fine-grained approach to parallelize BLASTP, where each individual phase of sequence search is mapped to many threads on a GPU. This approach, which we refer to as cuBLASTP, reorders data-access patterns and reduces divergent branches of the most time-consuming phases (i.e., hit detection and ungapped extension). In addition, cuBLASTP optimizes the remaining phases (i.e., gapped extension and alignment with traceback) on a multicore CPU and overlaps their execution with the phases running on the GPU.
GPU-based normalized cuts for road extraction using satellite imagery
Indian Academy of Sciences (India)
J Senthilnath; S Sindhu; S N Omkar
2014-12-01
This paper presents a GPU implementation of normalized cuts for the road extraction problem using panchromatic satellite imagery. The roads are extracted in three stages, namely pre-processing, image segmentation and post-processing. Initially, the image is pre-processed to improve the tolerance by reducing the clutter (which mostly represents buildings, vegetation, and fallow regions). The road regions are then extracted using the normalized cuts algorithm. The normalized cuts algorithm is a graph-based partitioning approach whose focus lies in extracting the global impression (perceptual grouping) of an image rather than local features. For the segmented image, post-processing is carried out using the morphological operations of erosion and dilation. Finally, the road-extracted image is overlaid on the original image. Here, a GPGPU (General Purpose Graphical Processing Unit) approach has been adopted to implement the same algorithm on the GPU for fast processing. A performance comparison of this proposed GPU implementation of the normalized cuts algorithm with the earlier CPU implementation is presented. From the results, we conclude that the proposed GPU implementation of normalized cuts yields a computational improvement in terms of time as the size of the image increases. Also, a qualitative and quantitative assessment of the segmentation results is presented.
Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA Tesla GPU Cluster
Energy Technology Data Exchange (ETDEWEB)
Allada, Veerendra; Benjegerdes, Troy; Bode, Brett
2009-08-31
Commodity clusters augmented with application accelerators are evolving as competitive high performance computing systems. The Graphical Processing Unit (GPU), with a very high arithmetic density and performance per price ratio, is a good platform for scientific application acceleration. In addition to the interconnect bottlenecks among the cluster compute nodes, the cost of memory copies between the host and the GPU device has to be carefully amortized to improve the overall efficiency of the application. Scientific applications also rely on efficient implementations of the Basic Linear Algebra Subprograms (BLAS), among which the General Matrix Multiply (GEMM) is considered the workhorse subroutine. In this paper, the authors study the performance of the memory copies and GEMM subroutines that are critical to porting computational chemistry algorithms to GPU clusters. To that end, a benchmark based on the NetPIPE framework is developed to evaluate the latency and bandwidth of the memory copies between the host and the GPU device. The performance of the single and double precision GEMM subroutines from the NVIDIA CUBLAS 2.0 library is studied. The results have been compared with those of the BLAS routines from the Intel Math Kernel Library (MKL) to understand the computational trade-offs. The test bed is an Intel Xeon cluster equipped with NVIDIA Tesla GPUs.
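The two measurements described above, host-device copy bandwidth and GEMM throughput, can be reproduced in miniature as follows. This sketch uses the modern cublas_v2 API rather than the CUBLAS 2.0 interface of the paper, and the matrix size and timing details are placeholder assumptions.

#include <cublas_v2.h>
#include <cuda_runtime.h>

// Time one host-to-device copy and run a double-precision GEMM (C = A*B) via cuBLAS.
void bench(int n, const double *h_A, const double *h_B, double *h_C)
{
    size_t bytes = (size_t)n * n * sizeof(double);
    double *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes); cudaMalloc(&d_B, bytes); cudaMalloc(&d_C, bytes);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);

    cudaEventRecord(t0);
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);    // measured copy
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms_copy = 0.f;
    cudaEventElapsedTime(&ms_copy, t0, t1);                  // bandwidth ~ bytes / ms_copy

    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const double alpha = 1.0, beta = 0.0;
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, d_A, n, d_B, n, &beta, d_C, n);
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);

    cublasDestroy(handle);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}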
Fast 2-D ultrasound strain imaging: the benefits of using a GPU.
Idzenga, Tim; Gaburov, Evghenii; Vermin, Willem; Menssen, Jan; de Korte, Chris
2014-01-01
Deformation of tissue can be accurately estimated from radio-frequency ultrasound data using a two-dimensional normalized cross correlation (NCC)-based algorithm. This procedure, however, is very computationally time-consuming. A major time reduction can be achieved by parallelizing the numerous computations of the NCC. In this paper, two approaches to parallelization have been investigated: the OpenMP interface on a multi-CPU system and the Compute Unified Device Architecture (CUDA) on a graphics processing unit (GPU). The performance of the OpenMP and GPU approaches was compared with a conventional Matlab implementation of the NCC. The OpenMP approach with 8 threads achieved a maximum speed-up factor of 132 in computing the NCC, whereas the GPU approach on an Nvidia Tesla K20 achieved a maximum speed-up factor of 376. Neither parallelization approach resulted in a significant loss in image quality of the elastograms. Parallelization of the NCC computations using the GPU therefore significantly reduces the computation time and increases the frame rate for motion estimation.
SkyAlign: a portable, work-efficient skyline algorithm for multicore and GPU architectures
DEFF Research Database (Denmark)
Bøgh, Kenneth Sejdenfaden; Chester, Sean; Assent, Ira
2016-01-01
The skyline operator determines points in a multidimensional dataset that offer some optimal trade-off. State-of-the-art CPU skyline algorithms exploit quad-tree partitioning with complex branching to minimise the number of point-to-point comparisons. Branch-phobic GPU skyline algorithms rely on ...
permGPU: Using graphics processing units in RNA microarray association studies
Directory of Open Access Journals (Sweden)
George Stephen L
2010-06-01
Background: Many analyses of microarray association studies involve permutation, bootstrap resampling and cross-validation, which are ideally formulated as embarrassingly parallel computing problems. Given that these analyses are computationally intensive, scalable approaches that can take advantage of multi-core processor systems need to be developed. Results: We have developed a CUDA-based implementation, permGPU, that employs graphics processing units in microarray association studies. We illustrate the performance and applicability of permGPU within the context of permutation resampling for a number of test statistics. An extensive simulation study demonstrates a dramatic increase in performance when using permGPU on an NVIDIA GTX 280 card compared to an optimized C/C++ solution running on a conventional Linux server. Conclusions: permGPU is available as an open-source stand-alone application and as an extension package for the R statistical environment. It provides a dramatic increase in performance for permutation resampling analysis in the context of microarray association studies. The current version offers six test statistics for carrying out permutation resampling analyses for binary, quantitative and censored time-to-event traits.
Maia, Julio Daniel Carvalho; Urquiza Carvalho, Gabriel Aires; Mangueira, Carlos Peixoto; Santana, Sidney Ramos; Cabral, Lucidio Anjos Formiga; Rocha, Gerd B
2012-09-11
In this study, we present some modifications in the semiempirical quantum chemistry MOPAC2009 code that accelerate single-point energy calculations (1SCF) of medium-size (up to 2500 atoms) molecular systems using GPU coprocessors and multithreaded shared-memory CPUs. Our modifications consisted of using a combination of highly optimized linear algebra libraries for both CPU (LAPACK and BLAS from Intel MKL) and GPU (MAGMA and CUBLAS) to hasten time-consuming parts of MOPAC such as the pseudodiagonalization, full diagonalization, and density matrix assembling. We have shown that it is possible to obtain large speedups just by using CPU serial linear algebra libraries in the MOPAC code. As a special case, we show a speedup of up to 14 times for a methanol simulation box containing 2400 atoms and 4800 basis functions, with even greater gains in performance when using multithreaded CPUs (2.1 times in relation to the single-threaded CPU code using linear algebra libraries) and GPUs (3.8 times). This degree of acceleration opens new perspectives for modeling larger structures which appear in inorganic chemistry (such as zeolites and MOFs), biochemistry (such as polysaccharides, small proteins, and DNA fragments), and materials science (such as nanotubes and fullerenes). In addition, we believe that this parallel (GPU-GPU) MOPAC code will make it feasible to use semiempirical methods in lengthy molecular simulations using both hybrid QM/MM and QM/QM potentials.
DEFF Research Database (Denmark)
Damkjær, Jesper; Erleben, Kenny
2009-01-01
hierarchies. Our approach makes it possible to perform non-convex object versus non-convex object collision on the GPU, using tandem traversals of bounding volume hierarchies. Prior work only supports single traversals on GPUs. We introduce a blocked hierarchy data structure, using imaginary nodes...
GPU.proton.DOCK: Genuine Protein Ultrafast proton equilibria consistent DOCKing.
Kantardjiev, Alexander A
2011-07-01
GPU.proton.DOCK (Genuine Protein Ultrafast proton equilibria consistent DOCKing) is a state-of-the-art service for in silico prediction of protein-protein interactions via rigorous and ultrafast docking code. It is unique in providing a stringent account of electrostatic interaction self-consistency and of the mutual proton equilibria effects of the docking partners. GPU.proton.DOCK is the first server offering such a crucial supplement to protein docking algorithms: a step toward more reliable and high-accuracy docking results. The code (especially the Fast Fourier Transform bottleneck and electrostatic field computation) is parallelized to run on a GPU supercomputer. The high performance will be of use for large-scale structural bioinformatics and systems biology projects, thus bridging the physics of the interactions with the analysis of molecular networks. We propose workflows for exploring in silico charge mutagenesis effects. Special emphasis is given to the interface, which is intuitive and user-friendly. The input consists of the atomic coordinate files in PDB format. The advanced user is provided with a special input section for the addition of non-polypeptide charges, extra ionogenic groups with intrinsic pK(a) values, or fixed ions. The output consists of docked complexes in PDB format as well as interactive visualization in a molecular viewer. The GPU.proton.DOCK server can be accessed at http://gpudock.orgchm.bas.bg/.
Fast calculation of computer-generated-hologram on AMD HD5000 series GPU and OpenCL
Shimobaba, Tomoyoshi; Masuda, Nobuyuki; Ichihashi, Yasuyuki; Takada, Naoki
2010-01-01
In this paper, we report fast calculation of a computer-generated-hologram using a new architecture of the HD5000 series GPU (RV870) made by AMD and its new software development environment, OpenCL. Using an RV870 GPU and OpenCL, we can calculate a CGH with a resolution of 1,920 * 1,024 from a 3D object consisting of 1,024 points in 30 milliseconds. The calculation is approximately two times faster than that of a GPU made by NVIDIA.
Fast calculation of computer-generated-hologram on AMD HD5000 series GPU and OpenCL.
Shimobaba, Tomoyoshi; Ito, Tomoyoshi; Masuda, Nobuyuki; Ichihashi, Yasuyuki; Takada, Naoki
2010-05-10
In this paper, we report fast calculation of a computer-generated-hologram using a new architecture of the HD5000 series GPU (RV870) made by AMD and its new software development environment, OpenCL. Using an RV870 GPU and OpenCL, we can calculate a CGH with a resolution of 1,920 x 1,024 from a 3D object consisting of 1,024 points in 30 milliseconds. The calculation is approximately two times faster than that of a GPU made by NVIDIA. (c) 2010 Optical Society of America.
2013-01-01
CUDA * Optimal employment of GPU memory...the GPU using the stream construct within CUDA . Using this technique, a small amount of...input tile data is sent to the GPU initially. Then, while the CUDA kernels process
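The fragment above refers to streaming input tiles to the GPU so that transfers overlap with kernel execution. The generic CUDA pattern uses two (or more) streams with asynchronous copies from pinned host memory; the tile kernel, sizes, and names below are assumptions for illustration only.

#include <cuda_runtime.h>

// Placeholder kernel standing in for the per-tile processing step.
__global__ void process_tile(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// Stream the input to the GPU tile by tile so copies overlap with kernel execution.
// h_in should be pinned host memory (cudaMallocHost) for truly asynchronous copies.
void run_tiled(const float *h_in, float *d_in, float *d_out, int n_tiles, int tile_elems)
{
    cudaStream_t streams[2];
    cudaStreamCreate(&streams[0]);
    cudaStreamCreate(&streams[1]);

    for (int t = 0; t < n_tiles; ++t) {
        cudaStream_t s = streams[t % 2];
        size_t off = (size_t)t * tile_elems;
        cudaMemcpyAsync(d_in + off, h_in + off, tile_elems * sizeof(float),
                        cudaMemcpyHostToDevice, s);
        process_tile<<<(tile_elems + 255) / 256, 256, 0, s>>>(d_in + off, d_out + off, tile_elems);
    }
    cudaStreamSynchronize(streams[0]);
    cudaStreamSynchronize(streams[1]);
    cudaStreamDestroy(streams[0]);
    cudaStreamDestroy(streams[1]);
}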
A Fast Poisson Solver with Periodic Boundary Conditions for GPU Clusters in Various Configurations
Rattermann, Dale Nicholas
Fast Poisson solvers using the Fast Fourier Transform on uniform grids are especially suited for parallel implementation, making them appropriate for portability on graphical processing unit (GPU) devices. The goal of the following work was to implement, test, and evaluate a fast Poisson solver for periodic boundary conditions for use on a variety of GPU configurations. The solver used in this research was FLASH, an immersed-boundary-based method, which is well suited for complex, time-dependent geometries, has robust adaptive mesh refinement/de-refinement capabilities to capture evolving flow structures, and has been successfully implemented on conventional, parallel supercomputers. However, these solvers are still computationally costly to employ, and the total solver time is dominated by the solution of the pressure Poisson equation using state-of-the-art multigrid methods. FLASH improves the performance of its multigrid solvers by integrating a parallel FFT solver on a uniform grid during a coarse level. This hybrid solver could then be theoretically improved by replacing the highly-parallelizable FFT solver with one that utilizes GPUs, and, thus, was the motivation for my research. In the present work, the CPU-utilizing parallel FFT solver (PFFT) used in the base version of FLASH for solving the Poisson equation on uniform grids has been modified to enable parallel execution on CUDA-enabled GPU devices. New algorithms have been implemented to replace the Poisson solver that decompose the computational domain and send each new block to a GPU for parallel computation. One-dimensional (1-D) decomposition of the computational domain minimizes the amount of network traffic involved in this bandwidth-intensive computation by limiting the amount of all-to-all communication required between processes. Advanced techniques have been incorporated and implemented in a GPU-centric code design, while allowing end users the flexibility of parameter control at runtime in
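As background to the FFT-based coarse-level solve described above, a periodic Poisson problem $\nabla^2 u = f$ on a uniform grid can be solved spectrally: transform $f$, divide each mode by $-(k_x^2+k_y^2+k_z^2)$ while zeroing the mean mode, and transform back. The single-GPU cuFFT sketch below illustrates that idea under the assumption of a $2\pi$-periodic box; it is not FLASH's decomposed solver.

#include <cufft.h>

// Divide each Fourier mode of f_hat by -(kx^2 + ky^2 + kz^2) and zero the mean mode.
// Also folds in the 1/(nx*ny*nz) normalization of the forward/inverse FFT pair.
__global__ void scale_modes(cufftDoubleComplex *f_hat, int nx, int ny, int nz)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int n = nx * ny * nz;
    if (idx >= n) return;
    int i = idx / (ny * nz), j = (idx / nz) % ny, k = idx % nz;
    double kx = (i <= nx / 2) ? i : i - nx;
    double ky = (j <= ny / 2) ? j : j - ny;
    double kz = (k <= nz / 2) ? k : k - nz;
    double k2 = kx * kx + ky * ky + kz * kz;
    double s = (k2 > 0.0) ? -1.0 / (k2 * n) : 0.0;
    f_hat[idx].x *= s;
    f_hat[idx].y *= s;
}

// Host side: forward FFT, spectral scaling, inverse FFT, all in place on d_data.
void poisson_periodic(cufftDoubleComplex *d_data, int nx, int ny, int nz)
{
    cufftHandle plan;
    cufftPlan3d(&plan, nx, ny, nz, CUFFT_Z2Z);
    cufftExecZ2Z(plan, d_data, d_data, CUFFT_FORWARD);
    int n = nx * ny * nz;
    scale_modes<<<(n + 255) / 256, 256>>>(d_data, nx, ny, nz);
    cufftExecZ2Z(plan, d_data, d_data, CUFFT_INVERSE);
    cufftDestroy(plan);
}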
ACCELERATING CALCULATION OF CHOLESKY FACTORISATION OF MATRIX WITH GPU
Institute of Scientific and Technical Information of China (English)
沈聪; 高火涛
2016-01-01
A concrete implementation of Cholesky factorisation on a graphics processing unit (GPU) for large real symmetric positive definite matrices is described in this article. We analyse in detail the hybrid parallel algorithm presented by Volkov for computing the Cholesky factorisation. On that basis, and according to the computational performance of the CPU and GPU on our own computers, we present a more reasonable hybrid three-phase scheduling strategy, which further reduces the idle time of the CPU and avoids leaving the GPU idle. Numerical experiments show that the new hybrid scheduling algorithm achieves a speedup of more than 5 times compared with the standard MKL algorithm when the order of the matrix is larger than 7000, and it also noticeably outperforms the original hybrid algorithm of Volkov.
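For comparison with the hybrid CPU/GPU scheduling described above, NVIDIA's cuSOLVER library exposes a dense Cholesky factorisation that runs entirely on the GPU. A minimal double-precision call sequence is sketched below; it is a generic illustration, not the authors' hybrid code.

#include <cusolverDn.h>
#include <cuda_runtime.h>

// Factor a symmetric positive definite n x n matrix stored on the device
// (column-major, lower triangle) in place: A = L * L^T. Returns the devInfo value.
int cholesky_gpu(double *d_A, int n)
{
    cusolverDnHandle_t handle;
    cusolverDnCreate(&handle);

    int lwork = 0;
    cusolverDnDpotrf_bufferSize(handle, CUBLAS_FILL_MODE_LOWER, n, d_A, n, &lwork);

    double *d_work; int *d_info;
    cudaMalloc(&d_work, lwork * sizeof(double));
    cudaMalloc(&d_info, sizeof(int));

    cusolverDnDpotrf(handle, CUBLAS_FILL_MODE_LOWER, n, d_A, n, d_work, lwork, d_info);

    int info = 0;
    cudaMemcpy(&info, d_info, sizeof(int), cudaMemcpyDeviceToHost);  // 0 means success

    cudaFree(d_work);
    cudaFree(d_info);
    cusolverDnDestroy(handle);
    return info;
}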
Chi, Yujie; Tian, Zhen; Jia, Xun
2016-08-01
Monte Carlo (MC) particle transport simulation on a graphics processing unit (GPU) platform has been extensively studied recently due to the efficiency advantage achieved via massive parallelization. Almost all of the existing GPU-based MC packages were developed for voxelized geometry, which limits their application scope. The purpose of this paper is to develop a module to model parametric geometry and integrate it into GPU-based MC simulations. In our module, each continuous region is defined by its bounding surfaces, which are parameterized by quadratic functions. Particle navigation functions in this geometry were developed. The module was incorporated into two previously developed GPU-based MC packages and was tested on two example problems: (1) low-energy photon transport simulation in a brachytherapy case with a shielded cylinder applicator and (2) MeV coupled photon/electron transport simulation in a phantom containing several inserts of different shapes. In both cases, the calculated dose distributions agreed well with those calculated in the corresponding voxelized geometry; the averaged dose differences were 1.03% and 0.29%, respectively. We also used the developed package to perform simulations of a Varian VS 2000 brachytherapy source and generated a phase-space file. The computation time under the parameterized geometry depended on the memory location storing the geometry data; the highest computational speed was achieved when the data were stored in the GPU's shared memory. Incorporation of parameterized geometry yielded a computation time that was ~3 times that of the corresponding voxelized geometry. We also developed a strategy that uses an auxiliary index array to reduce the frequency of geometry calculations and hence improve efficiency. With this strategy, the computation time ranged from 1.75 to 2.03 times that of the voxelized geometry for coupled photon/electron transport, depending on the voxel dimension of the auxiliary index array, and in 0
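A core ingredient of parametric-geometry navigation is computing the distance from a particle to a quadric bounding surface along its direction of flight. The device function below is a hedged sketch, not code from the cited packages; it handles an axis-aligned quadric without cross terms, whereas the module described above supports general quadratic surfaces.

```cuda
// Hedged sketch: ray-quadric intersection for particle navigation in
// parameterized geometry. The surface is A x^2 + B y^2 + C z^2
// + D x + E y + F z + G = 0 (no cross terms, an assumption for brevity).
#include <math.h>

struct Quadric { double A, B, C, D, E, F, G; };

// Distance from position p along unit direction d to the surface, or a large
// sentinel value if the ray never crosses it in the forward direction.
__device__ double distance_to_quadric(const Quadric& q,
                                      const double3 p, const double3 d) {
    const double INF = 1.0e30;
    // Substituting x = p + t*d into the quadric gives a*t^2 + b*t + c = 0.
    double a = q.A * d.x * d.x + q.B * d.y * d.y + q.C * d.z * d.z;
    double b = 2.0 * (q.A * p.x * d.x + q.B * p.y * d.y + q.C * p.z * d.z)
             + q.D * d.x + q.E * d.y + q.F * d.z;
    double c = q.A * p.x * p.x + q.B * p.y * p.y + q.C * p.z * p.z
             + q.D * p.x + q.E * p.y + q.F * p.z + q.G;

    if (fabs(a) < 1e-14) {                  // equation degenerates to linear
        return (b != 0.0 && -c / b > 0.0) ? -c / b : INF;
    }
    double disc = b * b - 4.0 * a * c;
    if (disc < 0.0) return INF;             // no real intersection
    double sq = sqrt(disc);
    double t1 = (-b - sq) / (2.0 * a);
    double t2 = (-b + sq) / (2.0 * a);
    if (t1 > 1e-12) return t1;              // nearest crossing ahead of particle
    if (t2 > 1e-12) return t2;
    return INF;
}
```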
Providing Source Code Level Portability Between CPU and GPU with MapCG
Institute of Scientific and Technical Information of China (English)
Chun-Tao Hong; De-Hao Chen; Yu-Bei Chen; Wen-Guang Chen; Wei-Min Zheng; Hai-Bo Lin
2012-01-01
Graphics processing units (GPUs) have taken an important role in the general purpose computing market in recent years. At present, the common approach to programming GPUs is to write GPU-specific code with low-level GPU APIs such as CUDA. Although this approach can achieve good performance, it creates serious portability issues, as programmers are required to write a specific version of the code for each potential target architecture. This results in high development and maintenance costs. We believe it is desirable to have a programming model which provides source code portability between CPUs and GPUs, as well as between different GPUs. This would allow programmers to write one version of the code, which can be compiled and executed on either CPUs or GPUs efficiently without modification. In this paper, we propose MapCG, a MapReduce framework to provide source code level portability between CPUs and GPUs. In contrast to other approaches such as OpenCL, our framework, based on MapReduce, provides a high-level programming model and makes programming much easier. We describe the design of MapCG, including the MapReduce-style high-level programming framework and the runtime system on the CPU and GPU. A prototype of the MapCG runtime, supporting multi-core CPUs and NVIDIA GPUs, was implemented. Our experimental results show that this implementation can execute the same source code efficiently on multi-core CPU platforms and GPUs, achieving an average speedup of 1.6-2.5x over previous implementations of MapReduce on eight commonly used applications.
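The single-source portability idea can be illustrated, without reproducing MapCG's actual API, by a map operation declared __host__ __device__ and driven either by a CPU loop or a CUDA kernel. The launcher structure and names below are assumptions introduced for the example.

```cuda
// Hedged sketch of single-source CPU/GPU portability for a map operation.
// Not MapCG's API; map_op, map_kernel, and run_map are illustrative names.
#include <cuda_runtime.h>
#include <vector>

// User-supplied map operation, compiled for both host and device.
__host__ __device__ inline float map_op(float x) { return x * x; }

__global__ void map_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = map_op(in[i]);
}

// One entry point; the same map_op runs as a CPU loop or a GPU kernel.
void run_map(const std::vector<float>& in, std::vector<float>& out, bool use_gpu) {
    int n = static_cast<int>(in.size());
    out.resize(n);
    if (!use_gpu) {                          // CPU path: plain loop
        for (int i = 0; i < n; ++i) out[i] = map_op(in[i]);
        return;
    }
    float *d_in, *d_out;                     // GPU path: same map_op in a kernel
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    map_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaMemcpy(out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_out);
}
```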