WorldWideScience

Sample records for units dual-gpus architecture

  1. Heterogeneous System Architectures from APUs to discrete GPUs

    CERN Multimedia

    CERN. Geneva

    2013-01-01

    We will present the Heterogeneous System Architecture that new AMD processors bring with the new GCN-based GPUs and the new APUs. We will show how, together, they represent a huge step forward in programming flexibility and in performance efficiency for compute workloads.

  2. Finite Temperature Lattice QCD with GPUs

    International Nuclear Information System (INIS)

    Cardoso, N.; Cardoso, M.; Bicudo, P.

    2011-01-01

    Graphics Processing Units (GPUs) are being used in many areas of physics, since their performance-to-cost ratio is very attractive. GPUs can be programmed with CUDA, NVIDIA's parallel computing architecture, which enables dramatic increases in computing performance by harnessing the power of the GPU. We present a performance comparison between the GPU and the CPU, in single and double precision, for the generation of lattice SU(2) configurations. Analyses with single and multiple GPUs, using CUDA and OpenMP, are also presented. We also present SU(2) results for the renormalized Polyakov loop, the colour-averaged free energy and the string tension as a function of temperature. (authors)

  3. GPUs, a new tool of acceleration in CFD: efficiency and reliability on smoothed particle hydrodynamics methods.

    Directory of Open Access Journals (Sweden)

    Alejandro C Crespo

    Smoothed Particle Hydrodynamics (SPH) is a numerical method commonly used in Computational Fluid Dynamics (CFD) to simulate complex free-surface flows. Simulations with this mesh-free particle method far exceed the capacity of a single processor. In this paper, as part of a dual-functioning code for either central processing units (CPUs) or graphics processing units (GPUs), a parallelisation using GPUs is presented. The GPU parallelisation technique uses the Compute Unified Device Architecture (CUDA) of NVIDIA devices. Simulations with more than one million particles on a single GPU card exhibit speedups of up to two orders of magnitude over a single-core CPU. It is demonstrated that the code achieves different speedups on different CUDA-enabled GPUs. The numerical behaviour of the SPH code is validated with a standard benchmark test case of dam-break flow impacting an obstacle, where good agreement with the experimental results is observed. Both the achieved speedups and the quantitative agreement with experiments suggest that CUDA-based GPU programming can be used in SPH methods with efficiency and reliability.

  4. Symplectic multi-particle tracking on GPUs

    Science.gov (United States)

    Liu, Zhicong; Qiang, Ji

    2018-05-01

    A symplectic multi-particle tracking model is implemented on Graphics Processing Units (GPUs) using the Compute Unified Device Architecture (CUDA) language. The symplectic tracking model preserves phase-space structure and reduces non-physical effects in long-term simulations, which is important for evaluating beam properties in particle accelerators. Though this model is computationally expensive, it is very suitable for parallelization and can be accelerated significantly using GPUs. In this paper, we optimized the implementation of the symplectic tracking model on both a single GPU and multiple GPUs. Using a single GPU processor, the code achieves a speedup of 2-10x for a range of problem sizes compared with the time on a single state-of-the-art Central Processing Unit (CPU) node with similar power consumption and semiconductor technology. It also shows good scalability on a multi-GPU cluster at the Oak Ridge Leadership Computing Facility. In an application to beam dynamics simulation, the GPU implementation saves more than a factor of two in total computing time in comparison to the CPU implementation.
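
    For illustration, a minimal CUDA sketch of one symplectic "drift-kick" step with one thread per particle; the 1D linear-focusing model and all names here are our assumptions, not the authors' code:

    ```cuda
    // One semi-implicit (symplectic) Euler step per particle.
    // Illustrative only: the paper's maps and optimizations are more involved.
    #include <cuda_runtime.h>

    struct Particle { double x, px; };   // 1D transverse phase-space coordinates

    __global__ void driftKickStep(Particle* p, int n, double dt, double k)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        p[i].x  += dt * p[i].px;     // drift: advance position with current momentum
        p[i].px -= dt * k * p[i].x;  // kick: linear focusing, phase-space-area preserving
    }

    // Launch sketch: driftKickStep<<<(n + 255) / 256, 256>>>(d_p, n, dt, k);
    ```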

  5. Accelerated radiotherapy planning calculations by parallelization with GPUs

    International Nuclear Information System (INIS)

    Reinado, D.; Cozar, J.; Alonso, S.; Chinillach, N.; Cortina, T.; Ricos, B.; Diez, S.

    2011-01-01

    In this work we have developed and tested a subroutine parallelized for graphics processing unit (GPU) architectures, to be applied to calculations using a known standard-algorithm code. The experience acquired during these tests will also be applied to Monte Carlo calculations in radiotherapy once that code is available.

  6. Numerical computations with GPUs

    CERN Document Server

    Kindratenko, Volodymyr

    2014-01-01

    This book brings together research on numerical methods adapted for Graphics Processing Units (GPUs). It explains recent efforts to adapt classic numerical methods, including solution of linear equations and FFT, for massively parallel GPU architectures. This volume consolidates recent research and adaptations, covering widely used methods that are at the core of many scientific and engineering computations. Each chapter is written by authors working on a specific group of methods; these leading experts provide mathematical background, parallel algorithms and implementation details leading to ...

  7. Comparison of some parallelization strategies of thermalhydraulic codes on GPUs

    International Nuclear Information System (INIS)

    Jendoubi, T.; Bergeaud, V.; Geay, A.

    2013-01-01

    Modern supercomputer architectures are now often based on hybrid concepts, combining distributed-memory parallelism, shared-memory parallelism, and GPUs (graphics processing units). In this work, we propose a new approach to take advantage of these graphics cards in thermohydraulics algorithms. (authors)

  8. GPUs for the realtime low-level trigger of the NA62 experiment at CERN

    CERN Document Server

    Ammendola, R; Biagioni, A; Chiozzi, S; Cotta Ramusino, A; Fantechi, R; Fiorini, M; Gianoli, A; Graverini, E; Lamanna, G; Lonardo, A; Messina, A; Neri, I; Pantaleo, F; Paolucci, P S; Piandani, R; Pontisso, L; Simula, F; Sozzi, M; Vicini, P

    2015-01-01

    A pilot project for the use of GPUs (graphics processing units) in online triggering applications for high energy physics (HEP) experiments is presented. GPUs offer a highly parallel architecture, with most of the chip resources devoted to computation; they make it possible to achieve a large computing power using a limited amount of space and power. The application of online parallel computing on GPUs is shown for the synchronous low-level trigger of the NA62 experiment at CERN. Direct GPU communication using an FPGA-based board has been exploited to reduce the data transmission latency, and results from a first field test at CERN are highlighted. This work is part of a wider project named GAP (GPU Application Project), intended to study the use of GPUs in real-time applications in both HEP and medical imaging.

  9. A massively parallel method of characteristic neutral particle transport code for GPUs

    International Nuclear Information System (INIS)

    Boyd, W. R.; Smith, K.; Forget, B.

    2013-01-01

    Over the past 20 years, parallel computing has enabled computers to grow ever larger and more powerful while scientific applications have advanced in sophistication and resolution. This trend is being challenged, however, as the power consumption of conventional parallel computing architectures has risen to unsustainable levels and memory limitations have come to dominate compute performance. Heterogeneous computing platforms, such as those based on Graphics Processing Units (GPUs), are an increasingly popular paradigm for addressing these issues. This paper explores the applicability of GPUs to deterministic neutron transport. A 2D method of characteristics (MOC) code, OpenMOC, has been developed with solvers for both shared-memory multi-core platforms and GPUs. The multi-threading and memory-locality methodologies for the GPU solver are presented. Performance results for the 2D C5G7 benchmark demonstrate a 25-35x speedup for MOC on the GPU. The lessons learned from this case study will provide the basis for further exploration of MOC on GPUs, as well as design decisions for hardware vendors exploring technologies for the next generation of machines for scientific computing. (authors)

  10. Exploiting GPUs in Virtual Machine for BioCloud

    Science.gov (United States)

    Jo, Heeseung; Jeong, Jinkyu; Lee, Myoungho; Choi, Dong Hoon

    2013-01-01

    Recently, biological applications have begun to be reimplemented as applications that exploit the many cores of GPUs for better computational performance. By providing virtualized GPUs to VMs in a cloud computing environment, many biological applications will willingly move into the cloud to enhance their computational performance and to utilize abundant cloud computing resources while reducing the expense of computation. In this paper, we propose a BioCloud system architecture that enables VMs to use GPUs in a cloud environment. Much of the previous research has focused on mechanisms for sharing GPUs among VMs, and therefore cannot achieve sufficient performance for biological applications, for which computational throughput is more crucial than sharing. The proposed system exploits the pass-through mode of the PCI Express (PCI-E) channel: by allowing each VM to access the underlying GPUs directly, applications achieve almost the same performance as in a native environment. In addition, our scheme multiplexes GPUs by using the hot plug-in/out device features of the PCI-E channel. By adding or removing GPUs from each VM on demand, VMs in the same physical host can time-share the GPUs. We implemented the proposed system using the Xen VMM and NVIDIA GPUs and showed that our prototype is highly effective for biological GPU applications in a cloud environment. PMID:23710465
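
    As a hedged illustration of what direct access buys (our sketch, not code from the paper): inside a guest VM with a passed-through GPU, the stock CUDA runtime sees the device exactly as on bare metal.

    ```cuda
    // Host-only check that a passed-through GPU is visible inside the VM.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        int count = 0;
        if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
            std::printf("No GPU visible in this VM (passthrough inactive?)\n");
            return 1;
        }
        for (int d = 0; d < count; ++d) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, d);
            std::printf("GPU %d: %s, %zu MiB\n", d, prop.name,
                        prop.totalGlobalMem >> 20);
        }
        return 0;
    }
    ```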

  11. Designing scientific applications on GPUs

    CERN Document Server

    Couturier, Raphael

    2013-01-01

    Many of today's complex scientific applications require a vast amount of computational power. General-purpose graphics processing units (GPGPUs) enable researchers in a variety of fields to benefit from the computational power of all the cores available inside graphics cards. Designing Scientific Applications on GPUs shows you how to use GPUs for applications in diverse scientific fields, from physics and mathematics to computer science. The book explains the methods necessary for designing or porting your scientific applications ...

  12. Parallel iterative solution of the Hermite Collocation equations on GPUs II

    International Nuclear Information System (INIS)

    Vilanakis, N; Mathioudakis, E

    2014-01-01

    Hermite Collocation is a high-order finite element method for boundary value problems modelling applications in several fields of science and engineering. Applying this integration-free numerical solver to linear BVPs results in a large, sparse, general system of algebraic equations, which suggests the use of an efficient iterative solver, especially for realistic simulations. In part I of this work, an efficient parallel algorithm based on the Schur complement method coupled with the Bi-Conjugate Gradient Stabilized (BiCGSTAB) iterative solver was designed for multicore computing architectures with a Graphics Processing Unit (GPU). In the present work the proposed algorithm has been extended to high performance computing environments consisting of multiprocessor machines with multiple GPUs. Since this is a distributed-GPU and shared-CPU-memory parallel architecture, a hybrid memory treatment is needed in the development of the parallel algorithm. The algorithm was realized on an HP SL390 multiprocessor machine with Tesla M2070 GPUs, using the OpenMP and OpenACC standards. Execution time measurements reveal the efficiency of the parallel implementation.

  13. Utilizing the Double-Precision Floating-Point Computing Power of GPUs for RSA Acceleration

    Directory of Open Access Journals (Sweden)

    Jiankuo Dong

    2017-01-01

    Asymmetric cryptographic algorithms (e.g., RSA and Elliptic Curve Cryptography) implemented on Graphics Processing Units (GPUs) have been researched for over a decade. The basic idea of most previous contributions is to exploit the highly parallel GPU architecture and port the integer-based algorithms from general-purpose CPUs to GPUs, to offer high performance. However, the great potential cryptographic computing power of GPUs, especially through the more powerful floating-point instructions, has not in fact been comprehensively investigated. In this paper, we fully exploit the floating-point computing power of GPUs through various designs, including a floating-point-based Montgomery multiplication/exponentiation algorithm and a Chinese Remainder Theorem (CRT) implementation on the GPU. For practical use of the proposed algorithm, a new method converts the input/output between octet strings and floating-point numbers, fully utilizing GPUs and further improving the overall performance by about 5%. The performance of RSA-2048/3072/4096 decryption on an NVIDIA GeForce GTX TITAN reaches 42,211/12,151/5,790 operations per second, respectively, which is 13 times the performance of the previous fastest floating-point-based implementation (published at Eurocrypt 2009). The RSA-4096 decryption exceeds the existing fastest integer-based result by 23%.
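
    The core floating-point trick such designs rest on can be shown in a few lines. This two-product via fused multiply-add is a standard device and only our guess at the building block, not the paper's kernel:

    ```cuda
    // Exact double-double product via FMA: a*b == hi + lo with no rounding loss.
    // With big-integer limbs kept small (e.g., ~26 bits), every partial product
    // in a Montgomery multiplication can be accumulated exactly in doubles.
    __device__ void twoProd(double a, double b, double& hi, double& lo)
    {
        hi = a * b;
        lo = fma(a, b, -hi);   // residual rounded away from hi, recovered exactly
    }
    ```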

  14. Monte Carlo method for neutron transport calculations in graphics processing units (GPUs)

    International Nuclear Information System (INIS)

    Pellegrino, Esteban

    2011-01-01

    Monte Carlo simulation is well suited for solving the Boltzmann neutron transport equation in inhomogeneous media with complicated geometries. However, routine applications require the computation time to be reduced to hours or even minutes on a desktop PC. Interest in adopting Graphics Processing Units (GPUs) for Monte Carlo acceleration is growing rapidly. This is due to the massive parallelism provided by the latest GPU technologies, which is the most promising solution to the challenge of performing full-size reactor core analysis on a routine basis. In this study, Monte Carlo codes for a fixed-source neutron transport problem were developed for GPU environments in order to evaluate issues associated with computational speedup using GPUs. The results obtained in this work suggest that a speedup of several orders of magnitude is possible using state-of-the-art GPU technologies. (author)

  15. SU(2) lattice gauge theory simulations on Fermi GPUs

    International Nuclear Information System (INIS)

    Cardoso, Nuno; Bicudo, Pedro

    2011-01-01

    In this work we explore the performance of CUDA in quenched lattice SU(2) simulations. CUDA (NVIDIA Compute Unified Device Architecture) is a hardware and software architecture developed by NVIDIA for computing on the GPU. We present an analysis and performance comparison between the GPU and the CPU in single and double precision. Analyses with multiple GPUs and two different architectures (G200 and Fermi) are also presented. To obtain high performance, the code must be optimized for the GPU architecture, i.e., implemented so as to exploit the memory hierarchy of the CUDA programming model. We produce codes for the Monte Carlo generation of SU(2) lattice gauge configurations, for the mean plaquette, for the Polyakov loop at finite T, and for the Wilson loop. We also present results for the potential using many configurations (50,000) without smearing and almost 2000 configurations with APE smearing. With two Fermi GPUs we achieve an excellent performance of 200x over one CPU in single precision, at around 110 Gflops/s. We also find that, on the Fermi architecture, double-precision computations of the static quark-antiquark potential are not much slower (less than 2x) than single-precision computations.

  16. Adaptive Optics Simulation for the World's Largest Telescope on Multicore Architectures with Multiple GPUs

    KAUST Repository

    Ltaief, Hatem

    2016-06-02

    We present a comprehensive high-performance implementation of a multi-object adaptive optics (MOAO) simulation on multicore architectures with hardware accelerators in the context of computational astronomy. This implementation will be used as an operational testbed for simulating the design of new instruments for the European Extremely Large Telescope (E-ELT) project, the world's biggest eye and one of Europe's highest priorities in ground-based astronomy. The simulation corresponds to a multi-step, multi-stage procedure which is fed, in near real time, by system and turbulence data coming from the telescope environment. Based on the PLASMA library powered by the OmpSs dynamic runtime system, our implementation relies on a task-based programming model to permit asynchronous out-of-order execution. Using modern multicore architectures associated with the enormous computing power of GPUs, the resulting data-driven, compute-intensive simulation of the entire MOAO application, composed of the tomographic reconstructor and the observing sequence, is capable of coping with the aforementioned real-time challenge and stands as a reference implementation for the computational astronomy community.

  17. A convolution-superposition dose calculation engine for GPUs

    Energy Technology Data Exchange (ETDEWEB)

    Hissoiny, Sami; Ozell, Benoit; Despres, Philippe [Departement de genie informatique et genie logiciel, Ecole polytechnique de Montreal, 2500 Chemin de Polytechnique, Montreal, Quebec H3T 1J4 (Canada); Departement de radio-oncologie, CRCHUM-Centre hospitalier de l'Universite de Montreal, 1560 rue Sherbrooke Est, Montreal, Quebec H2L 4M1 (Canada)

    2010-03-15

    Purpose: Graphics processing units (GPUs) are increasingly used for scientific applications, where their parallel architecture and unprecedented computing power density can be exploited to accelerate calculations. In this paper, a new GPU implementation of a convolution/superposition (CS) algorithm is presented. Methods: This new GPU implementation has been designed from the ground up to use the graphics card's strengths and to avoid its weaknesses. The CS GPU algorithm takes into account beam hardening, off-axis softening, kernel tilting, and relies heavily on raytracing through patient imaging data. Implementation details are reported as well as a multi-GPU solution. Results: An overall single-GPU acceleration factor of 908x was achieved when compared to a nonoptimized version of the CS algorithm implemented in PlanUNC in single-threaded central processing unit (CPU) mode, resulting in approximately 2.8 s per beam for a 3D dose computation on a 0.4 cm grid. A comparison to an established commercial system leads to an acceleration factor of approximately 29x, or 0.58 versus 16.6 s per beam in single-threaded mode. An acceleration factor of 46x has been obtained for the total energy released per mass (TERMA) calculation and a 943x acceleration factor for the CS calculation compared to PlanUNC. Dose distributions also have been obtained for a simple water-lung phantom to verify that the implementation gives accurate results. Conclusions: These results suggest that GPUs are an attractive solution for radiation therapy applications and that careful design, taking the GPU architecture into account, is critical in obtaining significant acceleration factors. These results potentially can have a significant impact on complex dose delivery techniques requiring intensive dose calculations such as intensity-modulated radiation therapy (IMRT) and arc therapy. They also are relevant for adaptive radiation therapy where dose results must be obtained rapidly.

  1. Use of GPUs in Trigger Systems

    Science.gov (United States)

    Lamanna, Gianluca

    In recent years, interest in using graphics processors (GPUs) for general-purpose high-performance computing has been constantly rising. In this paper we discuss the possible use of GPUs to construct a fast and effective real-time trigger system, at both the software and hardware levels. In particular, we study the integration of such a system in the NA62 trigger. The first application of GPUs to ring pattern recognition in the RICH will be presented. The results obtained show that there are no showstoppers for trigger systems with relatively low latency. Thanks to the use of off-the-shelf technology, in continuous development for purposes related to the video game and image processing markets, the architecture described could easily be exported to other experiments, to build a versatile and fully customizable online selection.

  2. Evaluation of vectorized Monte Carlo algorithms on GPUs for a neutron Eigenvalue problem

    International Nuclear Information System (INIS)

    Du, X.; Liu, T.; Ji, W.; Xu, X. G.; Brown, F. B.

    2013-01-01

    Conventional Monte Carlo (MC) methods for radiation transport computations are 'history-based', which means that one particle history at a time is tracked. Simulations based on such methods suffer from thread divergence on the graphics processing unit (GPU), which severely affects the performance of GPUs. To circumvent this limitation, event-based vectorized MC algorithms can be utilized. A versatile software test-bed, called ARCHER - Accelerated Radiation-transport Computations in Heterogeneous Environments - was used for this study. ARCHER facilitates the development and testing of a MC code based on the vectorized MC algorithm implemented on GPUs by using NVIDIA's Compute Unified Device Architecture (CUDA). The ARCHER GPU code was designed to solve a neutron eigenvalue problem and was tested on a NVIDIA Tesla M2090 Fermi card. We found that although the vectorized MC method significantly reduces the occurrence of divergent branching and enhances the warp execution efficiency, the overall simulation speed is ten times slower than the conventional history-based MC method on GPUs. By analyzing detailed GPU profiling information from ARCHER, we discovered that the main reason was the large amount of global memory transactions, causing severe memory access latency. Several possible solutions to alleviate the memory latency issue are discussed. (authors)
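
    The contrast between history-based and event-based tracking is easiest to see in kernel form. A schematic event kernel follows (our sketch under assumed names, not the ARCHER source); every active particle performs the same event, so warps stay converged, but each array access is a global-memory transaction, which is the latency cost the authors identified:

    ```cuda
    // One event applied to all active particles in lockstep (event-based MC).
    #include <cuda_runtime.h>

    __global__ void sampleFlightDistance(const double* sigmaT,  // per-particle cross section
                                         const double* xi,      // uniform random numbers
                                         double* dist, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        // Same instruction stream for every thread: no history-dependent branch,
        // but three global-memory transactions per particle.
        dist[i] = -log(xi[i]) / sigmaT[i];
    }
    ```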

  3. Decoupled Vector-Fetch Architecture with a Scalarizing Compiler

    OpenAIRE

    Lee, Yunsup

    2016-01-01

    As we approach the end of conventional technology scaling, computer architects are forced to incorporate specialized and heterogeneous accelerators into general-purpose processors for greater energy efficiency. Among the prominent accelerators that have recently become more popular are data-parallel processing units, such as classic vector units, SIMD units, and graphics processing units (GPUs). Surveying a wide range of data-parallel architectures and their parallel programming models and ...

  4. A Dual Launch Robotic and Human Lunar Mission Architecture

    Science.gov (United States)

    Jones, David L.; Mulqueen, Jack; Percy, Tom; Griffin, Brand; Smitherman, David

    2010-01-01

    This paper describes a comprehensive lunar exploration architecture developed by Marshall Space Flight Center's Advanced Concepts Office that features a science-based surface exploration strategy and a transportation architecture that uses two launches of a heavy lift launch vehicle to deliver human and robotic mission systems to the moon. The principal advantage of the dual launch lunar mission strategy is the reduced cost and risk resulting from the development of just one launch vehicle system. The dual launch lunar mission architecture may also enhance opportunities for commercial and international partnerships by using expendable launch vehicle services for robotic missions or development of surface exploration elements. Furthermore, this architecture is particularly suited to the integration of robotic and human exploration to maximize science return. For surface operations, an innovative dual-mode rover is presented that is capable of performing robotic science exploration as well as transporting human crew conducting surface exploration. The dual-mode rover can be deployed to the lunar surface to perform precursor science activities, collect samples, scout potential crew landing sites, and meet the crew at a designated landing site. With this approach, the crew is able to evaluate the robotically collected samples to select the best samples for return to Earth to maximize the scientific value. The rovers can continue robotic exploration after the crew leaves the lunar surface. The transportation system for the dual launch mission architecture uses a lunar-orbit-rendezvous strategy. Two heavy lift launch vehicles depart from Earth within a six hour period to transport the lunar lander and crew elements separately to lunar orbit. In lunar orbit, the crew transfer vehicle docks with the lander and the crew boards the lander for descent to the surface. After the surface mission, the crew returns to the orbiting transfer vehicle for the return to Earth. ...

  5. Real-time radar signal processing using GPGPU (general-purpose graphic processing unit)

    Science.gov (United States)

    Kong, Fanxing; Zhang, Yan Rockee; Cai, Jingxiao; Palmer, Robert D.

    2016-05-01

    This study introduces a practical approach to developing a real-time signal processing chain for a general phased-array radar on NVIDIA GPUs (Graphics Processing Units) using CUDA (Compute Unified Device Architecture) libraries such as cuBLAS and cuFFT, which are adopted from open-source libraries and optimized for NVIDIA GPUs. The processed results are rigorously verified against those from the CPUs. Performance, benchmarked as computation time for various input data cube sizes, is compared across GPUs and CPUs. Through this analysis, it is demonstrated that real-time GPGPU (general-purpose GPU) processing of array radar data is possible with relatively low-cost commercial GPUs.
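
    As a sketch of how such a chain leans on the vendor libraries (sizes and names are assumed, not the authors' code), a batched cuFFT call transforms all pulses of a data cube in one shot:

    ```cuda
    // Batched in-place 1D FFT across the pulses of a radar data cube.
    #include <cufft.h>

    void rangeFFT(cufftComplex* d_cube, int nfft, int batch)
    {
        cufftHandle plan;
        cufftPlan1d(&plan, nfft, CUFFT_C2C, batch);        // one plan, many pulses
        cufftExecC2C(plan, d_cube, d_cube, CUFFT_FORWARD); // executes on the GPU
        cufftDestroy(plan);
    }
    ```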

  6. Deep Packet/Flow Analysis using GPUs

    Energy Technology Data Exchange (ETDEWEB)

    Gong, Qian [Fermi National Accelerator Lab. (FNAL), Batavia, IL (United States); Wu, Wenji [Fermi National Accelerator Lab. (FNAL), Batavia, IL (United States); DeMar, Phil [Fermi National Accelerator Lab. (FNAL), Batavia, IL (United States)

    2017-11-12

    Deep packet inspection (DPI) faces severe performance challenges in high-speed networks (40/100 GE), as it requires a large amount of raw computing power and high I/O throughput. Recently, researchers have tentatively used GPUs to address these issues and boost the performance of DPI. Typically, DPI applications involve highly complex operations at both the per-packet and per-flow data level, often in real time. The parallel architecture of GPUs fits exceptionally well for per-packet network traffic processing. However, for stateful network protocols such as TCP, the data stream needs to be reconstructed at a per-flow level to deliver a consistent content analysis. Since the flow-centric operations are naturally antiparallel and often require large memory space for buffering out-of-sequence packets, they can be problematic for GPUs, whose memory is normally limited to several gigabytes. In this work, we present a highly efficient GPU-based deep packet/flow analysis framework. The proposed design includes a purely GPU-implemented flow tracking and TCP stream reassembly. Instead of buffering packets and waiting for them to come into sequence, our framework processes the packets in batches and uses a deterministic finite automaton (DFA) with a prefix-/suffix-tree method to detect patterns across out-of-sequence packets that happen to be located in different batches. Evaluation shows that our code can reassemble and forward tens of millions of packets per second and conduct stateful signature-based deep packet inspection at 55 Gbit/s using an NVIDIA K40 GPU.
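
    A toy version of the per-packet matching core (our illustration, not the framework's code) shows why this maps well to GPUs: each thread independently walks one payload through the DFA transition table:

    ```cuda
    // One thread per packet: walk the payload through a DFA transition table.
    __global__ void dfaMatch(const unsigned char* payload, const int* offset,
                             const int* nextState,          // [numStates * 256]
                             const unsigned char* accepting, // per-state accept flag
                             int nPackets, int* hit)
    {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p >= nPackets) return;
        int s = 0;                                   // DFA start state
        for (int i = offset[p]; i < offset[p + 1]; ++i) {
            s = nextState[s * 256 + payload[i]];     // table-driven transition
            if (accepting[s]) { hit[p] = 1; return; }
        }
        hit[p] = 0;
    }
    ```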

  7. Accelerating Scientific Applications using High Performance Dense and Sparse Linear Algebra Kernels on GPUs

    KAUST Repository

    Abdelfattah, Ahmad

    2015-01-15

    High performance computing (HPC) platforms are evolving toward more heterogeneous configurations to support the workloads of various applications. The current hardware landscape is composed of traditional multicore CPUs equipped with hardware accelerators that can handle high levels of parallelism. Graphical Processing Units (GPUs) are popular high performance hardware accelerators in modern supercomputers. GPU programming has a different model than that for CPUs, which means that many numerical kernels have to be redesigned and optimized specifically for this architecture. GPUs usually outperform multicore CPUs in some compute-intensive and massively parallel applications that have regular processing patterns. However, most scientific applications rely on crucial memory-bound kernels and may witness bottlenecks due to the overhead of the memory bus latency. They can still take advantage of the GPU compute power capabilities, provided that an efficient architecture-aware design is achieved. This dissertation presents a uniform design strategy for optimizing critical memory-bound kernels on GPUs. Based on hierarchical register blocking, double buffering and latency hiding techniques, this strategy leverages the performance of a wide range of standard numerical kernels found in dense and sparse linear algebra libraries. The work presented here focuses on matrix-vector multiplication kernels (MVM) as representative and most important memory-bound operations in this context. Each kernel inherits the benefits of the proposed strategies. By exposing a proper set of tuning parameters, the strategy is flexible enough to suit different types of matrices, ranging from large dense matrices to sparse matrices with dense block structures, while high performance is maintained. Furthermore, the tuning parameters are used to maintain the relative performance across different GPU architectures. Multi-GPU acceleration is proposed to scale the performance on several devices. ...

  8. Accelerating Astronomy & Astrophysics in the New Era of Parallel Computing: GPUs, Phi and Cloud Computing

    Science.gov (United States)

    Ford, Eric B.; Dindar, Saleh; Peters, Jorg

    2015-08-01

    The realism of astrophysical simulations and statistical analyses of astronomical data is set by the available computational resources. Thus, astronomers and astrophysicists are constantly pushing the limits of computational capabilities. For decades, astronomers benefited from massive improvements in computational power that were driven primarily by increasing clock speeds and required relatively little attention to the details of the computational hardware. For nearly a decade, increases in computational capabilities have come primarily from increasing the degree of parallelism rather than increasing clock speeds. Further increases in computational capabilities will likely be led by many-core architectures such as Graphical Processing Units (GPUs) and the Intel Xeon Phi. Successfully harnessing these new architectures requires significantly more understanding of the hardware architecture, cache hierarchy, compiler capabilities and network characteristics. I will provide an astronomer's overview of the opportunities and challenges provided by modern many-core architectures and elastic cloud computing. The primary goal is to help an astronomical audience understand what types of problems are likely to yield more than order-of-magnitude speed-ups and which problems are unlikely to parallelize sufficiently efficiently to be worth the development time and/or costs. I will draw on my experience leading a team in developing the Swarm-NG library for parallel integration of large ensembles of small n-body systems on GPUs, as well as several smaller software projects. I will share lessons learned from collaborating with computer scientists, including both technical and soft skills. Finally, I will discuss the challenges of training the next generation of astronomers to be proficient in this new era of high-performance computing, drawing on experience teaching a graduate class on High-Performance Scientific Computing for Astrophysics and organizing a 2014 advanced summer ...

  9. A Novel CSR-Based Sparse Matrix-Vector Multiplication on GPUs

    Directory of Open Access Journals (Sweden)

    Guixia He

    2016-01-01

    Sparse matrix-vector multiplication (SpMV) is an important operation in scientific computations. Compressed sparse row (CSR) is the most frequently used format for storing sparse matrices. However, CSR-based SpMV on graphics processing units (GPUs), for example CSR-scalar and CSR-vector, usually has poor performance due to irregular memory access patterns. This motivates us to propose a perfect CSR-based SpMV on the GPU, called PCSR. PCSR involves two kernels and accesses the CSR arrays in a fully coalesced manner by introducing a middle array, which greatly alleviates the deficiencies of CSR-scalar (rare coalescing) and CSR-vector (partial coalescing). Test results on a single C2050 GPU show that PCSR outperforms CSR-scalar, CSR-vector, and the CSRMV and HYBMV kernels in the vendor-tuned CUSPARSE library, and is comparable with a most recently proposed CSR-based algorithm, CSR-Adaptive. Furthermore, we extend PCSR from a single GPU to multiple GPUs. Experimental results on four C2050 GPUs show that, whether or not the communication between GPUs is considered, PCSR on multiple GPUs achieves good performance and high parallel efficiency.
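
    For contrast with PCSR, the classic CSR-vector kernel it improves on looks like this (standard textbook form, not the paper's code). One warp covers a row, so coalescing holds only within long rows, which is the partial-coalescing weakness the abstract mentions:

    ```cuda
    // CSR-vector SpMV: one 32-thread warp per row, warp-shuffle reduction.
    __global__ void csrVectorSpMV(int nRows, const int* rowPtr, const int* colIdx,
                                  const double* val, const double* x, double* y)
    {
        int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
        int lane   = threadIdx.x & 31;
        if (warpId >= nRows) return;
        double sum = 0.0;
        for (int j = rowPtr[warpId] + lane; j < rowPtr[warpId + 1]; j += 32)
            sum += val[j] * x[colIdx[j]];        // coalesced within the row only
        for (int off = 16; off > 0; off >>= 1)   // reduce partial sums in-warp
            sum += __shfl_down_sync(0xffffffff, sum, off);
        if (lane == 0) y[warpId] = sum;
    }
    ```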

  10. Confabulation Based Real-time Anomaly Detection for Wide-area Surveillance Using Heterogeneous High Performance Computing Architecture

    Science.gov (United States)

    2015-06-01

    Report on confabulation-based real-time anomaly detection for wide-area surveillance using a heterogeneous high-performance computing architecture (Syracuse University, contract FA8750-12-1-0251). The system is accelerated with heterogeneous processors, including graphics processing units (GPUs) and Intel Xeon Phi processors. Experimental results showed significant speedups, which can enable ...

  11. Musrfit - Real Time Parameter Fitting Using GPUs

    Science.gov (United States)

    Locans, Uldis; Suter, Andreas

    High transverse field μSR (HTF-μSR) experiments typically lead to rather large data sets, since it is necessary to follow the high frequencies present in the positron decay histograms. The analysis of these data sets can be very time consuming, usually due to the limited computational power of the hardware. To overcome limited computing resources, a rotating reference frame (RRF) transformation is often used to reduce the data sets that need to be handled. This comes at a price that the μSR community is typically not aware of: (i) due to the RRF transformation, the fitting parameter estimates are of poorer precision, i.e., more extended, expensive beamtime is needed; (ii) the RRF introduces systematic errors which hamper the statistical interpretation of χ2 or the maximum log-likelihood. We briefly discuss these issues in a non-exhaustive, practical way. The one and only reason for the RRF transformation is sluggish computing power; therefore, in this work, GPU (graphics processing unit) based fitting was developed, which allows real-time full data analysis without RRF. GPUs have become increasingly popular in scientific computing in recent years. Due to their highly parallel architecture, they provide the opportunity to accelerate many applications at considerably lower cost than upgrading the CPU computational power. With the emergence of frameworks such as CUDA and OpenCL, these devices have become more easily programmable. In this work, GPU support was added to Musrfit, a data analysis framework for μSR experiments. The new fitting algorithm uses CUDA or OpenCL to offload the most time-consuming parts of the calculations to NVIDIA or AMD GPUs. Using the current CPU implementation in Musrfit, parameter fitting can take hours for certain data sets, while the GPU version allows real-time data analysis on the same data sets. This work describes the challenges that arise in adding GPU support to Musrfit as well as the results obtained.
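
    The offloaded work is essentially an embarrassingly parallel objective-function evaluation. A minimal sketch of such a kernel follows (our illustration; Musrfit's actual CUDA/OpenCL kernels are more elaborate):

    ```cuda
    // Accumulate chi^2 over histogram bins, one bin per thread.
    __global__ void chi2Kernel(const double* data, const double* model,
                               const double* err, int nBins, double* chi2)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nBins) return;
        double r = (data[i] - model[i]) / err[i];
        atomicAdd(chi2, r * r);   // double atomicAdd needs compute capability >= 6.0
    }
    ```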

  12. Green smartphone GPUs: Optimizing energy consumption using GPUFreq scaling governors

    KAUST Repository

    Ahmad, Enas M.

    2015-10-19

    Modern smartphones are limited by their short battery life. Advances in graphical performance are considered one of the main reasons behind the massive battery drain in smartphones. In this paper we present a novel implementation of the GPUFreq scaling governors, a Dynamic Voltage and Frequency Scaling (DVFS) model implemented in the Android Linux kernel for dynamically scaling smartphone Graphical Processing Units (GPUs). The GPUFreq governors offer users multiple variations and alternatives for controlling the power consumption and performance of their GPUs. We implemented and evaluated our model on a smartphone GPU and measured the energy performance using an external power monitor. The results show that the energy consumption of smartphone GPUs can be significantly reduced with a minor effect on GPU performance.

  13. ECC2K-130 on NVIDIA GPUs

    NARCIS (Netherlands)

    Bernstein, D.J.; Chen, H.-C.; Cheng, C.M.; Lange, T.; Niederhagen, R.F.; Schwabe, P.; Yang, B.Y.; Gong, G.; Gupta, K.C.

    2010-01-01

    A major cryptanalytic computation is currently underway on multiple platforms, including standard CPUs, FPGAs, PlayStations and Graphics Processing Units (GPUs), to break the Certicom ECC2K-130 challenge. This challenge is to compute an elliptic-curve discrete logarithm on a Koblitz curve over GF(2^131). ...

  14. Heterogeneous Multicore Parallel Programming for Graphics Processing Units

    Directory of Open Access Journals (Sweden)

    Francois Bodin

    2009-01-01

    Hybrid parallel multicore architectures based on graphics processing units (GPUs) can provide tremendous computing power. Current NVIDIA and AMD Graphics Product Group hardware displays a peak performance of hundreds of gigaflops. However, exploiting GPUs from existing applications is a difficult task that requires non-portable rewriting of the code. In this paper, we present HMPP, a Heterogeneous Multicore Parallel Programming workbench with compilers, developed by CAPS entreprise, that allows the integration of heterogeneous hardware accelerators in an unintrusive manner while preserving the legacy code.

  15. Multidisciplinary Simulation Acceleration using Multiple Shared-Memory Graphical Processing Units

    Science.gov (United States)

    Kemal, Jonathan Yashar

    For purposes of optimizing and analyzing turbomachinery and other designs, the unsteady Favre-averaged flow-field differential equations for an ideal compressible gas can be solved in conjunction with the heat conduction equation. We solve all equations using the finite-volume multiple-grid numerical technique, with the dual time-step scheme used for unsteady simulations. Our numerical solver code targets CUDA-capable Graphical Processing Units (GPUs) produced by NVIDIA. Making use of MPI, our solver can run across networked compute nodes, where each MPI process can use either a GPU or a Central Processing Unit (CPU) core for primary solver calculations. We use NVIDIA Tesla C2050/C2070 GPUs based on the Fermi architecture and compare the resulting performance against Intel Xeon X5690 CPUs. Solver routines converted to CUDA typically run about 10 times faster on a GPU for sufficiently dense computational grids. We used a conjugate-cylinder computational grid and ran a turbulent steady-flow simulation on four increasingly dense computational grids. Our densest computational grid is divided into 13 blocks, each containing 1033x1033 grid points, for a total of 13.87 million grid points, or 1.07 million grid points per domain block. To obtain overall speedups, we compare the execution time of the solver's iteration loop, including all resource-intensive GPU-related memory copies. Comparing the performance of 8 GPUs to that of 8 CPUs, we obtain an overall speedup of about 6.0 when using our densest computational grid. This amounts to an 8-GPU simulation running about 39.5 times faster than a single-CPU simulation.
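
    The rank-to-device binding such a solver needs is brief; a sketch with assumed names, not the thesis code:

    ```cuda
    // Bind each MPI rank to a local GPU (round-robin); surplus ranks use CPU paths.
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);
        int rank  = 0;  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int nGpus = 0;  cudaGetDeviceCount(&nGpus);
        if (nGpus > 0)
            cudaSetDevice(rank % nGpus);   // ranks beyond nGpus share local GPUs
        // ... per-rank flow-solver iterations (GPU kernels or a CPU core) ...
        MPI_Finalize();
        return 0;
    }
    ```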

  16. GPUs for real-time processing in HEP trigger systems

    CERN Document Server

    Ammendola, R; Deri, L; Fiorini, M; Frezza, O; Lamanna, G; Lo Cicero, F; Lonardo, A; Messina, A; Sozzi, M; Pantaleo, F; Paolucci, Ps; Rossetti, D; Simula, F; Tosoratto, L; Vicini, P

    2014-01-01

    We describe a pilot project (GAP - GPU Application Project) for the use of GPUs (Graphics processing units) for online triggering applications in High Energy Physics experiments. Two major trends can be identified in the development of trigger and DAQ systems for particle physics experiments: the massive use of general-purpose commodity systems such as commercial multicore PC farms for data acquisition, and the reduction of trigger levels implemented in hardware, towards a fully software data selection system ("trigger-less"). The innovative approach presented here aims at exploiting the parallel computing power of commercial GPUs to perform fast computations in software, not only at high trigger levels but also in early trigger stages. General-purpose computing on GPUs is emerging as a new paradigm in several fields of science, although so far applications have been tailored to the specific strengths of such devices as accelerators in offline computation. With the steady reduction of GPU latencies, and the incre...

  17. METRIC context unit architecture

    Energy Technology Data Exchange (ETDEWEB)

    Simpson, R.O.

    1988-01-01

    METRIC is an architecture for a simple but powerful Reduced Instruction Set Computer (RISC). Its speed comes from the simultaneous processing of several instruction streams, with instructions from the various streams being dispatched into METRIC's execution pipeline as they become available for execution. The pipeline is thus kept full, with a mix of instructions for several contexts in execution at the same time. True parallel programming is supported within a single execution unit, the METRIC Context Unit. METRIC's architecture provides for expansion through the addition of multiple Context Units and of specialized Functional Units. The architecture thus spans a range of size and performance from a single-chip microcomputer up through large and powerful multiprocessors. This research concentrates on the specification of the METRIC Context Unit at the architectural level. Performance tradeoffs made during METRIC's design are discussed, and projections of METRIC's performance are made based on simulation studies.

  18. Three-directional motion-compensation mask-based novel look-up table on graphics processing units for video-rate generation of digital holographic videos of three-dimensional scenes.

    Science.gov (United States)

    Kwon, Min-Woo; Kim, Seung-Cheol; Kim, Eun-Soo

    2016-01-20

    A three-directional motion-compensation mask-based novel look-up table method is proposed and implemented on graphics processing units (GPUs) for video-rate generation of digital holographic videos of three-dimensional (3D) scenes. Since the proposed method is designed to be well matched with the software and memory structures of GPUs, the number of compute-unified-device-architecture kernel function calls can be significantly reduced. This results in a great increase of the computational speed of the proposed method, allowing video-rate generation of the computer-generated hologram (CGH) patterns of 3D scenes. Experimental results reveal that the proposed method can generate 39.8 frames of Fresnel CGH patterns with 1920×1080 pixels per second for the test 3D video scenario with 12,088 object points on dual GPU boards of NVIDIA GTX TITANs, and they confirm the feasibility of the proposed method in the practical application fields of electroholographic 3D displays.

  19. Impact of memory bottleneck on the performance of graphics processing units

    Science.gov (United States)

    Son, Dong Oh; Choi, Hong Jun; Kim, Jong Myon; Kim, Cheol Hong

    2015-12-01

    Recent graphics processing units (GPUs) can process general-purpose applications as well as graphics applications with the help of various user-friendly application programming interfaces (APIs) supported by GPU vendors. Unfortunately, utilizing the hardware resources of the GPU efficiently is a challenging problem, since the GPU architecture is totally different from the traditional CPU architecture. To solve this problem, many studies have focused on techniques for improving system performance using GPUs. In this work, we analyze GPU performance while varying GPU parameters such as the number of cores and the clock frequency. According to our simulations, GPU performance can be improved by 125.8% and 16.2% on average as the number of cores and the clock frequency increase, respectively. However, performance saturates when memory bottleneck problems occur due to huge data requests to the memory. The performance of GPUs can be improved further as the memory bottleneck is reduced by changing GPU parameters dynamically.
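
    The saturation effect can be reproduced with two kernels that do identical arithmetic but differ in access pattern (our demonstration, not the paper's simulator):

    ```cuda
    // Coalesced vs. strided access: same FLOPs, very different memory traffic.
    __global__ void scaleCoalesced(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 2.0f * in[i];   // neighboring threads touch neighboring words
    }

    __global__ void scaleStrided(const float* in, float* out, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        int j = (int)(((long long)i * stride) % n);  // scattered memory transactions
        out[j] = 2.0f * in[j];              // throughput drops as the stride grows
    }
    ```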

  1. Accelerating Lattice QCD Multigrid on GPUs Using Fine-Grained Parallelization

    Energy Technology Data Exchange (ETDEWEB)

    Clark, M. A. [NVIDIA Corp., Santa Clara; Joó, Bálint [Jefferson Lab; Strelchenko, Alexei [Fermilab; Cheng, Michael [Boston U., Ctr. Comp. Sci.; Gambhir, Arjun [William-Mary Coll.; Brower, Richard [Boston U.

    2016-12-22

    The past decade has witnessed a dramatic acceleration of lattice quantum chromodynamics calculations in nuclear and particle physics. This has been due both to significant progress in accelerating the iterative linear solvers using multigrid algorithms, and to the throughput improvements brought by GPUs. Deploying hierarchical algorithms optimally on GPUs is non-trivial owing to the lack of parallelism on the coarse grids, and as such, these advances have not proved multiplicative. Using the QUDA library, we demonstrate that by exposing all sources of parallelism that the underlying stencil problem possesses, and through appropriate mapping of this parallelism to the GPU architecture, we can achieve high efficiency even for the coarsest of grids. Results are presented for the Wilson-Clover discretization, where we demonstrate up to a 10x speedup over present state-of-the-art GPU-accelerated methods on Titan. Finally, we look to the future and consider the software implications of our findings.

  2. Elastic Cloud Computing Architecture and System for Heterogeneous Spatiotemporal Computing

    Science.gov (United States)

    Shi, X.

    2017-10-01

    Spatiotemporal computation implements a variety of different algorithms. When big data are involved, a desktop computer or standalone application may not be able to complete the computation task due to limited memory and computing power. Now that a variety of hardware accelerators and computing platforms are available to improve the performance of geocomputation, different algorithms may behave differently on different computing infrastructures and platforms. Some are perfect for implementation on a cluster of graphics processing units (GPUs), while GPUs may not be useful for certain kinds of spatiotemporal computation. The same holds for utilizing a cluster of Intel's many-integrated-core (MIC) processors or Xeon Phi, as well as Hadoop or Spark platforms, to handle big spatiotemporal data. Furthermore, considering the energy-efficiency requirement in general computation, a Field Programmable Gate Array (FPGA) may be a better solution for energy efficiency when the performance of computation could be similar to or better than that of GPUs and MICs. It is expected that an elastic cloud computing architecture and system that integrates GPUs, MICs, and FPGAs could be developed and deployed to support spatiotemporal computing over heterogeneous data types and computational problems.

  3. High-performance blob-based iterative three-dimensional reconstruction in electron tomography using multi-GPUs

    Directory of Open Access Journals (Sweden)

    Wan Xiaohua

    2012-06-01

    Background: Three-dimensional (3D) reconstruction in electron tomography (ET) has emerged as a leading technique to elucidate the molecular structures of complex biological specimens. Blob-based iterative methods are advantageous reconstruction methods for 3D reconstruction in ET, but demand huge computational costs. Multiple graphics processing units (multi-GPUs) offer an affordable platform to meet these demands. However, a synchronous communication scheme between multi-GPUs leads to idle GPU time, and the weighted matrix involved in iterative methods cannot be loaded into GPUs, especially for large images, due to the limited available memory of GPUs. Results: In this paper we propose a multilevel parallel strategy combined with an asynchronous communication scheme and a blob-ELLR data structure to efficiently perform blob-based iterative reconstructions on multi-GPUs. The asynchronous communication scheme is used to minimize idle GPU time by overlapping communications with computations. The blob-ELLR data structure needs only nearly 1/16 of the storage space in comparison with the ELLPACK-R (ELLR) data structure and yields significant acceleration. Conclusions: Experimental results indicate that the multilevel parallel scheme combined with the asynchronous communication scheme and the blob-ELLR data structure allows efficient implementations of 3D reconstruction in ET on multi-GPUs.
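
    The building block of such an asynchronous scheme is a per-GPU stream on which transfers and kernels are queued back to back; a sketch with an assumed kernel and names, not the paper's code:

    ```cuda
    // Queue copy + compute on one stream so transfers overlap other GPUs' work.
    #include <cuda_runtime.h>

    __global__ void backproject(float* slab, const float* proj, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) slab[i] += proj[i];   // stand-in for the real blob backprojection
    }

    void processChunk(int gpu, float* d_slab, float* d_proj,
                      const float* h_proj /* pinned via cudaMallocHost */,
                      size_t bytes, int n)
    {
        cudaSetDevice(gpu);
        cudaStream_t s;
        cudaStreamCreate(&s);
        cudaMemcpyAsync(d_proj, h_proj, bytes, cudaMemcpyHostToDevice, s);
        backproject<<<(n + 255) / 256, 256, 0, s>>>(d_slab, d_proj, n);
        cudaStreamSynchronize(s);        // the real scheme defers this sync
        cudaStreamDestroy(s);
    }
    ```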

  5. Selecting a Benchmark Suite to Profile High-Performance Computing (HPC) Machines

    Science.gov (United States)

    2014-11-01

    architectures. Machines now contain central processing units (CPUs), graphics processing units (GPUs), and many-integrated-core (MIC) architectures all... evaluate the feasibility and applicability of a new architecture just released to the market. Researchers are often unsure how available resources will... architectures. Having a suite of programs running on different architectures, such as GPUs, MICs, and CPUs, adds complexity and technical challenges

  6. Online tracking with GPUs at the PANDA experiment

    Energy Technology Data Exchange (ETDEWEB)

    Bianchi, Ludovico; Herten, Andreas; Ritman, James; Stockmanns, Tobias [Forschungszentrum Juelich (Germany); Collaboration: PANDA-Collaboration

    2015-07-01

    The PANDA experiment is a next generation particle detector planned for operation at the FAIR facility, that will study collisions of antiprotons with beam momenta of 1.5-15 GeV/c on a fixed proton target. Signal and background events at PANDA will look very similar, making a conventional hardware-trigger based approach unfeasible. Instead, data coming from the detector are acquired continuously, and event selection is performed in real time. A rejection factor of up to 1000 is needed to reduce the data rate for offline storage, making the data acquisition system computationally very challenging. Our activity within the PANDA collaboration is centered on the development and implementation of particle tracking algorithms on Graphical Processing Units (GPUs), and on studying the possibility of performing tracking for online event filtering using a multi-GPU architecture. Three algorithms are currently being developed, using information from the PANDA tracking system: a Hough Transform, a Riemann Track Finder, and a Triplet Finder algorithm. This talk presents the algorithms, their performance, and studies for GPU data transfer methods based on so-called message queues for a deeper integration of the algorithms with the FairRoot and PandaRoot frameworks.

  7. Memory transfer optimization for a lattice Boltzmann solver on Kepler architecture nVidia GPUs

    Science.gov (United States)

    Mawson, Mark J.; Revell, Alistair J.

    2014-10-01

    The Lattice Boltzmann method (LBM) for solving fluid flow is naturally well suited to an efficient implementation for massively parallel computing, due to the prevalence of local operations in the algorithm. This paper presents and analyses the performance of a 3D lattice Boltzmann solver, optimized for third-generation nVidia GPU hardware, also known as 'Kepler'. We provide a review of previous optimization strategies and analyse data read/write times for different memory types. In LBM, the time propagation step (known as streaming) involves shifting data to adjacent locations and is central to parallel performance; here we examine three approaches which make use of different hardware options. Two of these use 'performance-enhancing' features of the GPU: shared memory and the new shuffle instruction found in Kepler-based GPUs. These are compared to a standard transfer of data which relies instead on optimized storage to increase coalesced access. It is shown that the simpler approach is the most efficient: because LBM requires a large number of registers per thread, the block size is limited and the benefit of these special features is reduced. Detailed results are obtained for a D3Q19 LBM solver, which is benchmarked on nVidia K5000M and K20C GPUs. In the latter case the use of a read-only data cache is explored, and a peak performance of over 1036 Million Lattice Updates Per Second (MLUPS) is achieved. The appearance of a periodic bottleneck in the solver performance is also reported, believed to be hardware related; spikes in iteration time occur with a frequency of around 11 Hz for both GPUs, independent of the size of the problem.
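
    For reference, the "standard" streaming transfer that the paper finds most efficient is essentially a pull: each node reads the distribution value of its upstream neighbour with a plain, coalesced global-memory access. A stripped-down CUDA sketch for a single lattice direction (cx, cy, cz) of a D3Q19 solver, with periodic wrap-around and illustrative names, might read:

    ```cuda
    // Pull-streaming for one distribution component, stored structure-of-arrays.
    // Each thread writes its own site and reads the upstream neighbour, so
    // writes along x stay coalesced.
    __global__ void stream_pull(const float* __restrict__ f_src,
                                float* __restrict__ f_dst,
                                int nx, int ny, int nz,
                                int cx, int cy, int cz)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        int z = blockIdx.z;
        if (x >= nx || y >= ny || z >= nz) return;

        // Upstream neighbour with periodic boundaries.
        int xs = (x - cx + nx) % nx;
        int ys = (y - cy + ny) % ny;
        int zs = (z - cz + nz) % nz;
        f_dst[(z * ny + y) * nx + x] = f_src[(zs * ny + ys) * nx + xs];
    }
    ```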

  8. Resting state networks' corticotopy: the dual intertwined rings architecture.

    Directory of Open Access Journals (Sweden)

    Salma Mesmoudi

    Full Text Available How does the brain integrate multiple sources of information to support normal sensorimotor and cognitive functions? To investigate this question we present an overall brain architecture (called "the dual intertwined rings architecture") that relates the functional specialization of cortical networks to their spatial distribution over the cerebral cortex (or "corticotopy"). Recent results suggest that the resting state networks (RSNs) are organized into two large families: 1) a sensorimotor family that includes visual, somatic, and auditory areas and 2) a large association family that comprises parietal, temporal, and frontal regions and also includes the default mode network. We used two large databases of resting state fMRI data, from which we extracted 32 robust RSNs. We estimated: (1) the RSN functional roles by using a projection of the results on task based networks (TBNs) as referenced in large databases of fMRI activation studies; and (2) the relationship of the RSNs with the Brodmann Areas. In both classifications, the 32 RSNs are organized into a remarkable architecture of two intertwined rings per hemisphere, and so four rings linked by homotopic connections. The first ring forms a continuous ensemble and includes visual, somatic, and auditory cortices, with interspersed bimodal cortices (auditory-visual, visual-somatic and auditory-somatic; abbreviated as the VSA ring). The second ring integrates distant parietal, temporal and frontal regions (PTF ring) through a network of association fiber tracts which closes the ring anatomically and ensures a functional continuity within the ring. The PTF ring relates association cortices specialized in attention, language and working memory, to the networks involved in motivation and biological regulation and rhythms. This "dual intertwined architecture" suggests a dual integrative process: the VSA ring performs fast real-time multimodal integration of sensorimotor information whereas the PTF ring performs multi

  9. A Framework for Lattice QCD Calculations on GPUs

    Energy Technology Data Exchange (ETDEWEB)

    Winter, Frank; Clark, M A; Edwards, Robert G; Joo, Balint

    2014-08-01

    Computing platforms equipped with accelerators like GPUs have proven to provide great computational power. However, exploiting such platforms for existing scientific applications is not a trivial task. Current GPU programming frameworks such as CUDA C/C++ require low-level programming from the developer in order to achieve high performance code. As a result porting of applications to GPUs is typically limited to time-dominant algorithms and routines, leaving the remainder not accelerated which can open a serious Amdahl's law issue. The lattice QCD application Chroma allows to explore a different porting strategy. The layered structure of the software architecture logically separates the data-parallel from the application layer. The QCD Data-Parallel software layer provides data types and expressions with stencil-like operations suitable for lattice field theory and Chroma implements algorithms in terms of this high-level interface. Thus by porting the low-level layer one can effectively move the whole application in one swing to a different platform. The QDP-JIT/PTX library, the reimplementation of the low-level layer, provides a framework for lattice QCD calculations for the CUDA architecture. The complete software interface is supported and thus applications can be run unaltered on GPU-based parallel computers. This reimplementation was possible due to the availability of a JIT compiler (part of the NVIDIA Linux kernel driver) which translates an assembly-like language (PTX) to GPU code. The expression template technique is used to build PTX code generators and a software cache manages the GPU memory. This reimplementation allows us to deploy an efficient implementation of the full gauge-generation program with dynamical fermions on large-scale GPU-based machines such as Titan and Blue Waters which accelerates the algorithm by more than an order of magnitude.

  10. The neural architecture of age-related dual-task interferences

    Directory of Open Access Journals (Sweden)

    Witold Xaver Chmielewski

    2014-07-01

    Full Text Available In daily life elderly adults exhibit deficits when dual-tasking is involved. So far these deficits have been verified on a behavioral level in dual-tasking. Yet, the neuronal architecture of these deficits in aging still remains to be explored, especially when late-middle aged individuals around 60 years of age are concerned. Neuroimaging studies in young participants concerning dual-tasking were, among others, related to activity in the middle frontal gyrus (MFG), superior frontal gyrus (SFG) and the anterior insula (AI). According to the frontal lobe hypothesis of aging, alterations in these frontal regions (i.e., SFG and MFG) might be responsible for cognitive deficits. We measured brain activity using fMRI, while examining age-dependent variations in dual-tasking by utilizing the PRP (psychological refractory period) test. Behavioral data showed an increasing PRP effect in late-middle aged adults. The results reflect age-related deterioration of dual-tasking performance, especially under conditions of increased complexity. These effects are related to changes in networks involving the anterior insula, the SFG and the MFG. The results suggest that different cognitive subprocesses are affected that mediate the observed dual-tasking problems in late-middle aged individuals.

  11. The neural architecture of age-related dual-task interferences.

    Science.gov (United States)

    Chmielewski, Witold X; Yildiz, Ali; Beste, Christian

    2014-01-01

    In daily life elderly adults exhibit deficits when dual-tasking is involved. So far these deficits have been verified on a behavioral level in dual-tasking. Yet, the neuronal architecture of these deficits in aging still remains to be explored, especially when late-middle aged individuals around 60 years of age are concerned. Neuroimaging studies in young participants concerning dual-tasking were, among others, related to activity in the middle frontal gyrus (MFG), superior frontal gyrus (SFG) and the anterior insula (AI). According to the frontal lobe hypothesis of aging, alterations in these frontal regions (i.e., SFG and MFG) might be responsible for cognitive deficits. We measured brain activity using fMRI, while examining age-dependent variations in dual-tasking by utilizing the PRP (psychological refractory period) test. Behavioral data showed an increasing PRP effect in late-middle aged adults. The results reflect age-related deterioration of dual-tasking performance, especially under conditions of increased complexity. These effects are related to changes in networks involving the AI, the SFG and the MFG. The results suggest that different cognitive subprocesses are affected that mediate the observed dual-tasking problems in late-middle aged individuals.

  12. Optimization of Monte Carlo algorithms and ray tracing on GPUs

    International Nuclear Information System (INIS)

    Bergmann, R.M.; Vujic, J.L.

    2013-01-01

    To take advantage of the computational power of GPUs (Graphical Processing Units), algorithms that work well on CPUs must be modified to conform to the GPU execution model. In this study, typical task-parallel Monte Carlo algorithms have been reformulated in a data-parallel way, and the benefits of doing so are examined. We were able to show that the data-parallel approach greatly improves thread coherency and keeps thread blocks busy, improving GPU utilization compared to the task-parallel approach. The data-parallel approach does not, however, outperform the task-parallel approach with regard to speedup over the CPU. Regarding ray-tracing acceleration, OptiX shows promise for providing enough ray-tracing speed to be used in a full 3D Monte Carlo neutron transport code for reactor calculations. It is important to note that it is necessary to operate on large datasets of particle histories in order to achieve good performance in both OptiX and the data-parallel algorithm, since this reduces the impact of latency. Our paper also shows the need to rewrite standard Monte Carlo algorithms in order to take full advantage of these new, powerful processor architectures.
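
    The data-parallel reformulation is not spelled out in the record; a minimal sketch of the general pattern, assuming a simple particle bank with placeholder physics, is to advance all particles by one event per kernel launch and stream-compact terminated histories so that warps stay coherent:

    ```cuda
    #include <thrust/device_vector.h>
    #include <thrust/remove.h>

    struct Particle { float x, y, z, energy; bool alive; };

    struct is_dead {
        __host__ __device__ bool operator()(const Particle& p) const {
            return !p.alive;
        }
    };

    // One event per particle per launch; the physics here is a placeholder.
    __global__ void advance_one_event(Particle* p, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        p[i].energy *= 0.9f;                     // stand-in for sample/move/score
        if (p[i].energy < 1e-3f) p[i].alive = false;
    }

    void run_batch(thrust::device_vector<Particle>& bank)
    {
        while (!bank.empty()) {
            int n = static_cast<int>(bank.size());
            advance_one_event<<<(n + 255) / 256, 256>>>(
                thrust::raw_pointer_cast(bank.data()), n);
            // Stream-compact the bank so only live particles remain.
            bank.erase(thrust::remove_if(bank.begin(), bank.end(), is_dead()),
                       bank.end());
        }
    }
    ```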

  13. Data Acquisition with GPUs: The DAQ for the Muon $g$-$2$ Experiment at Fermilab

    Energy Technology Data Exchange (ETDEWEB)

    Gohn, W. [Kentucky U.

    2016-11-15

    Graphical Processing Units (GPUs) have recently become a valuable computing tool for the acquisition of data at high rates and at relatively low cost. The devices work by parallelizing the code into thousands of threads, each executing a simple process, such as identifying pulses from a waveform digitizer. The CUDA programming library can be used to effectively write code to parallelize such tasks on Nvidia GPUs, providing a significant upgrade in performance over CPU-based acquisition systems. The muon $g$-$2$ experiment at Fermilab relies heavily on GPUs to process its data. The data acquisition system for this experiment must have the ability to create deadtime-free records from 700 $\mu$s muon spills at a raw data rate of 18 GB per second. Data will be collected using 1296 channels of $\mu$TCA-based 800 MSPS, 12-bit waveform digitizers and processed in a layered array of networked commodity processors with 24 GPUs working in parallel to perform a fast recording of the muon decays during the spill. The described data acquisition system is currently being constructed, and will be fully operational before the start of the experiment in 2017.
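
    The record gives no code; the per-sample work that parallelizes well here is simple threshold-crossing detection on a digitized waveform. A hedged CUDA sketch (threshold, polarity, record layout and names are assumptions, not the experiment's actual implementation):

    ```cuda
    // Each thread inspects one sample; a pulse "starts" where the waveform
    // first rises above threshold. Pulse start indices are collected with an
    // atomic counter.
    __global__ void find_pulses(const short* __restrict__ wave, int n_samples,
                                short threshold,
                                int* __restrict__ pulse_idx, int* n_pulses,
                                int max_pulses)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i <= 0 || i >= n_samples) return;
        if (wave[i] > threshold && wave[i - 1] <= threshold) {
            int slot = atomicAdd(n_pulses, 1);
            if (slot < max_pulses) pulse_idx[slot] = i;
        }
    }
    ```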

  14. Dual Smarandache Curves of a Timelike Curve lying on Unit dual Lorentzian Sphere

    OpenAIRE

    Kahraman, Tanju; Hüseyin Ugurlu, Hasan

    2016-01-01

    In this paper, we give the Darboux approximation for dual Smarandache curves of a timelike curve on the unit dual Lorentzian sphere. Firstly, we define the four types of dual Smarandache curves of a timelike curve lying on the dual Lorentzian sphere.

  15. Computation Reduction Oriented Circular Scanning SAR Raw Data Simulation on Multi-GPUs

    Directory of Open Access Journals (Sweden)

    Hu Chen

    2016-08-01

    Full Text Available As a special working mode, circular scanning Synthetic Aperture Radar (SAR) is widely used in earth observation. As resolution and swath width increase, the volume of simulated data grows massively, which raises new efficiency requirements. By analyzing the redundancy in raw data simulation based on Graphics Processing Units (GPUs), a fast simulation method that reduces redundant computation is realized with multiple GPUs and the Message Passing Interface (MPI). The results show that the redundancy reduction doubles the efficiency of the 4-GPU version while the hardware cost decreases by 50%; overall, the speedup reaches 350 times that of the traditional CPU simulation.

  16. A Weighted Spatial-Spectral Kernel RX Algorithm and Efficient Implementation on GPUs

    Directory of Open Access Journals (Sweden)

    Chunhui Zhao

    2017-02-01

    Full Text Available The kernel RX (KRX) detector proposed by Kwon and Nasrabadi exploits a kernel function to obtain better detection performance. However, it still has two limitations that can be addressed. On the one hand, reasonable integration of spatial-spectral information can further improve its detection accuracy. On the other hand, parallel computing can reduce the processing time of available KRX detectors. Accordingly, this paper presents a novel weighted spatial-spectral kernel RX (WSSKRX) detector and its parallel implementation on graphics processing units (GPUs). The WSSKRX utilizes spatial neighborhood resources to reconstruct the testing pixels by introducing a spectral factor and a spatial window, thereby effectively reducing the interference of background noise. Then, the kernel function is redesigned as a mapping trick in the KRX detector to implement the anomaly detection. In addition, a powerful GPU-based architecture is designed to accelerate WSSKRX. To substantiate the performance of the proposed algorithm, experiments are conducted on both synthetic and real data.

  17. Cross-Identification of Astronomical Catalogs on Multiple GPUs

    Science.gov (United States)

    Lee, M. A.; Budavári, T.

    2013-10-01

    One of the most fundamental problems in observational astronomy is the cross-identification of sources. Observations are made in different wavelengths, at different times, and from different locations and instruments, resulting in a large set of independent observations. The scientific outcome is often limited by our ability to quickly perform meaningful associations between detections. The matching, however, is difficult scientifically, statistically, as well as computationally. The former two require detailed physical modeling and advanced probabilistic concepts; the latter is due to the large volumes of data and the problem's combinatorial nature. In order to tackle the computational challenge and to prepare for future surveys, whose measurements will grow exponentially in size past the scale of feasible CPU-based solutions, we developed a new implementation which addresses the issue by performing the associations on multiple Graphics Processing Units (GPUs). Our implementation utilizes up to 6 GPUs in combination with the Thrust library to achieve an over 40x speedup versus the previous best implementation running on a multi-CPU SQL Server.
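
    The record names Thrust but shows no code. A single-GPU sketch of the zoning step that typically precedes cross-matching, assuming illustrative names and a simple declination-zone scheme, could look like:

    ```cuda
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/transform.h>

    // Bin each source into a declination zone; after sorting by zone id,
    // candidate matches are short contiguous ranges.
    struct zone_of {
        float zone_height_deg;
        __host__ __device__ int operator()(float dec_deg) const {
            return (int)((dec_deg + 90.0f) / zone_height_deg);
        }
    };

    void zone_sort(thrust::device_vector<float>& dec,
                   thrust::device_vector<int>& source_id,
                   float zone_height_deg)
    {
        thrust::device_vector<int> zone(dec.size());
        thrust::transform(dec.begin(), dec.end(), zone.begin(),
                          zone_of{zone_height_deg});
        // Reorder source ids by zone; matching then scans zone-contiguous runs.
        thrust::sort_by_key(zone.begin(), zone.end(), source_id.begin());
    }
    ```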

  18. Fast network centrality analysis using GPUs

    Directory of Open Access Journals (Sweden)

    Shi Zhiao

    2011-05-01

    Full Text Available Abstract Background With the exploding volume of data generated by continuously evolving high-throughput technologies, biological network analysis problems are growing larger in scale and craving more computational power. General Purpose computation on Graphics Processing Units (GPGPU) provides a cost-effective technology for the study of large-scale biological networks. Designing algorithms that maximize data parallelism is the key to leveraging the power of GPUs. Results We propose an efficient data-parallel formulation of the All-Pairs Shortest Path problem, which is the key component of shortest-path-based centrality computation. A betweenness centrality algorithm built upon this formulation was developed and benchmarked against the most recent GPU-based algorithm. Speedups of 11% to 19% were observed in various simulated scale-free networks. We further designed three algorithms based on this core component to compute closeness centrality, eccentricity centrality and stress centrality. To make all these algorithms available to the research community, we developed a software package, gpu-fan (GPU-based Fast Analysis of Networks), for CUDA-enabled GPUs. Speedups of 10-50× compared with CPU implementations were observed for simulated scale-free networks and real-world biological networks. Conclusions gpu-fan provides a significant performance improvement for centrality computation in large-scale networks. Source code is available under the GNU Public License (GPL) at http://bioinfo.vanderbilt.edu/gpu-fan/.
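
    As a rough illustration of the data-parallel All-Pairs Shortest Path formulation described above (not gpu-fan's actual kernels), an unblocked Floyd-Warshall on the GPU runs one pass per intermediate vertex k, with one thread updating one (i, j) entry of the distance matrix:

    ```cuda
    // One pass of Floyd-Warshall: relax all (i, j) through intermediate k.
    // The race on row/column k is benign because those entries cannot
    // improve during pass k.
    __global__ void fw_pass(float* dist, int n, int k)
    {
        int i = blockIdx.y * blockDim.y + threadIdx.y;
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n || j >= n) return;
        float via_k = dist[i * n + k] + dist[k * n + j];
        if (via_k < dist[i * n + j]) dist[i * n + j] = via_k;
    }

    void apsp(float* d_dist, int n)   // d_dist: n*n matrix in device memory
    {
        dim3 block(16, 16);
        dim3 grid((n + 15) / 16, (n + 15) / 16);
        for (int k = 0; k < n; ++k)   // each pass depends on the previous one
            fw_pass<<<grid, block>>>(d_dist, n, k);
    }
    ```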

  19. Massively parallel read mapping on GPUs with the q-group index and PEANUT

    NARCIS (Netherlands)

    J. Köster (Johannes); S. Rahmann (Sven)

    2014-01-01

    We present the q-group index, a novel data structure for read mapping tailored towards graphics processing units (GPUs) with a small memory footprint and efficient parallel algorithms for querying and building. On top of the q-group index we introduce PEANUT, a highly parallel GPU-based

  20. Analysis of impact of general-purpose graphics processor units in supersonic flow modeling

    Science.gov (United States)

    Emelyanov, V. N.; Karpenko, A. G.; Kozelkov, A. S.; Teterina, I. V.; Volkov, K. N.; Yalozo, A. V.

    2017-06-01

    Computational methods are widely used in the prediction of complex flowfields associated with off-normal situations in aerospace engineering. Modern graphics processing units (GPUs) provide architectures and new programming models that make it possible to harness their large processing power and to design computational fluid dynamics (CFD) simulations at both high performance and low cost. Possibilities of using GPUs for the simulation of external and internal flows on unstructured meshes are discussed. The finite volume method is applied to solve three-dimensional unsteady compressible Euler and Navier-Stokes equations on unstructured meshes with high-resolution numerical schemes. CUDA technology is used for the programming implementation of parallel computational algorithms. Solutions of some benchmark test cases on GPUs are reported, and the computed results are compared with experimental and computational data. Approaches to optimization of the CFD code related to the use of different types of memory are considered. The speedup of the solution on GPUs with respect to the solution on a central processing unit (CPU) is compared. Performance measurements show that the numerical schemes developed achieve a 20-50× speedup on GPU hardware compared to the CPU reference implementation. The results obtained provide a promising perspective for designing a GPU-based software framework for applications in CFD.

  1. A Fast MHD Code for Gravitationally Stratified Media using Graphical Processing Units: SMAUG

    Science.gov (United States)

    Griffiths, M. K.; Fedun, V.; Erdélyi, R.

    2015-03-01

    Parallelization techniques have been exploited most successfully by the gaming/graphics industry with the adoption of graphical processing units (GPUs), possessing hundreds of processor cores. The opportunity has been recognized by the computational sciences and engineering communities, who have recently harnessed successfully the numerical performance of GPUs. For example, parallel magnetohydrodynamic (MHD) algorithms are important for numerical modelling of highly inhomogeneous solar, astrophysical and geophysical plasmas. Here, we describe the implementation of SMAUG, the Sheffield Magnetohydrodynamics Algorithm Using GPUs. SMAUG is a 1-3D MHD code capable of modelling magnetized and gravitationally stratified plasma. The objective of this paper is to present the numerical methods and techniques used for porting the code to this novel and highly parallel compute architecture. The methods employed are justified by the performance benchmarks and validation results demonstrating that the code successfully simulates the physics for a range of test scenarios including a full 3D realistic model of wave propagation in the solar atmosphere.

  2. Accelerating Calculations of Reaction Dissipative Particle Dynamics in LAMMPS

    Science.gov (United States)

    2017-05-17

    HPC) resources and exploit emerging, heterogeneous architectures (e.g., co-processors and graphics processing units [GPUs]), while enabling EM... 2 ODE solvers—CVODE* and RKF45—which we previously developed for NVIDIA Compute Unified Device Architecture (CUDA) GPUs. The CPU versions of both... nodes. Half of the accelerator nodes (178) have 2 NVIDIA Kepler K40m GPUs and the remaining 178 accelerator nodes have 2 Intel Xeon Phi 7120P co-processors

  3. A Decade-Long European-Scale Convection-Resolving Climate Simulation on GPUs

    Science.gov (United States)

    Leutwyler, D.; Fuhrer, O.; Ban, N.; Lapillonne, X.; Lüthi, D.; Schar, C.

    2016-12-01

    Convection-resolving models have proven to be very useful tools in numerical weather prediction and in climate research. However, due to their extremely demanding computational requirements, they have so far been limited to short simulations and/or small computational domains. Innovations in the supercomputing domain have led to new supercomputer designs that involve conventional multi-core CPUs and accelerators such as graphics processing units (GPUs). One of the first atmospheric models that has been fully ported to GPUs is the Consortium for Small-Scale Modeling weather and climate model COSMO. This new version allows us to expand the size of the simulation domain to areas spanning continents and the time period up to one decade. We present results from a decade-long, convection-resolving climate simulation over Europe using the GPU-enabled COSMO version on a computational domain with 1536 × 1536 × 60 grid points. The simulation is driven by the ERA-Interim reanalysis. The results illustrate how the approach allows for the representation of interactions between synoptic-scale and meso-scale atmospheric circulations at scales ranging from 1000 to 10 km. We discuss some of the advantages and prospects of using GPUs, and focus on the performance of the convection-resolving modeling approach on the European scale. Specifically, we investigate the organization of convective clouds and validate hourly rainfall distributions against various high-resolution data sets.

  4. High-Performance Pseudo-Random Number Generation on Graphics Processing Units

    OpenAIRE

    Nandapalan, Nimalan; Brent, Richard P.; Murray, Lawrence M.; Rendell, Alistair

    2011-01-01

    This work considers the deployment of pseudo-random number generators (PRNGs) on graphics processing units (GPUs), developing an approach based on the xorgens generator to rapidly produce pseudo-random numbers of high statistical quality. The chosen algorithm has configurable state size and period, making it ideal for tuning to the GPU architecture. We present a comparison of both speed and statistical quality with other common parallel, GPU-based PRNGs, demonstrating favourable performance o...
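
    The record is truncated before any implementation detail. To illustrate the per-thread-state pattern such generators use on GPUs, here is a sketch with xorshift64* (a simple relative of the xorshift family that xorgens belongs to, not xorgens itself; the seeding scheme is an assumption):

    ```cuda
    // Each thread owns a small generator state, seeds it from its id, and
    // produces an independent stream. xorshift64* step after Vigna:
    // three xorshifts followed by a multiplicative scramble.
    __global__ void fill_random(unsigned long long seed, float* out, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;
        unsigned long long s = seed + tid * 0x9E3779B97F4A7C15ULL + 1ULL;
        s ^= s >> 12;
        s ^= s << 25;
        s ^= s >> 27;
        unsigned long long r = s * 0x2545F4914F6CDD1DULL;
        // Map the top 24 bits to a float in [0, 1).
        out[tid] = (r >> 40) * (1.0f / 16777216.0f);
    }
    ```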

  5. Architectural improvements and 28 nm FPGA implementation of the APEnet+ 3D Torus network for hybrid HPC systems

    International Nuclear Information System (INIS)

    Ammendola, Roberto; Biagioni, Andrea; Frezza, Ottorino; Cicero, Francesca Lo; Paolucci, Pier Stanislao; Lonardo, Alessandro; Rossetti, Davide; Simula, Francesco; Tosoratto, Laura; Vicini, Piero

    2014-01-01

    Modern Graphics Processing Units (GPUs) are now considered accelerators for general purpose computation. A tight interaction between the GPU and the interconnection network is the strategy to express the full potential on capability computing of a multi-GPU system on large HPC clusters; that is the reason why an efficient and scalable interconnect is a key technology to finally deliver GPUs for scientific HPC. In this paper we show the latest architectural and performance improvement of the APEnet+ network fabric, a FPGA-based PCIe board with 6 fully bidirectional off-board links with 34 Gbps of raw bandwidth per direction, and X8 Gen2 bandwidth towards the host PC. The board implements a Remote Direct Memory Access (RDMA) protocol that leverages upon peer-to-peer (P2P) capabilities of Fermi- and Kepler-class NVIDIA GPUs to obtain real zero-copy, low-latency GPU-to-GPU transfers. Finally, we report on the development activities for 2013 focusing on the adoption of the latest generation 28 nm FPGAs and the preliminary tests performed on this new platform.

  6. Architectural improvements and 28 nm FPGA implementation of the APEnet+ 3D Torus network for hybrid HPC systems

    Energy Technology Data Exchange (ETDEWEB)

    Ammendola, Roberto [INFN Sezione Roma Tor Vergata (Italy); Biagioni, Andrea; Frezza, Ottorino; Cicero, Francesca Lo; Paolucci, Pier Stanislao; Lonardo, Alessandro; Rossetti, Davide; Simula, Francesco; Tosoratto, Laura; Vicini, Piero [INFN Sezione Roma (Italy)

    2014-06-11

    Modern Graphics Processing Units (GPUs) are now considered accelerators for general purpose computation. A tight interaction between the GPU and the interconnection network is the strategy to express the full potential on capability computing of a multi-GPU system on large HPC clusters; that is the reason why an efficient and scalable interconnect is a key technology to finally deliver GPUs for scientific HPC. In this paper we show the latest architectural and performance improvement of the APEnet+ network fabric, a FPGA-based PCIe board with 6 fully bidirectional off-board links with 34 Gbps of raw bandwidth per direction, and X8 Gen2 bandwidth towards the host PC. The board implements a Remote Direct Memory Access (RDMA) protocol that leverages upon peer-to-peer (P2P) capabilities of Fermi- and Kepler-class NVIDIA GPUs to obtain real zero-copy, low-latency GPU-to-GPU transfers. Finally, we report on the development activities for 2013 focusing on the adoption of the latest generation 28 nm FPGAs and the preliminary tests performed on this new platform.

  7. An FMM based on dual tree traversal for many-core architectures

    KAUST Repository

    Yokota, Rio

    2013-09-01

    The present work attempts to integrate the independent efforts in the fast N-body community to create the fastest N-body library for many-core and heterogeneous architectures. Focus is placed on low-accuracy optimizations, in response to the recent interest in using FMM as a preconditioner for sparse linear solvers. A direct comparison with other state-of-the-art fast N-body codes demonstrates that orders-of-magnitude increases in performance can be achieved by careful selection of the optimal algorithm and low-level optimization of the code. The current N-body solver uses a fast multipole method with an efficient strategy for finding the list of cell-cell interactions by a dual tree traversal. A task-based threading model is used to maximize thread-level parallelism and intra-node load balancing. In order to extract the full potential of the SIMD units on the latest CPUs, the inner kernels are optimized using AVX instructions.
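
    A compact host-side C++ sketch of the dual tree traversal named above: the two trees are walked together, a multipole acceptance test decides whether a cell pair can be approximated, and otherwise the larger cell is split. The acceptance criterion, node layout and counters are illustrative stand-ins for the real M2L/P2P kernels:

    ```cuda
    #include <vector>

    struct Cell {
        float x, y, z, radius;
        std::vector<Cell*> children;          // empty for a leaf
    };

    long long m2l_count = 0, p2p_count = 0;   // stand-ins for the real kernels

    // Multipole acceptance test: well separated if the summed radii are small
    // compared with the distance (theta is the opening angle).
    bool well_separated(const Cell* a, const Cell* b, float theta)
    {
        float dx = a->x - b->x, dy = a->y - b->y, dz = a->z - b->z;
        float dist2 = dx * dx + dy * dy + dz * dz;
        float r = a->radius + b->radius;
        return r * r < theta * theta * dist2;
    }

    void traverse(Cell* target, Cell* source, float theta)
    {
        if (well_separated(target, source, theta)) {
            ++m2l_count;                      // cell-cell multipole interaction
        } else if (target->children.empty() && source->children.empty()) {
            ++p2p_count;                      // leaf-leaf: direct particle sums
        } else if (!target->children.empty() &&
                   (source->children.empty() || target->radius >= source->radius)) {
            for (Cell* c : target->children) traverse(c, source, theta);  // split target
        } else {
            for (Cell* c : source->children) traverse(target, c, theta);  // split source
        }
    }
    ```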

  8. Resting State Networks' Corticotopy: The Dual Intertwined Rings Architecture

    Science.gov (United States)

    Mesmoudi, Salma; Perlbarg, Vincent; Rudrauf, David; Messe, Arnaud; Pinsard, Basile; Hasboun, Dominique; Cioli, Claudia; Marrelec, Guillaume; Toro, Roberto; Benali, Habib; Burnod, Yves

    2013-01-01

    How does the brain integrate multiple sources of information to support normal sensorimotor and cognitive functions? To investigate this question we present an overall brain architecture (called “the dual intertwined rings architecture”) that relates the functional specialization of cortical networks to their spatial distribution over the cerebral cortex (or “corticotopy”). Recent results suggest that the resting state networks (RSNs) are organized into two large families: 1) a sensorimotor family that includes visual, somatic, and auditory areas and 2) a large association family that comprises parietal, temporal, and frontal regions and also includes the default mode network. We used two large databases of resting state fMRI data, from which we extracted 32 robust RSNs. We estimated: (1) the RSN functional roles by using a projection of the results on task based networks (TBNs) as referenced in large databases of fMRI activation studies; and (2) relationship of the RSNs with the Brodmann Areas. In both classifications, the 32 RSNs are organized into a remarkable architecture of two intertwined rings per hemisphere and so four rings linked by homotopic connections. The first ring forms a continuous ensemble and includes visual, somatic, and auditory cortices, with interspersed bimodal cortices (auditory-visual, visual-somatic and auditory-somatic, abbreviated as VSA ring). The second ring integrates distant parietal, temporal and frontal regions (PTF ring) through a network of association fiber tracts which closes the ring anatomically and ensures a functional continuity within the ring. The PTF ring relates association cortices specialized in attention, language and working memory, to the networks involved in motivation and biological regulation and rhythms. This “dual intertwined architecture” suggests a dual integrative process: the VSA ring performs fast real-time multimodal integration of sensorimotor information whereas the PTF ring performs multi

  9. Accelerating Multiple Compound Comparison Using LINGO-Based Load-Balancing Strategies on Multi-GPUs.

    Science.gov (United States)

    Lin, Chun-Yuan; Wang, Chung-Hung; Hung, Che-Lun; Lin, Yu-Shiang

    2015-01-01

    Compound comparison is an important task in computational chemistry. From the comparison results, potential inhibitors can be found and then used in pharmacy experiments. The time complexity of a pairwise compound comparison is O(n²), where n is the maximal length of compounds. In general, the length of compounds is tens to hundreds, and the computation time is small. However, more and more compounds have been synthesized and extracted, now numbering even more than tens of millions. Therefore, comparison against a large number of compounds (seen as a multiple compound comparison problem, abbreviated to MCC) remains time-consuming. The intrinsic time complexity of the MCC problem is O(k²n²) with k compounds of maximal length n. In this paper, we propose a GPU-based algorithm for the MCC problem, called CUDA-MCC, on single- and multi-GPUs. Four LINGO-based load-balancing strategies are considered in CUDA-MCC in order to accelerate the computation speed among thread blocks on GPUs. CUDA-MCC was implemented in C+OpenMP+CUDA. In our experiments, CUDA-MCC ran 45 times and 391 times faster than its CPU version on a single NVIDIA Tesla K20m GPU card and a dual-NVIDIA Tesla K20m GPU card, respectively.

  10. Fast DRR generation for 2D to 3D registration on GPUs

    Energy Technology Data Exchange (ETDEWEB)

    Tornai, Gabor Janos; Cserey, Gyoergy [Faculty of Information Technology, Pazmany Peter Catholic University, Prater u. 50/a, H-1083, Budapest (Hungary); Pappas, Ion [General Electric Healthcare, Akron u. 2, H-2040, Budaoers (Hungary)

    2012-08-15

    Purpose: The generation of digitally reconstructed radiographs (DRRs) is the most time-consuming step on the CPU in intensity-based two-dimensional x-ray to three-dimensional (CT or 3D rotational x-ray) medical image registration, which has application in several image guided interventions. This work presents optimized DRR rendering on graphical processor units (GPUs) and compares the performance achievable on four commercially available devices. Methods: A ray-cast-based DRR rendering was implemented for a 512 × 512 × 72 CT volume. The block size parameter was optimized for four different GPUs for a region of interest (ROI) of 400 × 225 pixels with different sampling ratios (1.1%-9.1% and 100%). Performance was statistically evaluated and compared for the four GPUs. The method and the block size dependence were validated on the latest GPU for several parameter settings with a public gold standard dataset (512 × 512 × 825 CT) for registration purposes. Results: Depending on the GPU, the full ROI is rendered in 2.7-5.2 ms. If a sampling ratio of 1.1%-9.1% is applied, execution time is in the range of 0.3-7.3 ms. On all GPUs, the mean execution time increased linearly with respect to the number of pixels if sampling was used. Conclusions: The presented results outperform other results from the literature. This indicates that automatic 2D to 3D registration, which typically requires a couple of hundred DRR renderings to converge, can be performed quasi online, in less than a second or, depending on the application and hardware, in less than a couple of seconds. Accordingly, a whole new field of applications is opened for image guided interventions, where the registration is continuously performed to match the real-time x-ray.
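
    The ray-casting itself is not listed in the record; a stripped-down CUDA kernel in its spirit, with a fixed step length, nearest-neighbour sampling and illustrative geometry (the paper's implementation additionally tunes block sizes and sampling ratios), might be:

    ```cuda
    // Each thread integrates attenuation along one ray through the CT volume.
    // Ray origins/directions are precomputed per pixel in volume coordinates.
    __global__ void drr_render(const float* __restrict__ vol,   // nx*ny*nz volume
                               int nx, int ny, int nz,
                               const float3* __restrict__ ray_org,
                               const float3* __restrict__ ray_dir,  // unit vectors
                               float step, int n_steps,
                               float* __restrict__ image, int width, int height)
    {
        int px = blockIdx.x * blockDim.x + threadIdx.x;
        int py = blockIdx.y * blockDim.y + threadIdx.y;
        if (px >= width || py >= height) return;
        int pid = py * width + px;
        float3 p = ray_org[pid], d = ray_dir[pid];
        float sum = 0.0f;
        for (int s = 0; s < n_steps; ++s) {
            // Nearest-neighbour (truncation) sampling inside the volume.
            int x = (int)p.x, y = (int)p.y, z = (int)p.z;
            if (x >= 0 && x < nx && y >= 0 && y < ny && z >= 0 && z < nz)
                sum += vol[(z * ny + y) * nx + x];
            p.x += d.x * step; p.y += d.y * step; p.z += d.z * step;
        }
        image[pid] = sum * step;   // line integral estimate
    }
    ```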

  11. Accelerating Smith-Waterman Algorithm for Biological Database Search on CUDA-Compatible GPUs

    Science.gov (United States)

    Munekawa, Yuma; Ino, Fumihiko; Hagihara, Kenichi

    This paper presents a fast method capable of accelerating the Smith-Waterman algorithm for biological database search on a cluster of graphics processing units (GPUs). Our method is implemented using the compute unified device architecture (CUDA), which is available on nVIDIA GPUs. Compared with previous methods, our method makes four major contributions. (1) The method efficiently uses on-chip shared memory to reduce the amount of data transferred between off-chip video memory and processing elements in the GPU. (2) It also reduces the number of data fetches by applying a data reuse technique to query and database sequences. (3) A pipelined method is also implemented to overlap GPU execution with database access. (4) Finally, a master/worker paradigm is employed to accelerate hundreds of database searches on a cluster system. In experiments, the peak performance on a GeForce GTX 280 card reaches 8.32 giga cell updates per second (GCUPS). We also find that our method reduces the amount of data fetched to 1/140, achieving approximately three times higher performance than a previous CUDA-based method. Our 32-node cluster version is approximately 28 times faster than a single-GPU version. Furthermore, the effective performance reaches 75.6 giga instructions per second (GIPS) using 32 GeForce 8800 GTX cards.
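
    As a point of reference for this class of methods, the simplest inter-task parallelization assigns one database sequence per thread and keeps only two DP rows. The shared-memory and data-reuse optimizations that give the paper its speedup are deliberately omitted, and the names, scores and linear gap model are assumptions:

    ```cuda
    #define MAXLEN 128   // assumed upper bound on the query length

    // Smith-Waterman local alignment score with a linear gap penalty.
    __device__ int sw_score(const char* q, int qlen,
                            const char* s, int slen,
                            int match, int mismatch, int gap)
    {
        int prev[MAXLEN + 1], curr[MAXLEN + 1];
        for (int j = 0; j <= qlen; ++j) prev[j] = 0;
        int best = 0;
        for (int i = 1; i <= slen; ++i) {
            curr[0] = 0;
            for (int j = 1; j <= qlen; ++j) {
                int h = prev[j - 1] + (q[j - 1] == s[i - 1] ? match : mismatch);
                if (prev[j] + gap > h) h = prev[j] + gap;       // gap in query
                if (curr[j - 1] + gap > h) h = curr[j - 1] + gap; // gap in subject
                if (h < 0) h = 0;                               // local alignment
                curr[j] = h;
                if (h > best) best = h;
            }
            for (int j = 0; j <= qlen; ++j) prev[j] = curr[j];
        }
        return best;
    }

    // One thread scores the query against one concatenated database sequence.
    __global__ void score_database(const char* query, int qlen,
                                   const char* db, const int* offset,
                                   const int* len, int n_seqs, int* scores)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n_seqs) return;
        scores[i] = sw_score(query, qlen, db + offset[i], len[i],
                             /*match=*/2, /*mismatch=*/-1, /*gap=*/-1);
    }
    ```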

  12. A Real-Time Early Cognitive Vision System based on a Hybrid coarse and fine grained Parallel Architecture

    DEFF Research Database (Denmark)

    Jensen, Lars Baunegaard With

    The current top-model GPUs from NVIDIA possess up to 240 homogeneous cores. In the past, GPUs have been hard to program, forcing the programmer to map the algorithm to the graphics processing pipeline and think in terms of vertex and fragment shaders, imposing a limiting factor in the implementation of non-graphics applications. This, however, has changed with the introduction of the Compute Unified Device Architecture (CUDA) framework from NVIDIA. The EV and ECV stages have different parallel properties. The regular, pixel-based processing of EV fits the GPU architecture very well, and parts of ECV, on the other hand

  13. Computation of Galois field expressions for quaternary logic functions on GPUs

    Directory of Open Access Journals (Sweden)

    Gajić Dušan B.

    2014-01-01

    Full Text Available Galois field (GF) expressions are polynomials used as representations of multiple-valued logic (MVL) functions. For this purpose, MVL functions are considered as functions defined over a finite (Galois) field of order p, GF(p). The problem of computing these functional expressions has an important role in areas such as digital signal processing and logic design. The time needed for computing GF-expressions increases exponentially with the number of variables in MVL functions and, as a result, it often represents a limiting factor in applications. This paper proposes a method for the accelerated computation of GF(4)-expressions for quaternary (four-valued) logic functions using graphics processing units (GPUs). The method is based on the spectral interpretation of GF-expressions, permitting the use of fast Fourier transform (FFT)-like algorithms for their computation. These algorithms are then adapted for highly parallel processing on GPUs. The performance of the proposed solutions is compared with reference C/C++ implementations of the same algorithms processed on central processing units (CPUs). Experimental results confirm that the presented approach leads to significant reductions in processing time (up to 10.86 times) when compared to CPU processing. Therefore, the proposed approach widens the set of problem instances which can be efficiently handled in practice. [Project of the Ministry of Science of the Republic of Serbia, nos. ON174026 and III44006]

  14. Near-global climate simulation at 1 km resolution: establishing a performance baseline on 4888 GPUs with COSMO 5.0

    Science.gov (United States)

    Fuhrer, Oliver; Chadha, Tarun; Hoefler, Torsten; Kwasniewski, Grzegorz; Lapillonne, Xavier; Leutwyler, David; Lüthi, Daniel; Osuna, Carlos; Schär, Christoph; Schulthess, Thomas C.; Vogt, Hannes

    2018-05-01

    The best hope for reducing long-standing global climate model biases is by increasing resolution to the kilometer scale. Here we present results from an ultrahigh-resolution non-hydrostatic climate model for a near-global setup running on the full Piz Daint supercomputer on 4888 GPUs (graphics processing units). The dynamical core of the model has been completely rewritten using a domain-specific language (DSL) for performance portability across different hardware architectures. Physical parameterizations and diagnostics have been ported using compiler directives. To our knowledge this represents the first complete atmospheric model being run entirely on accelerators on this scale. At a grid spacing of 930 m (1.9 km), we achieve a simulation throughput of 0.043 (0.23) simulated years per day and an energy consumption of 596 MWh per simulated year. Furthermore, we propose a new memory usage efficiency (MUE) metric that considers how efficiently the memory bandwidth - the dominant bottleneck of climate codes - is being used.

  15. Optimized batteries for cars with dual electrical architecture

    Science.gov (United States)

    Douady, J. P.; Pascon, C.; Dugast, A.; Fossati, G.

    During recent years, the increase in car electrical equipment has led to many problems with traditional starter batteries (such as cranking failure due to flat batteries, battery cycling etc.). The main causes of these problems are the double function of the automotive battery (starter and service functions) and the difficulties in designing batteries well adapted to these two functions. In order to solve these problems a new concept — the dual-concept — has been developed with two separate batteries: one battery is dedicated to the starter function and the other is dedicated to the service function. Only one alternator charges the two batteries with a separation device between the two electrical circuits. The starter battery is located in the engine compartment while the service battery is located at the rear of the car. From the analysis of new requirements, battery designs have been optimized regarding the two types of functions: (i) a small battery with high specific power for the starting function; for this function a flooded battery with lead-calcium alloy grids and thin plates is proposed; (ii) for the service function, modified sealed gas-recombinant batteries with cycling and deep-discharge ability have been developed. The various advantages of the dual-concept are studied in terms of starting reliability, battery weight, and voltage supply. The operating conditions of the system and several dual electrical architectures have also been studied in the laboratory and the car. The feasibility of the concept is proved.

  16. Data Sorting Using Graphics Processing Units

    Directory of Open Access Journals (Sweden)

    M. J. Mišić

    2012-06-01

    Full Text Available Graphics processing units (GPUs) have been increasingly used for general-purpose computation in recent years. GPU-accelerated applications are found in both scientific and commercial domains. Sorting is considered one of the very important operations in many applications, so its efficient implementation is essential for overall application performance. This paper represents an effort to analyze and evaluate implementations of representative sorting algorithms on graphics processing units. Three sorting algorithms (Quicksort, Merge sort, and Radix sort) were evaluated on the Compute Unified Device Architecture (CUDA) platform that is used to execute applications on NVIDIA graphics processing units. Algorithms were tested and evaluated using an automated test environment with input datasets of different characteristics. Finally, the results of this analysis are briefly discussed.
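
    The three algorithms in the record are hand-written kernels; for orientation, the same operation is available off the shelf through the Thrust library bundled with CUDA, which dispatches to a tuned radix sort for primitive keys. A minimal usage sketch:

    ```cuda
    #include <thrust/copy.h>
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <vector>

    std::vector<unsigned int> sort_on_gpu(const std::vector<unsigned int>& host_in)
    {
        // Copy to the device, sort there, copy back.
        thrust::device_vector<unsigned int> d(host_in.begin(), host_in.end());
        thrust::sort(d.begin(), d.end());
        std::vector<unsigned int> out(d.size());
        thrust::copy(d.begin(), d.end(), out.begin());
        return out;
    }
    ```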

  17. Architecture of nuclear power units

    International Nuclear Information System (INIS)

    Malaniuk, B.

    1981-01-01

    Nuclear units with circulation cooling using cooling towers are dominating points of the landscape. The individual cooling towers or pairs of cooling towers should be situated on the axes of double units and should be arranged linearly and rhythmically within the respective zone. Examples are shown of the architectural designs of several nuclear power plants in the USA, the UK, the USSR, France, the FRG and Italy. (H.S.)

  18. Accelerating Multiple Compound Comparison Using LINGO-Based Load-Balancing Strategies on Multi-GPUs

    Directory of Open Access Journals (Sweden)

    Chun-Yuan Lin

    2015-01-01

    Full Text Available Compound comparison is an important task in computational chemistry. From the comparison results, potential inhibitors can be found and then used in pharmacy experiments. The time complexity of a pairwise compound comparison is O(n²), where n is the maximal length of compounds. In general, the length of compounds is tens to hundreds, and the computation time is small. However, more and more compounds have been synthesized and extracted, now numbering even more than tens of millions. Therefore, comparison against a large number of compounds (seen as a multiple compound comparison problem, abbreviated to MCC) remains time-consuming. The intrinsic time complexity of the MCC problem is O(k²n²) with k compounds of maximal length n. In this paper, we propose a GPU-based algorithm for the MCC problem, called CUDA-MCC, on single- and multi-GPUs. Four LINGO-based load-balancing strategies are considered in CUDA-MCC in order to accelerate the computation speed among thread blocks on GPUs. CUDA-MCC was implemented in C+OpenMP+CUDA. In our experiments, CUDA-MCC ran 45 times and 391 times faster than its CPU version on a single NVIDIA Tesla K20m GPU card and a dual-NVIDIA Tesla K20m GPU card, respectively.

  19. Protein alignment algorithms with an efficient backtracking routine on multiple GPUs

    Directory of Open Access Journals (Sweden)

    Kierzynka Michal

    2011-05-01

    Full Text Available Abstract Background Pairwise sequence alignment methods are widely used in biological research. The increasing number of sequences is perceived as one of the upcoming challenges for sequence alignment methods in the near future. To overcome this challenge several GPU (Graphics Processing Unit) computing approaches have been proposed lately. These solutions show a great potential of a GPU platform but in most cases address the problem of sequence database scanning and compute only the alignment score, whereas the alignment itself is omitted. Thus, the need arose to implement the global and semiglobal Needleman-Wunsch, and Smith-Waterman algorithms with a backtracking procedure which is needed to construct the alignment. Results In this paper we present a solution that performs the alignment of every given sequence pair, which is a required step for progressive multiple sequence alignment methods, as well as for DNA recognition at the DNA assembly stage. Performed tests show that the implementation, with performance up to 6.3 GCUPS on a single GPU for affine gap penalties, is very efficient in comparison to other CPU- and GPU-based solutions. Moreover, multiple-GPU support with load balancing makes the application very scalable. Conclusions The article shows that the backtracking procedure of the sequence alignment algorithms may be designed to fit in with the GPU architecture. Therefore, our algorithm, apart from scores, is able to compute pairwise alignments. This opens a wide range of new possibilities, allowing other methods from the area of molecular biology to take advantage of the new computational architecture. Performed tests show that the efficiency of the implementation is excellent. Moreover, the speed of our GPU-based algorithms can be almost linearly increased when using more than one graphics card.

  20. CUDASW++2.0: enhanced Smith-Waterman protein database search on CUDA-enabled GPUs based on SIMT and virtualized SIMD abstractions

    Directory of Open Access Journals (Sweden)

    Schmidt Bertil

    2010-04-01

    Full Text Available Abstract Background Due to its high sensitivity, the Smith-Waterman algorithm is widely used for biological database searches. Unfortunately, the quadratic time complexity of this algorithm makes it highly time-consuming. The exponential growth of biological databases further deteriorates the situation. To accelerate this algorithm, many efforts have been made to develop techniques in high performance architectures, especially the recently emerging many-core architectures and their associated programming models. Findings This paper describes the latest release of the CUDASW++ software, CUDASW++ 2.0, which makes new contributions to Smith-Waterman protein database searches using compute unified device architecture (CUDA). A parallel Smith-Waterman algorithm is proposed to further optimize the performance of CUDASW++ 1.0 based on the single instruction, multiple thread (SIMT) abstraction. For the first time, we have investigated a partitioned vectorized Smith-Waterman algorithm using CUDA based on the virtualized single instruction, multiple data (SIMD) abstraction. The optimized SIMT and the partitioned vectorized algorithms were benchmarked and, remarkably, have similar performance characteristics. CUDASW++ 2.0 achieves performance improvements over CUDASW++ 1.0 of as much as 1.74 (1.72) times using the optimized SIMT algorithm and up to 1.77 (1.66) times using the partitioned vectorized algorithm, with a performance of up to 17 (30) billion cell updates per second (GCUPS) on a single-GPU GeForce GTX 280 (dual-GPU GeForce GTX 295) graphics card. Conclusions CUDASW++ 2.0 is publicly available open-source software, written in the CUDA and C++ programming languages. It obtains significant performance improvement over CUDASW++ 1.0 using either the optimized SIMT algorithm or the partitioned vectorized algorithm for Smith-Waterman protein database searches by fully exploiting the compute capability of commonly used CUDA-enabled low-cost GPUs.

  1. Electromagnetic Physics Models for Parallel Computing Architectures

    Science.gov (United States)

    Amadio, G.; Ananya, A.; Apostolakis, J.; Aurora, A.; Bandieramonte, M.; Bhattacharyya, A.; Bianchini, C.; Brun, R.; Canal, P.; Carminati, F.; Duhem, L.; Elvira, D.; Gheata, A.; Gheata, M.; Goulas, I.; Iope, R.; Jun, S. Y.; Lima, G.; Mohanty, A.; Nikitina, T.; Novak, M.; Pokorski, W.; Ribon, A.; Seghal, R.; Shadura, O.; Vallecorsa, S.; Wenzel, S.; Zhang, Y.

    2016-10-01

    The recent emergence of hardware architectures characterized by many-core or accelerated processors has opened new opportunities for concurrent programming models taking advantage of both SIMD and SIMT architectures. GeantV, a next generation detector simulation, has been designed to exploit both the vector capability of mainstream CPUs and multi-threading capabilities of coprocessors including NVidia GPUs and Intel Xeon Phi. The characteristics of these architectures are very different in terms of the vectorization depth and type of parallelization needed to achieve optimal performance. In this paper we describe implementation of electromagnetic physics models developed for parallel computing architectures as a part of the GeantV project. Results of preliminary performance evaluation and physics validation are presented as well.

  2. Iterative Methods for MPC on Graphical Processing Units

    DEFF Research Database (Denmark)

    Gade-Nielsen, Nicolai Fog; Jørgensen, John Bagterp; Dammann, Bernd

    2012-01-01

    The high floating point performance and memory bandwidth of Graphical Processing Units (GPUs) makes them ideal for a large number of computations which often arise in scientific computing, such as matrix operations. GPUs achieve this performance by utilizing massive parallelism, which requires ree... ...as to avoid the use of dense matrices, which may be too large for the limited memory capacity of current graphics cards.

  3. Performance Analysis of FEM Algorithms on GPU and Many-Core Architectures

    KAUST Repository

    Khurram, Rooh

    2015-04-27

    The roadmaps of the leading supercomputer manufacturers are based on hybrid systems, which consist of a mix of conventional processors and accelerators. This trend is mainly due to the fact that the power consumption cost of future CPU-only exascale systems would be unsustainable, thus accelerators such as graphics processing units (GPUs) and many-integrated-core (MIC) processors will likely be an integral part of the TOP500 (http://www.top500.org/) supercomputers beyond 2020. The emerging supercomputer architecture will bring new challenges for code developers. Continuum mechanics codes will be particularly affected, because the traditional synchronous implicit solvers will probably not scale on hybrid exascale machines. In a previous study [1], we reported on the performance of a conjugate gradient based mesh motion algorithm [2] on Sandy Bridge, Xeon Phi, and K20c. In the present study we report on a comparative study of finite element codes, using PETSc and AmgX solvers on CPUs and GPUs, respectively [3,4]. We believe this study will be a good starting point for FEM code developers who are contemplating a CPU-to-accelerator transition.

  4. MPC Toolbox with GPU Accelerated Optimization Algorithms

    DEFF Research Database (Denmark)

    Gade-Nielsen, Nicolai Fog; Jørgensen, John Bagterp; Dammann, Bernd

    2012-01-01

    The introduction of Graphical Processing Units (GPUs) in scientific computing has shown great promise in many different fields. While GPUs are capable of very high floating point performance and memory bandwidth, their massively parallel architecture requires algorithms to be reimplemented to suit

  5. Exploiting graphics processing units for computational biology and bioinformatics.

    Science.gov (United States)

    Payne, Joshua L; Sinnott-Armstrong, Nicholas A; Moore, Jason H

    2010-09-01

    Advances in the video gaming industry have led to the production of low-cost, high-performance graphics processing units (GPUs) that possess more memory bandwidth and computational capability than central processing units (CPUs), the standard workhorses of scientific computing. With the recent release of general-purpose GPUs and NVIDIA's GPU programming language, CUDA, graphics engines are being adopted widely in scientific computing applications, particularly in the fields of computational biology and bioinformatics. The goal of this article is to concisely present an introduction to GPU hardware and programming, aimed at the computational biologist or bioinformaticist. To this end, we discuss the primary differences between GPU and CPU architecture, introduce the basics of the CUDA programming language, and discuss important CUDA programming practices, such as the proper use of coalesced reads, data types, and memory hierarchies. We highlight each of these topics in the context of computing the all-pairs distance between instances in a dataset, a common procedure in numerous disciplines of scientific computing. We conclude with a runtime analysis of the GPU and CPU implementations of the all-pairs distance calculation. We show our final GPU implementation to outperform the CPU implementation by a factor of 1700.
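
    The article's running example, the all-pairs distance between instances of a dataset, reduces to a kernel like the following naive sketch (one thread per pair, no tiling into shared memory; a row-major layout and illustrative names are assumed):

    ```cuda
    #include <math.h>

    // dist[i*n + j] = Euclidean distance between instances i and j.
    __global__ void all_pairs_distance(const float* __restrict__ data, // n x d
                                       float* __restrict__ dist,       // n x n
                                       int n, int d)
    {
        int i = blockIdx.y * blockDim.y + threadIdx.y;
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n || j >= n) return;
        float acc = 0.0f;
        for (int k = 0; k < d; ++k) {
            float diff = data[i * d + k] - data[j * d + k];
            acc += diff * diff;
        }
        dist[i * n + j] = sqrtf(acc);
    }
    ```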

  6. Accelerating cardiac bidomain simulations using graphics processing units.

    Science.gov (United States)

    Neic, A; Liebmann, M; Hoetzl, E; Mitchell, L; Vigmond, E J; Haase, G; Plank, G

    2012-08-01

    Anatomically realistic and biophysically detailed multiscale computer models of the heart are playing an increasingly important role in advancing our understanding of integrated cardiac function in health and disease. Such detailed simulations, however, are computationally vastly demanding, which is a limiting factor for a wider adoption of in-silico modeling. While current trends in high-performance computing (HPC) hardware promise to alleviate this problem, exploiting the potential of such architectures remains challenging, since strongly scalable algorithms are required to reduce execution times. Alternatively, acceleration technologies such as graphics processing units (GPUs) are being considered. While the potential of GPUs has been demonstrated in various applications, the benefits in the context of bidomain simulations, where large sparse linear systems have to be solved in parallel with advanced numerical techniques, are less clear. In this study, the feasibility of multi-GPU bidomain simulations is demonstrated by running strong scalability benchmarks using a state-of-the-art model of rabbit ventricles. The model is spatially discretized using the finite element method (FEM) on fully unstructured grids. The GPU code is directly derived from a large pre-existing code, the Cardiac Arrhythmia Research Package (CARP), with very minor perturbation of the code base. Overall, bidomain simulations were sped up by a factor of 11.8 to 16.3 in benchmarks running on 6-20 GPUs compared to the same number of CPU cores. To match the fastest GPU simulation, which engaged 20 GPUs, 476 CPU cores were required on a national supercomputing facility.

  7. Electromagnetic Physics Models for Parallel Computing Architectures

    International Nuclear Information System (INIS)

    Amadio, G; Bianchini, C; Iope, R; Ananya, A; Apostolakis, J; Aurora, A; Bandieramonte, M; Brun, R; Carminati, F; Gheata, A; Gheata, M; Goulas, I; Nikitina, T; Bhattacharyya, A; Mohanty, A; Canal, P; Elvira, D; Jun, S Y; Lima, G; Duhem, L

    2016-01-01

    The recent emergence of hardware architectures characterized by many-core or accelerated processors has opened new opportunities for concurrent programming models taking advantage of both SIMD and SIMT architectures. GeantV, a next-generation detector simulation, has been designed to exploit both the vector capability of mainstream CPUs and the multi-threading capabilities of coprocessors including NVidia GPUs and Intel Xeon Phi. The characteristics of these architectures are very different in terms of the vectorization depth and the type of parallelization needed to achieve optimal performance. In this paper we describe the implementation of electromagnetic physics models developed for parallel computing architectures as part of the GeantV project. Results of a preliminary performance evaluation and physics validation are presented as well. (paper)

  8. Object tracking mask-based NLUT on GPUs for real-time generation of holographic videos of three-dimensional scenes.

    Science.gov (United States)

    Kwon, M-W; Kim, S-C; Yoon, S-E; Ho, Y-S; Kim, E-S

    2015-02-09

    A new object tracking mask-based novel-look-up-table (OTM-NLUT) method is proposed and implemented on graphics-processing-units (GPUs) for real-time generation of holographic videos of three-dimensional (3-D) scenes. Since the proposed method is designed to be matched with software and memory structures of the GPU, the number of compute-unified-device-architecture (CUDA) kernel function calls and the computer-generated hologram (CGH) buffer size of the proposed method have been significantly reduced. It therefore results in a great increase of the computational speed of the proposed method and enables real-time generation of CGH patterns of 3-D scenes. Experimental results show that the proposed method can generate 31.1 frames of Fresnel CGH patterns with 1,920 × 1,080 pixels per second, on average, for three test 3-D video scenarios with 12,666 object points on three GPU boards of NVIDIA GTX TITAN, and confirm the feasibility of the proposed method in the practical application of electro-holographic 3-D displays.

  9. Methods to Load Balance a GCR Pressure Solver Using a Stencil Framework on Multi- and Many-Core Architectures

    Directory of Open Access Journals (Sweden)

    Milosz Ciznicki

    2015-01-01

    The recent advent of novel multi- and many-core architectures forces application programmers to deal with hardware-specific implementation details and to be familiar with software optimisation techniques to benefit from new high-performance computing machines. Extra care must be taken for communication-intensive algorithms, which may be a bottleneck for the forthcoming era of exascale computing. This paper presents a high-level stencil framework implemented for the EULerian or LAGrangian model (EULAG) that efficiently utilises multi- and many-core architectures. Only the efficient use of both multi-core processors (CPUs) and graphics processing units (GPUs), together with a flexible data decomposition method, can deliver maximum performance and scale the communication-intensive Generalized Conjugate Residual (GCR) elliptic solver with preconditioner.

  10. Study on availability of GPU for scientific and engineering calculations

    International Nuclear Information System (INIS)

    Sakamoto, Kensaku; Kobayashi, Seiji

    2009-07-01

    Recently, the number of scientific and engineering calculations on GPUs (Graphic Processing Units) is increasing. It is said that GPUs have much higher peak floating-point processing power and memory bandwidth than CPUs (Central Processing Units). We have studied the effectiveness of GPUs by applying them to fundamental scientific and engineering calculations with the CUDA (Compute Unified Device Architecture) development tools. The results have shown the following: 1) Computations on GPUs are effective for such calculations as matrix operations, FFT (Fast Fourier Transform) and CFD (Computational Fluid Dynamics) in the nuclear research domain. 2) Highly advanced programming is required to bring out the high performance of GPUs. 3) Double-precision performance is low, and ECC (Error Correction Code) support in graphics memory systems is lacking. (author)
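    As a concrete illustration of the CUDA development tools the study refers to, the sketch below runs a single-precision complex-to-complex FFT on the device through the cuFFT library; the transform length and array contents are placeholders.

    ```cuda
    #include <cuda_runtime.h>
    #include <cufft.h>

    int main()
    {
        const int N = 1 << 20;                 // transform length (placeholder)
        cufftComplex *d_signal;
        cudaMalloc((void **)&d_signal, sizeof(cufftComplex) * N);
        // ... copy input samples into d_signal with cudaMemcpy ...

        cufftHandle plan;
        cufftPlan1d(&plan, N, CUFFT_C2C, 1);                    // one batch
        cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);  // in place

        cufftDestroy(plan);
        cudaFree(d_signal);
        return 0;
    }
    ```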

  11. General purpose graphic processing unit implementation of adaptive pulse compression algorithms

    Science.gov (United States)

    Cai, Jingxiao; Zhang, Yan

    2017-07-01

    This study introduces a practical approach to implementing real-time signal processing algorithms for general surveillance radar based on NVIDIA graphics processing units (GPUs). The pulse compression algorithms are implemented using compute unified device architecture (CUDA) libraries such as the CUDA basic linear algebra subroutines and the CUDA fast Fourier transform library, which are adopted from open source libraries and optimized for NVIDIA GPUs. For more advanced, adaptive processing algorithms such as adaptive pulse compression, customized kernel optimization is needed and is investigated here. A statistical optimization approach is developed for this purpose without requiring much knowledge of the physical configurations of the kernels. It was found that the kernel optimization approach can significantly improve performance. Benchmark performance is compared with CPU performance in terms of processing acceleration. The proposed implementation framework can be used in various radar systems including ground-based phased array radar, airborne sense-and-avoid radar, and aerospace surveillance radar.
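    Pulse compression of this kind is commonly realized as a frequency-domain matched filter: transform the received signal and the reference pulse with cuFFT, multiply by the conjugate spectrum, then transform back. The kernel below sketches only the custom multiply step; it is a generic illustration under that assumption, not the authors' optimized kernel.

    ```cuda
    #include <cufft.h>

    // Multiply the received-signal spectrum by the conjugate of the
    // reference-pulse spectrum; `scale` = 1/n compensates for cuFFT's
    // unnormalized inverse transform applied afterwards.
    __global__ void conjMultiply(const cufftComplex *sig,
                                 const cufftComplex *ref,
                                 cufftComplex *out, int n, float scale)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        cufftComplex a = sig[i], b = ref[i];
        out[i].x = (a.x * b.x + a.y * b.y) * scale;  // Re(a * conj(b))
        out[i].y = (a.y * b.x - a.x * b.y) * scale;  // Im(a * conj(b))
    }
    ```

    The full pipeline then reads: forward FFT of signal and reference, conjMultiply, inverse FFT, with the cuFFT calls mirroring the example above.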

  12. Real-time capture and reconstruction system with multiple GPUs for a 3D live scene by a generation from 4K IP images to 8K holograms.

    Science.gov (United States)

    Ichihashi, Yasuyuki; Oi, Ryutaro; Senoh, Takanori; Yamamoto, Kenji; Kurita, Taiichiro

    2012-09-10

    We developed a real-time capture and reconstruction system for three-dimensional (3D) live scenes. In previous research, we used integral photography (IP) to capture 3D images and then generated holograms from the IP images to implement a real-time reconstruction system. In this paper, we use a 4K (3,840 × 2,160) camera to capture IP images and 8K (7,680 × 4,320) liquid crystal display (LCD) panels for the reconstruction of holograms. We investigate two methods for enlarging the 4K images that were captured by integral photography to 8K images. One of the methods increases the number of pixels of each elemental image. The other increases the number of elemental images. In addition, we developed a personal computer (PC) cluster system with graphics processing units (GPUs) for the enlargement of IP images and the generation of holograms from the IP images using fast Fourier transform (FFT). We used the Compute Unified Device Architecture (CUDA) as the development environment for the GPUs. The Fast Fourier transform is performed using the CUFFT (CUDA FFT) library. As a result, we developed an integrated system for performing all processing from the capture to the reconstruction of 3D images by using these components and successfully used this system to reconstruct a 3D live scene at 12 frames per second.

  13. End-to-end plasma bubble PIC simulations on GPUs

    Science.gov (United States)

    Germaschewski, Kai; Fox, William; Matteucci, Jackson; Bhattacharjee, Amitava

    2017-10-01

    Accelerator technologies play a crucial role in eventually achieving exascale computing capabilities. The current and upcoming leadership machines at ORNL (Titan and Summit) employ Nvidia GPUs, which provide vast computational power but also need specifically adapted computational kernels to fully exploit them. In this work, we will show end-to-end particle-in-cell simulations of the formation, evolution and coalescence of laser-generated plasma bubbles. This work showcases the GPU capabilities of the PSC particle-in-cell code, which has been adapted for this problem to support particle injection, a heating operator and a collision operator on GPUs.
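    For orientation, the core of a particle-in-cell time step is an independent update per particle, which is what makes the method such a good fit for GPUs. The kernel below is a deliberately simplified 1D electrostatic push (nearest-grid-point field gather, leapfrog update) written here for illustration; the PSC kernels for injection, heating and collisions are considerably more involved.

    ```cuda
    // Advance one particle per thread: gather E at the particle, then
    // leapfrog-update velocity and position. qm is the charge-to-mass ratio.
    __global__ void pushParticles(float *x, float *v, const float *E,
                                  int np, int nx, float dx, float qm,
                                  float dt)
    {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p >= np) return;

        int cell = (int)(x[p] / dx);          // nearest-grid-point gather
        cell = min(max(cell, 0), nx - 1);

        v[p] += qm * E[cell] * dt;            // accelerate
        x[p] += v[p] * dt;                    // drift
    }
    ```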

  14. In-situ characterization of symmetric dual-pass architecture of microfluidic co-laminar flow cells

    International Nuclear Information System (INIS)

    Ibrahim, Omar A.; Goulet, Marc-Antoni; Kjeang, Erik

    2016-01-01

    Highlights: • An analytical cell design is proposed for characterization of dual-pass flow cells • High power density up to 0.75 W cm⁻² is demonstrated • The performance contributions of the inlet and outlet passes are of the same order • Downstream crossover is analyzed as a function of cell current and flow rate - Abstract: Microfluidic co-laminar flow cells with dual-pass architecture enable fuel recirculation and in-situ regeneration, and offer improvements in performance characteristics. In this work, a unique analytical cell design is proposed, with two split portions having flow-through porous electrodes. Each cell portion is first tested individually with vanadium redox species and the results are used to quantify the previously unknown crossover losses at the downstream portion of the cell, shown here to be a strong function of the flow rate. Moreover, the upstream cell portion demonstrates impressive room-temperature power density up to 0.75 W cm⁻² at 1.0 A cm⁻², which is the highest performance reported to date for a microfluidic vanadium redox battery. Next, the two cell portions are connected in parallel to resemble a complete cell with dual-pass architecture, thereby enabling novel in-situ diagnostics of the inlet and outlet passes of the cell. For instance, the reactant utilization efficiency of the downstream cell portion is shown to be on the same order as that of the upstream portion at both low and high flow rates. Furthermore, in-situ regeneration is also demonstrated. Overall, the present results provide a deeper understanding of dual-pass reactant conversion and crossover which will be useful for future device optimization.

  15. GPUs for real-time processing in HEP trigger systems (CHEP2013: 20. international conference on computing in high energy and nuclear physics)

    Energy Technology Data Exchange (ETDEWEB)

    Lamanna, G; Piandani, R [INFN, Pisa (Italy)]; Ammendola, R [INFN, Rome "Tor Vergata" (Italy)]; Bauce, M; Giagu, S; Messina, A [University, Rome "Sapienza" (Italy)]; Biagioni, A; Lonardo, A; Paolucci, P S; Rescigno, M; Simula, F; Vicini, P [INFN, Rome "Sapienza" (Italy)]; Fantechi, R [CERN, Geneve (Switzerland)]; Fiorini, M [University and INFN, Ferrara (Italy)]; Graverini, E; Pantaleo, F; Sozzi, M [University, Pisa (Italy)]

    2014-06-11

    We describe a pilot project for the use of Graphics Processing Units (GPUs) for online triggering applications in High Energy Physics (HEP) experiments. Two major trends can be identified in the development of trigger and DAQ systems for HEP experiments: the massive use of general-purpose commodity systems such as commercial multicore PC farms for data acquisition, and the reduction of trigger levels implemented in hardware, towards a pure software selection system (trigger-less). The very innovative approach presented here aims at exploiting the parallel computing power of commercial GPUs to perform fast computations in software at both low- and high-level trigger stages. General-purpose computing on GPUs is emerging as a new paradigm in several fields of science, although so far applications have been tailored to the specific strengths of such devices as accelerators in offline computation. With the steady reduction of GPU latencies, and the increase in link and memory throughputs, the use of such devices for real-time applications in high-energy physics data acquisition and trigger systems is becoming very attractive. We discuss in detail the use of online parallel computing on GPUs for a synchronous low-level trigger with fixed latency. In particular we show preliminary results from a first test in the NA62 experiment at CERN. The use of GPUs in high-level triggers is also considered; the ATLAS experiment (and in particular the muon trigger) at CERN is taken as a case study of possible applications.
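    For fixed-latency operation of the kind discussed here, the host-to-device transfer is usually the dominant and most variable cost; page-locked (pinned) host buffers and asynchronous copies on a CUDA stream are the standard way to make it fast and predictable. Below is a minimal sketch of that pattern, with an illustrative threshold-counting kernel standing in for a real trigger algorithm; all names and sizes are assumptions made here.

    ```cuda
    #include <cuda_runtime.h>

    // Toy trigger primitive: count samples above threshold.
    __global__ void countHits(const float *ev, int n, float thr, int *hits)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && ev[i] > thr) atomicAdd(hits, 1);
    }

    int main()
    {
        const int N = 4096;
        float *h_event, *d_event;
        int *d_hits;
        cudaHostAlloc((void **)&h_event, N * sizeof(float),
                      cudaHostAllocDefault);            // pinned host buffer
        cudaMalloc((void **)&d_event, N * sizeof(float));
        cudaMalloc((void **)&d_hits, sizeof(int));
        // ... fill h_event from the DAQ ...

        cudaStream_t s;
        cudaStreamCreate(&s);

        // Per event: async copy + kernel on one stream, so the copy for
        // the next event can overlap the kernel for the current one.
        cudaMemsetAsync(d_hits, 0, sizeof(int), s);
        cudaMemcpyAsync(d_event, h_event, N * sizeof(float),
                        cudaMemcpyHostToDevice, s);
        countHits<<<(N + 255) / 256, 256, 0, s>>>(d_event, N, 5.0f, d_hits);
        cudaStreamSynchronize(s);   // bounded wait before the trigger decision

        cudaStreamDestroy(s);
        cudaFreeHost(h_event);
        cudaFree(d_event);
        cudaFree(d_hits);
        return 0;
    }
    ```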

  16. GPUs for real-time processing in HEP trigger systems (CHEP2013: 20. international conference on computing in high energy and nuclear physics)

    International Nuclear Information System (INIS)

    Lamanna, G; Piandani, R; Ammendola, R; Bauce, M; Giagu, S; Messina, A; Biagioni, A; Lonardo, A; Paolucci, P S; Rescigno, M; Simula, F; Vicini, P; Fantechi, R; Fiorini, M; Graverini, E; Pantaleo, F; Sozzi, M

    2014-01-01

    We describe a pilot project for the use of Graphics Processing Units (GPUs) for online triggering applications in High Energy Physics (HEP) experiments. Two major trends can be identified in the development of trigger and DAQ systems for HEP experiments: the massive use of general-purpose commodity systems such as commercial multicore PC farms for data acquisition, and the reduction of trigger levels implemented in hardware, towards a pure software selection system (trigger-less). The very innovative approach presented here aims at exploiting the parallel computing power of commercial GPUs to perform fast computations in software at both low- and high-level trigger stages. General-purpose computing on GPUs is emerging as a new paradigm in several fields of science, although so far applications have been tailored to the specific strengths of such devices as accelerators in offline computation. With the steady reduction of GPU latencies, and the increase in link and memory throughputs, the use of such devices for real-time applications in high-energy physics data acquisition and trigger systems is becoming very attractive. We discuss in detail the use of online parallel computing on GPUs for a synchronous low-level trigger with fixed latency. In particular we show preliminary results from a first test in the NA62 experiment at CERN. The use of GPUs in high-level triggers is also considered; the ATLAS experiment (and in particular the muon trigger) at CERN is taken as a case study of possible applications.

  17. Marginally Stable Triangular Recurrent Neural Network Architecture for Time Series Prediction.

    Science.gov (United States)

    Sivakumar, Seshadri; Sivakumar, Shyamala

    2017-09-25

    This paper introduces a discrete-time recurrent neural network architecture using triangular feedback weight matrices that allows a simplified approach to ensuring network and training stability. The triangular structure of the weight matrices is exploited to readily ensure that the eigenvalues of the feedback weight matrix represented by the block diagonal elements lie on the unit circle in the complex z-plane by updating these weights based on the differential of the angular error variable. Such placement of the eigenvalues together with the extended close interaction between state variables facilitated by the nondiagonal triangular elements, enhances the learning ability of the proposed architecture. Simulation results show that the proposed architecture is highly effective in time-series prediction tasks associated with nonlinear and chaotic dynamic systems with underlying oscillatory modes. This modular architecture with dual upper and lower triangular feedback weight matrices mimics fully recurrent network architectures, while maintaining learning stability with a simplified training process. While training, the block-diagonal weights (hence the eigenvalues) of the dual triangular matrices are constrained to the same values during weight updates aimed at minimizing the possibility of overfitting. The dual triangular architecture also exploits the benefit of parsing the input and selectively applying the parsed inputs to the two subnetworks to facilitate enhanced learning performance.

  18. Kalman filter tracking on parallel architectures

    Science.gov (United States)

    Cerati, G.; Elmer, P.; Krutelyov, S.; Lantz, S.; Lefebvre, M.; McDermott, K.; Riley, D.; Tadel, M.; Wittich, P.; Wurthwein, F.; Yagil, A.

    2017-10-01

    We report on the progress of our studies towards a Kalman filter track reconstruction algorithm with optimal performance on manycore architectures. The combinatorial structure of these algorithms is not immediately compatible with an efficient SIMD (or SIMT) implementation; the challenge for us is to recast the existing software so it can readily generate hundreds of shared-memory threads that exploit the underlying instruction set of modern processors. We show how the data and associated tasks can be organized in a way that is conducive to both multithreading and vectorization. We demonstrate very good performance on Intel Xeon and Xeon Phi architectures, as well as promising first results on Nvidia GPUs.

  19. Time-dependent density-functional theory in massively parallel computer architectures: the OCTOPUS project.

    Science.gov (United States)

    Andrade, Xavier; Alberdi-Rodriguez, Joseba; Strubbe, David A; Oliveira, Micael J T; Nogueira, Fernando; Castro, Alberto; Muguerza, Javier; Arruabarrena, Agustin; Louie, Steven G; Aspuru-Guzik, Alán; Rubio, Angel; Marques, Miguel A L

    2012-06-13

    Octopus is a general-purpose density-functional theory (DFT) code, with a particular emphasis on the time-dependent version of DFT (TDDFT). In this paper we present the ongoing efforts to achieve the parallelization of octopus. We focus on the real-time variant of TDDFT, where the time-dependent Kohn-Sham equations are directly propagated in time. This approach has great potential for execution in massively parallel systems such as modern supercomputers with thousands of processors and graphics processing units (GPUs). For harvesting the potential of conventional supercomputers, the main strategy is a multi-level parallelization scheme that combines the inherent scalability of real-time TDDFT with a real-space grid domain-partitioning approach. A scalable Poisson solver is critical for the efficiency of this scheme. For GPUs, we show how using blocks of Kohn-Sham states provides the required level of data parallelism and that this strategy is also applicable for code optimization on standard processors. Our results show that real-time TDDFT, as implemented in octopus, can be the method of choice for studying the excited states of large molecular systems in modern parallel architectures.

  20. Time-dependent density-functional theory in massively parallel computer architectures: the octopus project

    Science.gov (United States)

    Andrade, Xavier; Alberdi-Rodriguez, Joseba; Strubbe, David A.; Oliveira, Micael J. T.; Nogueira, Fernando; Castro, Alberto; Muguerza, Javier; Arruabarrena, Agustin; Louie, Steven G.; Aspuru-Guzik, Alán; Rubio, Angel; Marques, Miguel A. L.

    2012-06-01

    Octopus is a general-purpose density-functional theory (DFT) code, with a particular emphasis on the time-dependent version of DFT (TDDFT). In this paper we present the ongoing efforts to achieve the parallelization of octopus. We focus on the real-time variant of TDDFT, where the time-dependent Kohn-Sham equations are directly propagated in time. This approach has great potential for execution in massively parallel systems such as modern supercomputers with thousands of processors and graphics processing units (GPUs). For harvesting the potential of conventional supercomputers, the main strategy is a multi-level parallelization scheme that combines the inherent scalability of real-time TDDFT with a real-space grid domain-partitioning approach. A scalable Poisson solver is critical for the efficiency of this scheme. For GPUs, we show how using blocks of Kohn-Sham states provides the required level of data parallelism and that this strategy is also applicable for code optimization on standard processors. Our results show that real-time TDDFT, as implemented in octopus, can be the method of choice for studying the excited states of large molecular systems in modern parallel architectures.

  1. Time-dependent density-functional theory in massively parallel computer architectures: the octopus project

    International Nuclear Information System (INIS)

    Andrade, Xavier; Aspuru-Guzik, Alán; Alberdi-Rodriguez, Joseba; Rubio, Angel; Strubbe, David A; Louie, Steven G; Oliveira, Micael J T; Nogueira, Fernando; Castro, Alberto; Muguerza, Javier; Arruabarrena, Agustin; Marques, Miguel A L

    2012-01-01

    Octopus is a general-purpose density-functional theory (DFT) code, with a particular emphasis on the time-dependent version of DFT (TDDFT). In this paper we present the ongoing efforts to achieve the parallelization of octopus. We focus on the real-time variant of TDDFT, where the time-dependent Kohn-Sham equations are directly propagated in time. This approach has great potential for execution in massively parallel systems such as modern supercomputers with thousands of processors and graphics processing units (GPUs). For harvesting the potential of conventional supercomputers, the main strategy is a multi-level parallelization scheme that combines the inherent scalability of real-time TDDFT with a real-space grid domain-partitioning approach. A scalable Poisson solver is critical for the efficiency of this scheme. For GPUs, we show how using blocks of Kohn-Sham states provides the required level of data parallelism and that this strategy is also applicable for code optimization on standard processors. Our results show that real-time TDDFT, as implemented in octopus, can be the method of choice for studying the excited states of large molecular systems in modern parallel architectures. (topical review)

  2. Graphics processing unit based computation for NDE applications

    Science.gov (United States)

    Nahas, C. A.; Rajagopal, Prabhu; Balasubramaniam, Krishnan; Krishnamurthy, C. V.

    2012-05-01

    Advances in parallel processing in recent years are helping to reduce the cost of numerical simulation. Breakthroughs in Graphical Processing Unit (GPU) based computation now offer the prospect of further drastic improvements. The introduction of 'compute unified device architecture' (CUDA) by NVIDIA (the global technology company based in Santa Clara, California, USA) has made programming GPUs for general purpose computing accessible to the average programmer. Here we use CUDA to develop parallel finite difference schemes applicable to two problems of interest to the NDE community, namely heat diffusion and elastic wave propagation. The implementations are two-dimensional. The performance improvement of the GPU implementation over a serial CPU implementation is then discussed.
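    A minimal example of the kind of finite-difference kernel the paper develops: one explicit (FTCS) update of the 2D heat equation, with one thread per grid point. This is a generic sketch under the usual stability constraint r = alpha*dt/h² ≤ 1/4, not the authors' code.

    ```cuda
    // One FTCS step of u_t = alpha (u_xx + u_yy) on an nx-by-ny grid;
    // boundary points are left untouched. r = alpha * dt / h^2.
    __global__ void heatStep(const float *u, float *unew,
                             int nx, int ny, float r)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i < 1 || j < 1 || i >= nx - 1 || j >= ny - 1) return;

        int c = j * nx + i;
        unew[c] = u[c] + r * (u[c - 1] + u[c + 1]
                            + u[c - nx] + u[c + nx] - 4.0f * u[c]);
    }
    // Host side: swap the u / unew pointers after each step.
    ```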

  3. GPUs for real-time processing in HEP trigger systems (ACAT2013: 15. international workshop on advanced computing and analysis techniques in physics research)

    International Nuclear Information System (INIS)

    Ammendola, R; Biagioni, A; Frezza, O; Cicero, F Lo; Lonardo, A; Messina, A; Paolucci, PS; Rossetti, D; Simula, F; Tosoratto, L; Vicini, P; Deri, L; Sozzi, M; Pantaleo, F; Fiorini, M; Lamanna, G

    2014-01-01

    We describe a pilot project (GAP – GPU Application Project) for the use of GPUs (Graphics processing units) for online triggering applications in High Energy Physics experiments. Two major trends can be identified in the development of trigger and DAQ systems for particle physics experiments: the massive use of general-purpose commodity systems such as commercial multicore PC farms for data acquisition, and the reduction of trigger levels implemented in hardware, towards a fully software data selection system (trigger-less). The innovative approach presented here aims at exploiting the parallel computing power of commercial GPUs to perform fast computations in software, not only at high trigger levels but also in early trigger stages. General-purpose computing on GPUs is emerging as a new paradigm in several fields of science, although so far applications have been tailored to the specific strengths of such devices as accelerators in offline computation. With the steady reduction of GPU latencies, and the increase in link and memory throughputs, the use of such devices for real-time applications in high energy physics data acquisition and trigger systems is becoming relevant. We discuss in detail the use of online parallel computing on GPUs for synchronous low-level triggers with fixed latency. In particular we show preliminary results from a first test in the CERN NA62 experiment. The use of GPUs in high-level triggers is also considered, the CERN ATLAS experiment being taken as a case study of possible applications.

  4. Directions in parallel processor architecture, and GPUs too

    CERN Multimedia

    CERN. Geneva

    2014-01-01

    Modern computing is power-limited in every domain of computing. Performance increments extracted from instruction-level parallelism (ILP) are no longer power-efficient; they haven't been for some time. Thread-level parallelism (TLP) is a more easily exploited form of parallelism, at the expense of programmer effort to expose it in the program. In this talk, I will introduce you to disparate topics in parallel processor architecture that will impact programming models (and you) in both the near and far future. About the speaker: Olivier is a senior GPU (SM) architect at NVIDIA and an active participant in the concurrency working group of the ISO C++ committee. He has also worked on very large diesel engines as a mechanical engineer, and taught at McGill University (Canada) as a faculty instructor.

  5. Optimization Solutions for Improving the Performance of the Parallel Reduction Algorithm Using Graphics Processing Units

    Directory of Open Access Journals (Sweden)

    Ion LUNGU

    2012-01-01

    In this paper, we research, analyze and develop optimization solutions for the parallel reduction function using graphics processing units (GPUs) that implement the Compute Unified Device Architecture (CUDA), a modern and novel approach for improving the software performance of data processing applications and algorithms. Many of these applications and algorithms make use of the reduction function in their computational steps. After having designed the function and its algorithmic steps in CUDA, we have progressively developed and implemented optimization solutions for the reduction function. In order to confirm, test and evaluate the solutions' efficiency, we have developed a custom-tailored benchmark suite. We have analyzed the obtained experimental results regarding: the comparison of the execution time and bandwidth when using graphics processing units covering the main CUDA architectures (Tesla GT200, Fermi GF100, Kepler GK104) and a central processing unit; the influence of the data type; and the influence of the binary operator.
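    For reference, the shared-memory tree reduction that such optimization work typically starts from looks as follows. This is the widely known pattern (sequential addressing plus a first add during the global load), sketched here for illustration rather than the paper's final tuned version.

    ```cuda
    // Each block reduces 2*blockDim.x input elements in shared memory and
    // writes one partial sum; a second launch (or a host loop) combines
    // the per-block results. Launch with blockDim.x * sizeof(float)
    // bytes of dynamic shared memory.
    __global__ void reduceSum(const float *in, float *out, unsigned int n)
    {
        extern __shared__ float s[];
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x * (blockDim.x * 2) + tid;

        float v = (i < n) ? in[i] : 0.0f;              // first add on load
        if (i + blockDim.x < n) v += in[i + blockDim.x];
        s[tid] = v;
        __syncthreads();

        // Sequential addressing: no shared-memory bank conflicts.
        for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride) s[tid] += s[tid + stride];
            __syncthreads();
        }
        if (tid == 0) out[blockIdx.x] = s[0];
    }
    ```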

  6. Second unit scheduling concerns on a dual-unit nuclear project

    International Nuclear Information System (INIS)

    Block, H.R.; Mazzini, R.A.

    1978-01-01

    This paper explores the planning and scheduling problems of Unit 2 of the Susquehanna steam electric station. The causes of these problems and methods to avoid or mitigate their consequences are discussed. The Susquehanna steam electric station has two boiling water reactors rated at 1,100 MW each. Topics considered include cost factors, structures, equipment, engineering and home office, construction services, completion date phasing, work sequencing, structural dependences, and segregation. Substantial cost and schedule benefits can result if two nuclear units are designed and constructed as one integral station, and if maximum sharing of facilities and services between the units occurs. It is concluded that the cost benefits of highly integrated dual-unit construction outweigh the schedule and logistical problems caused by that approach.

  7. ECC2K-130 on NVIDIA GPUs

    NARCIS (Netherlands)

    Bernstein, D.J.; Chen, H.-C.; Cheng, C.M.; Lange, T.; Niederhagen, R.F.; Schwabe, P.; Yang, B.Y.

    2012-01-01

    [Updated version of paper at Indocrypt 2010] A major cryptanalytic computation is currently underway on multiple platforms, including standard CPUs, FPGAs, PlayStations and GPUs, to break the Certicom ECC2K-130 challenge. This challenge is to compute an elliptic-curve discrete logarithm on a Koblitz curve.

  8. Near-global climate simulation at 1 km resolution: establishing a performance baseline on 4888 GPUs with COSMO 5.0

    Directory of Open Access Journals (Sweden)

    O. Fuhrer

    2018-05-01

    The best hope for reducing long-standing global climate model biases is by increasing resolution to the kilometer scale. Here we present results from an ultrahigh-resolution non-hydrostatic climate model for a near-global setup running on the full Piz Daint supercomputer on 4888 GPUs (graphics processing units). The dynamical core of the model has been completely rewritten using a domain-specific language (DSL) for performance portability across different hardware architectures. Physical parameterizations and diagnostics have been ported using compiler directives. To our knowledge this represents the first complete atmospheric model being run entirely on accelerators at this scale. At a grid spacing of 930 m (1.9 km), we achieve a simulation throughput of 0.043 (0.23) simulated years per day and an energy consumption of 596 MWh per simulated year. Furthermore, we propose a new memory usage efficiency (MUE) metric that considers how efficiently the memory bandwidth – the dominant bottleneck of climate codes – is being used.

  9. Fast in-memory elastic full-waveform inversion using consumer-grade GPUs

    Science.gov (United States)

    Sivertsen Bergslid, Tore; Birger Raknes, Espen; Arntsen, Børge

    2017-04-01

    Full-waveform inversion (FWI) is a technique to estimate subsurface properties by using the recorded waveform produced by a seismic source and applying inverse theory. This is done through an iterative optimization procedure, where each iteration requires solving the wave equation many times and then trying to minimize the difference between the modeled and the measured seismic data. Having to model many of these seismic sources per iteration means that this is a highly computationally demanding procedure, which usually involves writing a lot of data to disk. We have written code that does forward modeling and inversion entirely in memory. A typical HPC cluster has many more CPUs than GPUs. Since FWI involves modeling many seismic sources per iteration, the obvious approach is to parallelize the code on a source-by-source basis, where each core of the CPU performs one modeling, and to do all modelings simultaneously. With this approach, the GPU is already at a major disadvantage in pure numbers. Fortunately, GPUs can more than make up for this hardware disadvantage by performing each modeling much faster than a CPU. Another benefit of parallelizing each individual modeling is that it lets each modeling use a lot more RAM. If one node has 128 GB of RAM and 20 CPU cores, each modeling can use only 6.4 GB of RAM if one is running the node at full capacity with source-by-source parallelization on the CPU. A parallelized per-source code using GPUs can use 64 GB of RAM per modeling. Whenever a modeling uses more RAM than is available and has to start using regular disk space, the runtime increases dramatically due to slow file I/O. The extremely high computational speed of the GPUs combined with the large amount of RAM available for each modeling lets us do high-frequency FWI for fairly large models very quickly. For a single modeling, our GPU code outperforms the single-threaded CPU code by a factor of about 75. Successful inversions have been run on data with frequencies up to 40

  10. GPUs for real-time processing in HEP trigger systems (ACAT2013: 15. international workshop on advanced computing and analysis techniques in physics research)

    Energy Technology Data Exchange (ETDEWEB)

    Ammendola, R; Biagioni, A; Frezza, O; Cicero, F Lo; Lonardo, A; Messina, A; Paolucci, PS; Rossetti, D; Simula, F; Tosoratto, L; Vicini, P [INFN Roma, P.le A. Moro, 2, 00185 Roma (Italy)]; Deri, L; Sozzi, M; Pantaleo, F [Pisa University, Largo B. Pontecorvo, 3, 56127 Pisa (Italy)]; Fiorini, M [Ferrara University, Via Saragat, 1, 44122 Ferrara (Italy)]; Lamanna, G [INFN Pisa, Largo B. Pontecorvo, 3, 56127 Pisa (Italy)]; Collaboration: GAP Collaboration

    2014-06-06

    We describe a pilot project (GAP – GPU Application Project) for the use of GPUs (Graphics processing units) for online triggering applications in High Energy Physics experiments. Two major trends can be identified in the development of trigger and DAQ systems for particle physics experiments: the massive use of general-purpose commodity systems such as commercial multicore PC farms for data acquisition, and the reduction of trigger levels implemented in hardware, towards a fully software data selection system (trigger-less). The innovative approach presented here aims at exploiting the parallel computing power of commercial GPUs to perform fast computations in software, not only at high trigger levels but also in early trigger stages. General-purpose computing on GPUs is emerging as a new paradigm in several fields of science, although so far applications have been tailored to the specific strengths of such devices as accelerators in offline computation. With the steady reduction of GPU latencies, and the increase in link and memory throughputs, the use of such devices for real-time applications in high energy physics data acquisition and trigger systems is becoming relevant. We discuss in detail the use of online parallel computing on GPUs for synchronous low-level triggers with fixed latency. In particular we show preliminary results from a first test in the CERN NA62 experiment. The use of GPUs in high-level triggers is also considered, the CERN ATLAS experiment being taken as a case study of possible applications.

  11. TESLA GPUs versus MPI with OpenMP for the Forward Modeling of Gravity and Gravity Gradient of Large Prisms Ensemble

    Directory of Open Access Journals (Sweden)

    Carlos Couder-Castañeda

    2013-01-01

    An implementation with the CUDA technology on a single and on several graphics processing units (GPUs) is presented for the calculation of the forward modeling of gravitational fields from a three-dimensional volumetric ensemble composed of unitary prisms of constant density. We compared the performance results obtained with the GPUs against a previous version coded in OpenMP with MPI, and we analyzed the results on both platforms. Today, the use of GPUs represents a breakthrough in parallel computing, which has led to the development of applications in a variety of fields. Nevertheless, in some applications the decomposition of the tasks is not trivial, as can be appreciated in this paper. Unlike a trivial decomposition of the domain, we proposed to decompose the problem by sets of prisms and use different memory spaces per processing CUDA core, avoiding the performance decay that would result from the constant kernel calls needed in a parallelization by observation points. The design and implementation created are the main contributions of this work, because the parallelization scheme implemented is not trivial. The performance results obtained are comparable to those of a small processing cluster.

  12. Communication: A reduced scaling J-engine based reformulation of SOS-MP2 using graphics processing units

    Energy Technology Data Exchange (ETDEWEB)

    Maurer, S. A.; Kussmann, J.; Ochsenfeld, C., E-mail: Christian.Ochsenfeld@cup.uni-muenchen.de [Chair of Theoretical Chemistry, Department of Chemistry, University of Munich (LMU), Butenandtstr. 7, D-81377 München (Germany); Center for Integrated Protein Science (CIPSM) at the Department of Chemistry, University of Munich (LMU), Butenandtstr. 5–13, D-81377 München (Germany)

    2014-08-07

    We present a low-prefactor, cubically scaling scaled-opposite-spin second-order Møller-Plesset perturbation theory (SOS-MP2) method which is highly suitable for massively parallel architectures like graphics processing units (GPUs). The scaling is reduced from O(N⁵) to O(N³) by a reformulation of the MP2 expression in the atomic orbital basis via Laplace transformation and the resolution-of-the-identity (RI) approximation of the integrals, in combination with efficient sparse algebra for the 3-center integral transformation. In contrast to previous works that employ GPUs for post-Hartree-Fock calculations, we do not simply employ GPU-based linear algebra libraries to accelerate the conventional algorithm. Instead, our reformulation allows us to replace the rate-determining contraction step with a modified J-engine algorithm, which has been proven to be highly efficient on GPUs. Thus, our SOS-MP2 scheme enables us to treat large molecular systems in an accurate and efficient manner on a single GPU server.

  13. Communication: A reduced scaling J-engine based reformulation of SOS-MP2 using graphics processing units.

    Science.gov (United States)

    Maurer, S A; Kussmann, J; Ochsenfeld, C

    2014-08-07

    We present a low-prefactor, cubically scaling scaled-opposite-spin second-order Møller-Plesset perturbation theory (SOS-MP2) method which is highly suitable for massively parallel architectures like graphics processing units (GPUs). The scaling is reduced from O(N⁵) to O(N³) by a reformulation of the MP2 expression in the atomic orbital basis via Laplace transformation and the resolution-of-the-identity (RI) approximation of the integrals, in combination with efficient sparse algebra for the 3-center integral transformation. In contrast to previous works that employ GPUs for post-Hartree-Fock calculations, we do not simply employ GPU-based linear algebra libraries to accelerate the conventional algorithm. Instead, our reformulation allows us to replace the rate-determining contraction step with a modified J-engine algorithm, which has been proven to be highly efficient on GPUs. Thus, our SOS-MP2 scheme enables us to treat large molecular systems in an accurate and efficient manner on a single GPU server.
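    The scaling reduction rests on the standard Laplace-transform trick for MP2: the orbital-energy denominator is replaced by an exponential integral, approximated by a short quadrature, which decouples the occupied and virtual indices so the exponentials can be folded into the atomic-orbital quantities:

    ```latex
    \frac{1}{\varepsilon_a + \varepsilon_b - \varepsilon_i - \varepsilon_j}
      = \int_0^{\infty}
          e^{-(\varepsilon_a + \varepsilon_b - \varepsilon_i - \varepsilon_j)\,t}
          \,\mathrm{d}t
      \approx \sum_{\alpha=1}^{\tau}
          w_\alpha\,
          e^{-(\varepsilon_a + \varepsilon_b - \varepsilon_i - \varepsilon_j)\,t_\alpha}
    ```

    The integral converges because the denominator is positive for a finite HOMO-LUMO gap; only a handful of quadrature points (t_α, w_α) are typically needed.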

  14. Efficient Synchronization Primitives for GPUs

    OpenAIRE

    Stuart, Jeff A.; Owens, John D.

    2011-01-01

    In this paper, we revisit the design of synchronization primitives---specifically barriers, mutexes, and semaphores---and how they apply to the GPU. Previous implementations are insufficient due to the discrepancies in hardware and programming model of the GPU and CPU. We create new implementations in CUDA and analyze the performance of spinning on the GPU, as well as a method of sleeping on the GPU, by running a set of memory-system benchmarks on two of the most common GPUs in use, the Tesla...

  15. Performance comparison of OpenCL and CUDA by benchmarking an optimized perspective backprojection

    Energy Technology Data Exchange (ETDEWEB)

    Swall, Stefan; Ritschl, Ludwig; Knaup, Michael; Kachelriess, Marc [Erlangen-Nuernberg Univ., Erlangen (Germany). Inst. of Medical Physics (IMP)

    2011-07-01

    The increase in performance of Graphical Processing Units (GPUs) and the onward development of dedicated software tools within the last decade allow performance-demanding computations to be transferred from the Central Processing Unit (CPU) to the GPU and certain tasks to be sped up by utilizing the massively parallel architecture of these devices. The Compute Unified Device Architecture (CUDA) developed by NVIDIA provides an easy and effective way to develop applications that target NVIDIA GPUs. It has become one of the cardinal software tools for this purpose. Recently the Open Computing Language (OpenCL) became available, which is neither vendor-specific nor limited to GPUs only. As the benefits of CUDA-based image reconstruction are well known, we aim at providing a comparison of the performance that can be achieved with CUDA and with OpenCL by benchmarking the time required to perform a simple but computationally demanding task: the perspective backprojection. (orig.)

  16. GPUs for fast pattern matching in the RICH of the NA62 experiment

    CERN Document Server

    Lamanna, G; Sozzi, M

    2011-01-01

    In rare decay experiments an effective online selection is a fundamental part of the data acquisition system (DAQ), in order to reduce both the quantity of data written on tape and the bandwidth requirements for the DAQ system. A multilevel architecture is commonly used to achieve a higher reduction factor, exploiting dedicated custom hardware and flexible software in standard computers. In this paper we discuss the possibility of using commercial video card processors (GPUs) to build a fast and effective trigger system, at both hardware and software level. The computing power of GPUs allows the design of a real-time system in which trigger decisions are taken directly in the video processor with a defined maximum latency. This allows building the lowest trigger levels on standard off-the-shelf PCs with CPU and GPU (instead of the commonly adopted solutions based on custom electronics with FPGAs or ASICs) with enhanced and high-performance computation capabilities, resulting in high rejection power, high effici...

  17. N-body simulation for self-gravitating collisional systems with a new SIMD instruction set extension to the x86 architecture, Advanced Vector eXtensions

    Science.gov (United States)

    Tanikawa, Ataru; Yoshikawa, Kohji; Okamoto, Takashi; Nitadori, Keigo

    2012-02-01

    We present a high-performance N-body code for self-gravitating collisional systems accelerated with the aid of a new SIMD instruction set extension of the x86 architecture: Advanced Vector eXtensions (AVX), an enhanced version of the Streaming SIMD Extensions (SSE). With one processor core of an Intel Core i7-2600 processor (8 MB cache and 3.40 GHz) based on the Sandy Bridge micro-architecture, we implemented a fourth-order Hermite scheme with an individual timestep scheme (Makino and Aarseth, 1992), and achieved a performance of ~20 giga floating-point operations per second (GFLOPS) for double-precision accuracy, which is two times and five times higher than that of the previously developed code implemented with the SSE instructions (Nitadori et al., 2006b), and that of a code implemented without any explicit use of SIMD instructions with the same processor core, respectively. We have parallelized the code by using the so-called NINJA scheme (Nitadori et al., 2006a), and achieved ~90 GFLOPS for a system containing more than N = 8192 particles with 8 MPI processes on four cores. We expect to achieve about 10 tera FLOPS (TFLOPS) for a self-gravitating collisional system with N ~ 10⁵ on massively parallel systems with at most 800 cores with the Sandy Bridge micro-architecture. This performance will be comparable to that of Graphics Processing Unit (GPU) cluster systems, such as one with about 200 Tesla C1070 GPUs (Spurzem et al., 2010). This paper offers an alternative to collisional N-body simulations with GRAPEs and GPUs.
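    To make the AVX usage concrete: each 256-bit register holds four doubles, so one particle's interaction with four neighbours can be evaluated per instruction stream. The snippet below is an illustrative softened-gravity accumulation (x component only) written for this record, not the authors' Hermite-scheme code; all names are placeholders.

    ```cpp
    #include <immintrin.h>

    // Accumulate the x-acceleration on particle i from particles j..j+3.
    // i-data and eps2 are broadcast across lanes; j-data are packed.
    static inline __m256d acc_ax(__m256d xi, __m256d yi, __m256d zi,
                                 __m256d xj, __m256d yj, __m256d zj,
                                 __m256d mj, __m256d eps2, __m256d ax)
    {
        __m256d dx = _mm256_sub_pd(xj, xi);
        __m256d dy = _mm256_sub_pd(yj, yi);
        __m256d dz = _mm256_sub_pd(zj, zi);
        __m256d r2 = _mm256_add_pd(eps2,
                     _mm256_add_pd(_mm256_mul_pd(dx, dx),
                     _mm256_add_pd(_mm256_mul_pd(dy, dy),
                                   _mm256_mul_pd(dz, dz))));
        // m_j / r^3, with Plummer softening eps2.
        __m256d mInvR3 = _mm256_div_pd(mj,
                         _mm256_mul_pd(r2, _mm256_sqrt_pd(r2)));
        return _mm256_add_pd(ax, _mm256_mul_pd(mInvR3, dx));
    }
    ```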

  18. Energy Efficient Smartphones: Minimizing the Energy Consumption of Smartphone GPUs using DVFS Governors

    KAUST Repository

    Ahmad, Enas M.

    2013-05-15

    Modern smartphones are being designed with increasing processing power, memory capacity, network communication, and graphics performance. Although all of these features are enriching and expanding the experience of a smartphone user, they add significant overhead on the limited energy of the battery. This thesis aims at enhancing the energy efficiency of modern smartphones and increasing their battery life by minimizing the energy consumption of the smartphone's graphics processing unit (GPU). Smartphone operating systems are becoming fully hardware-accelerated, which implies relying on the GPU power for rendering all application graphics. In addition, the GPUs installed in smartphones are becoming more and more powerful by the day. This raises an energy consumption concern. We present a novel implementation of GPU scaling governors, a Dynamic Voltage and Frequency Scaling (DVFS) scheme implemented in the Android kernel to dynamically scale the GPU. The scheme includes four main governors: Performance, Powersave, Ondemand, and Conservative. Unlike previous studies, which looked into the power efficiency of mobile GPUs only through simulation and power estimations, we have implemented our approach on a real modern smartphone GPU and acquired actual energy measurements using an external power monitor. Our results show that the energy consumption of smartphones can be reduced by up to 15% using the Conservative governor in 2D rendering mode, and up to 9% in 3D rendering mode, with minimal effect on performance.

  19. Efficient Machine Learning Approach for Optimizing Scientific Computing Applications on Emerging HPC Architectures

    Energy Technology Data Exchange (ETDEWEB)

    Arumugam, Kamesh [Old Dominion Univ., Norfolk, VA (United States)

    2017-05-01

    the parallel implementation challenges of such irregular applications on different HPC architectures. In particular, we use supervised learning to predict the computation structure and use it to address the control-flow and memory access irregularities in the parallel implementation of such applications on GPUs, Xeon Phis, and heterogeneous architectures composed of multi-core CPUs with GPUs or Xeon Phis. We use numerical simulation of charged particle beam dynamics as a motivating example throughout the dissertation to present our new approach, though it should be equally applicable to a wide range of irregular applications. The machine learning approach presented here uses predictive analytics and forecasting techniques to adaptively model and track the irregular memory access pattern at each time step of the simulation and to anticipate the future memory access pattern. Access pattern forecasts can then be used to make optimization decisions during application execution, improving the performance of the application at a future time step based on observations from earlier time steps. In heterogeneous architectures, forecasts can also be used to improve the memory performance and resource utilization of all the processing units to deliver good aggregate performance. We used these optimization techniques and the anticipation strategy to design a cache-aware, memory-efficient parallel algorithm to address the irregularities in the parallel implementation of charged particle beam dynamics simulation on different HPC architectures. Experimental results using a diverse mix of HPC architectures show that our approach of using an anticipation strategy is effective in maximizing data reuse, ensuring workload balance, minimizing branch and memory divergence, and improving resource utilization.

  20. A Triply Selective MIMO Channel Simulator Using GPUs

    Directory of Open Access Journals (Sweden)

    R. Carrasco-Alvarez

    2018-01-01

    A methodology for implementing a triply selective multiple-input multiple-output (MIMO) channel simulator based on graphics processing units (GPUs) is presented. The resulting simulator is based on the implementation of multiple doubly selective single-input single-output (SISO) channel generators, where the multiple inputs and the multiple received signals have been transformed in order to supply the corresponding space correlation of the channel under consideration. A direct consequence of this approach is the flexibility provided, which allows different propagation statistics to be specified for each SISO channel and thus more complex environments to be replicated. It is shown that under some specific constraints, the statistics of the triply selective MIMO simulator are the same as those reported in the state of the art. Simulation results show the computational time improvement achieved, up to 650-fold for an 8 × 8 MIMO channel simulator when compared with sequential implementations. In addition to the computational improvement, the proposed simulator offers flexibility for testing a variety of scenarios in vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) systems.

  1. A New Data Layout For Set Intersection on GPUs

    DEFF Research Database (Denmark)

    Amossen, Rasmus Resen; Pagh, Rasmus

    2011-01-01

    However, GPUs require highly regular control flow and memory access patterns, and for this reason previous GPU methods for intersecting sets have used a simple bitmap representation. This representation requires excessive space on sparse data sets. In this paper we present a novel data layout, BATMAP...

  2. Decryption-decompression of AES protected ZIP files on GPUs

    Science.gov (United States)

    Duong, Tan Nhat; Pham, Phong Hong; Nguyen, Duc Huu; Nguyen, Thuy Thanh; Le, Hung Duc

    2011-10-01

    AES is a strong encryption system, so decryption-decompression of AES-encrypted ZIP files requires very large computing power and techniques for reducing the password space. This makes implementations of such techniques on common computing systems impractical. In [1], we reduced the original very large password search space to a much smaller one that surely contains the correct password. Based on this reduced set of passwords, in this paper we parallelize decryption, decompression and plaintext recognition for encrypted ZIP files by using CUDA computing technology on NVIDIA GeForce GTX 295 graphics cards to find the correct password. The experimental results have shown that the speed of decrypting, decompressing, recognizing plaintext and finding the original password increases by a factor of about 45 to 180 (depending on the number of GPUs) compared to sequential execution on an Intel Core 2 Quad Q8400 at 2.66 GHz. These results demonstrate the potential applicability of GPUs in this cryptanalysis field.

  3. Efficient Numeric and Geometric Computations using Heterogeneous Shared Memory Architectures

    Science.gov (United States)

    2017-10-04

    to the memory architectures of CPUs and GPUs to obtain good performance, achieving good memory performance through cache management. The PI and students have developed new methods for path and ray tracing. The efficiency of our method makes it a good candidate for forming hybrid schemes with wave-based models. One possibility is to couple the ray curve...

  4. Batched Tile Low-Rank GEMM on GPUs

    KAUST Repository

    Charara, Ali

    2018-02-01

    Dense General Matrix-Matrix (GEMM) multiplication is a core operation of the Basic Linear Algebra Subroutines (BLAS) library and therefore often resides at the bottom of the traditional software stack for most scientific applications. In fact, chip manufacturers give special attention to the GEMM kernel implementation, since this is exactly where most high-performance software libraries extract the hardware performance. With the emergence of big data applications involving large data-sparse, hierarchically low-rank matrices, the off-diagonal tiles can be compressed to reduce the algorithmic complexity and the memory footprint. The resulting tile low-rank (TLR) data format is composed of small data structures, which retain the most significant information for each tile. However, to operate on low-rank tiles, a new GEMM operation and its corresponding API have to be designed on GPUs so that they can exploit the data sparsity structure of the matrix while leveraging the underlying TLR compression format. The main idea consists in aggregating all operations onto a single kernel launch to compensate for their low arithmetic intensities and to mitigate the data transfer overhead on GPUs. The new TLR GEMM kernel outperforms the cuBLAS dense batched GEMM by more than an order of magnitude and creates new opportunities for advanced TLR algorithms.
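    For contrast with the TLR kernel, the dense batched baseline from cuBLAS that the record compares against is invoked roughly as follows (device arrays of device pointers, one tile per batch entry); the wrapper and parameter names are illustrative.

    ```cuda
    #include <cublas_v2.h>

    // Dense batched C = A * B over `batch` independent m x k / k x n tiles.
    // dA, dB, dC are device arrays of device pointers, one per tile,
    // stored column-major as cuBLAS expects.
    void denseBatchedGemm(cublasHandle_t handle, int m, int n, int k,
                          const double **dA, const double **dB, double **dC,
                          int batch)
    {
        const double alpha = 1.0, beta = 0.0;
        cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                           &alpha, dA, m,   // lda = m
                                   dB, k,   // ldb = k
                           &beta,  dC, m,   // ldc = m
                           batch);
    }
    ```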

  5. Instruction Set Architectures for Quantum Processing Units

    OpenAIRE

    Britt, Keith A.; Humble, Travis S.

    2017-01-01

    Progress in quantum computing hardware raises questions about how these devices can be controlled, programmed, and integrated with existing computational workflows. We briefly describe several prominent quantum computational models, their associated quantum processing units (QPUs), and the adoption of these devices as accelerators within high-performance computing systems. Emphasizing the interface to the QPU, we analyze instruction set architectures based on reduced and complex instruction s...

  6. Heterogeneous computing architecture for fast detection of SNP-SNP interactions.

    Science.gov (United States)

    Sluga, Davor; Curk, Tomaz; Zupan, Blaz; Lotric, Uros

    2014-06-25

    The extent of data in a typical genome-wide association study (GWAS) poses considerable computational challenges to software tools for gene-gene interaction discovery. Exhaustive evaluation of all interactions among hundreds of thousands to millions of single nucleotide polymorphisms (SNPs) may require weeks or even months of computation. Massively parallel hardware within a modern Graphic Processing Unit (GPU) and Many Integrated Core (MIC) coprocessors can shorten the run time considerably. While the utility of GPU-based implementations in bioinformatics has been well studied, the MIC architecture has been introduced only recently and may provide a number of comparative advantages that have yet to be explored and tested. We have developed a heterogeneous, GPU- and Intel MIC-accelerated software module for SNP-SNP interaction discovery to replace the previously single-threaded computational core in the interactive web-based data exploration program SNPsyn. We report on differences between these two modern massively parallel architectures and their software environments. Their use resulted in an order of magnitude shorter execution times when compared to the single-threaded CPU implementation. The GPU implementation on a single Nvidia Tesla K20 runs twice as fast as that for the MIC architecture-based Xeon Phi P5110 coprocessor, but also requires considerably more programming effort. General purpose GPUs are a mature platform with large amounts of computing power capable of tackling inherently parallel problems, but can prove demanding for the programmer. On the other hand, the new MIC architecture, albeit lacking in performance, reduces the programming effort and makes up for it with a more general architecture suitable for a wider range of problems.

  7. Graph Processing on GPUs: A Survey

    DEFF Research Database (Denmark)

    Shi, Xuanhua; Zheng, Zhigao; Zhou, Yongluan

    2018-01-01

    hundreds of billions, has attracted much attention in both industry and academia. It still remains a great challenge to process such large-scale graphs. Researchers have been seeking new possible solutions. Because of the massive degree of parallelism and the high memory access bandwidth of GPUs, utilizing GPUs to accelerate graph processing proves to be a promising solution. This article surveys the key issues of graph processing on GPUs, including data layout, memory access pattern, workload mapping, and specific GPU programming. In this article, we summarize the state-of-the-art research on GPU...

  8. Green smartphone GPUs: Optimizing energy consumption using GPUFreq scaling governors

    KAUST Repository

    Ahmad, Enas M.; Shihada, Basem

    2015-01-01

    and alternatives in controlling the power consumption and performance of their GPUs. We implemented and evaluated our model on a smartphone GPU and measured the energy performance using an external power monitor. The results show that the energy consumption

  9. Open problems in CEM: Porting an explicit time-domain volume-integral- equation solver on GPUs with OpenACC

    KAUST Repository

    Ergül, Özgür

    2014-04-01

    Graphics processing units (GPUs) are gradually becoming mainstream in high-performance computing, as their capability to enhance the performance of a large spectrum of scientific applications many-fold compared to multi-core CPUs has been clearly identified and proven. In this paper, implementation and performance-tuning details for porting an explicit marching-on-in-time (MOT)-based time-domain volume-integral-equation (TDVIE) solver onto GPUs are described. To this end, a high-level approach, utilizing the OpenACC directive-based parallel programming model, is used to minimize two often-faced challenges in GPU programming: developer productivity and code portability. The MOT-TDVIE solver code, originally developed for CPUs, is annotated with compiler directives to port it to GPUs in a fashion similar to how OpenMP targets multi-core CPUs. In contrast to CUDA and OpenCL, where significant modifications to CPU-based codes are required, this high-level approach requires minimal changes to the code. In this work, we make use of two available OpenACC compilers, CAPS and PGI. Our experience reveals that different annotations of the code are required for each of the compilers, due to different interpretations of the fairly new standard by the compiler developers. Both versions of the OpenACC-accelerated code achieved significant performance improvements, with up to 30× speedup over the sequential CPU code on recent hardware technology. Moreover, we demonstrated that the GPU-accelerated fully explicit MOT-TDVIE solver delivered energy-consumption gains of the order of 3× against its CPU counterpart. © 2014 IEEE.
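
    For flavor, the sketch below shows the directive style the paper describes, applied to a generic C/C++ loop; the function, array names, and data clauses are illustrative assumptions, not code from the MOT-TDVIE solver.

    ```cuda
    // Illustrative directive-based port of a generic loop (not the MOT-TDVIE
    // solver). A compiler without OpenACC support simply ignores the pragma,
    // which is why the CPU code base can remain essentially unchanged.
    void scale_field(int n, const float* __restrict__ src,
                     float* __restrict__ dst, float alpha)
    {
        #pragma acc parallel loop copyin(src[0:n]) copyout(dst[0:n])
        for (int i = 0; i < n; ++i)
            dst[i] = alpha * src[i];  // the OpenACC compiler maps this loop to a GPU kernel
    }
    ```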

  10. Development of Desktop Computing Applications and Engineering Tools on GPUs

    DEFF Research Database (Denmark)

    Sørensen, Hans Henrik Brandenborg; Glimberg, Stefan Lemvig; Hansen, Toke Jansen

    (GPUs) for high-performance computing applications and software tools in science and engineering, inverse problems, visualization, imaging, dynamic optimization. The goals are to contribute to the development of new state-of-the-art mathematical models and algorithms for maximum throughput performance...

  11. Cognitive Architectures for Multimedia Learning

    Science.gov (United States)

    Reed, Stephen K.

    2006-01-01

    This article provides a tutorial overview of cognitive architectures that can form a theoretical foundation for designing multimedia instruction. Cognitive architectures include a description of memory stores, memory codes, and cognitive operations. Architectures that are relevant to multimedia learning include Paivio's dual coding theory,…

  12. Nonword Reading: Comparing Dual-Route Cascaded and Connectionist Dual-Process Models with Human Data

    Science.gov (United States)

    Pritchard, Stephen C.; Coltheart, Max; Palethorpe, Sallyanne; Castles, Anne

    2012-01-01

    Two prominent dual-route computational models of reading aloud are the dual-route cascaded (DRC) model, and the connectionist dual-process plus (CDP+) model. While sharing similarly designed lexical routes, the two models differ greatly in their respective nonlexical route architecture, such that they often differ on nonword pronunciation. Neither…

  13. Performance Analysis of FEM Algorithms on GPU and Many-Core Architectures

    KAUST Repository

    Khurram, Rooh; Kortas, Samuel

    2015-01-01

    -only Exascale systems will be unsustainable, thus accelerators such as graphic processing units (GPUs) and many-integrated-core (MIC) will likely be the integral part of the TOP500 (http://www.top500.org/) supercomputers, beyond 2020. The emerging supercomputer

  14. Porting plasma physics simulation codes to modern computing architectures using the libmrc framework

    Science.gov (United States)

    Germaschewski, Kai; Abbott, Stephen

    2015-11-01

    Available computing power has continued to grow exponentially even after single-core performance saturated in the last decade. The increase has since been driven by more parallelism, both using more cores and having more parallelism in each core, e.g. in GPUs and Intel Xeon Phi. Adapting existing plasma physics codes is challenging, in particular as there is no single programming model that covers current and future architectures. We will introduce the open-source libmrc framework that has been used to modularize and port three plasma physics codes: the extended MHD code MRCv3 with implicit time integration and curvilinear grids; the OpenGGCM global magnetosphere model; and the particle-in-cell code PSC. libmrc consolidates basic functionality needed for simulations based on structured grids (I/O, load balancing, time integrators), and also introduces a parallel object model that makes it possible to maintain multiple implementations of computational kernels, on e.g. conventional processors and GPUs. It handles data layout conversions and enables us to port performance-critical parts of a code to a new architecture step-by-step, while the rest of the code can remain unchanged. We will show examples of the performance gains and some physics applications.

  15. Multi–GPU Implementation of Machine Learning Algorithm using CUDA and OpenCL

    Directory of Open Access Journals (Sweden)

    Jan Masek

    2016-06-01

    Full Text Available Using modern Graphic Processing Units (GPUs) has become very useful for computing complex and time-consuming processes. GPUs provide high-performance computation capabilities at a good price. This paper deals with multi-GPU OpenCL and CUDA implementations of the k-Nearest Neighbor (k-NN) algorithm. This work compares the performance of the OpenCL and CUDA implementations, each of which is suitable for a different number of used attributes. The proposed CUDA algorithm achieves an acceleration of up to 880x in comparison with a single-threaded CPU version. The common k-NN was modified to be faster when a lower number of k neighbors is set. The performance of the algorithm was verified with two dual-GPU NVIDIA GeForce GTX 690 cards and a CPU Intel Core i7 3770 running at 4.1 GHz. The speedup results were measured for one, two, three, and four GPUs. We performed several tests with data sets containing up to 4 million elements with various numbers of attributes.
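
    A minimal sketch of the distance stage of such a GPU k-NN is given below (an assumption about the general shape, not the paper's kernel); selecting the k smallest distances would follow as a separate step.

    ```cuda
    // Each thread computes the squared Euclidean distance from one training
    // point to the query. Names and row-major layout are illustrative.
    __global__ void knn_distances(const float* train, // [nPoints x nDims]
                                  const float* query, // [nDims]
                                  float* dist, int nPoints, int nDims)
    {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p >= nPoints) return;
        float acc = 0.0f;
        for (int d = 0; d < nDims; ++d) {
            float diff = train[p * nDims + d] - query[d];
            acc += diff * diff;
        }
        dist[p] = acc;  // the k smallest entries identify the k nearest neighbors
    }
    ```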

  16. A Monte Carlo neutron transport code for eigenvalue calculations on a dual-GPU system and CUDA environment

    Energy Technology Data Exchange (ETDEWEB)

    Liu, T.; Ding, A.; Ji, W.; Xu, X. G. [Nuclear Engineering and Engineering Physics, Rensselaer Polytechnic Inst., Troy, NY 12180 (United States); Carothers, C. D. [Dept. of Computer Science, Rensselaer Polytechnic Inst. RPI (United States); Brown, F. B. [Los Alamos National Laboratory (LANL) (United States)

    2012-07-01

    The Monte Carlo (MC) method is able to accurately calculate eigenvalues in reactor analysis. Its lengthy computation time can be reduced by general-purpose computing on Graphics Processing Units (GPU), one of the latest parallel computing techniques under development. Porting a regular transport code to the GPU is usually very straightforward due to the 'embarrassingly parallel' nature of MC code. However, the situation is different for eigenvalue calculation, in that it is performed on a generation-by-generation basis and the thread coordination must be explicitly taken care of. This paper presents our effort to develop such a GPU-based MC code in the Compute Unified Device Architecture (CUDA) environment. The code is able to perform eigenvalue calculations for simple geometries on a multi-GPU system. The specifics of the algorithm design, including thread organization and memory management, are described in detail. The original CPU version of the code was tested on an Intel Xeon X5660 2.8 GHz CPU, and the adapted GPU version was tested on NVIDIA Tesla M2090 GPUs. Double-precision floating point format was used throughout the calculation. The results showed that speedups of 7.0 and 33.3 were obtained for a bare spherical core and a binary slab system, respectively. The speedup factor was further increased by a factor of ∼2 on a dual-GPU system. The upper limit of device-level parallelism was analyzed, and a possible method to enhance the thread-level parallelism was proposed. (authors)
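
    The generation-by-generation structure described above might be organized on the host as in the following sketch; the kernel is a placeholder for the paper's transport kernel, and the loop shape is an assumption, not the authors' code.

    ```cuda
    #include <cuda_runtime.h>

    // Placeholder standing in for the actual transport kernel: it would
    // track one generation of neutrons and bank fission sites.
    __global__ void transport_generation() { /* transport and banking elided */ }

    // Unlike a fixed-source run, each generation must finish, and its
    // fission bank be consolidated, before the next generation can start.
    void power_iteration(int nGenerations, int nBlocks, int nThreads)
    {
        for (int g = 0; g < nGenerations; ++g) {
            transport_generation<<<nBlocks, nThreads>>>();
            cudaDeviceSynchronize();  // explicit generation barrier
            // ...consolidate fission bank, normalize source, update k-eff...
        }
    }
    ```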

  17. A Monte Carlo neutron transport code for eigenvalue calculations on a dual-GPU system and CUDA environment

    International Nuclear Information System (INIS)

    Liu, T.; Ding, A.; Ji, W.; Xu, X. G.; Carothers, C. D.; Brown, F. B.

    2012-01-01

    The Monte Carlo (MC) method is able to accurately calculate eigenvalues in reactor analysis. Its lengthy computation time can be reduced by general-purpose computing on Graphics Processing Units (GPU), one of the latest parallel computing techniques under development. Porting a regular transport code to the GPU is usually very straightforward due to the 'embarrassingly parallel' nature of MC code. However, the situation is different for eigenvalue calculation, in that it is performed on a generation-by-generation basis and the thread coordination must be explicitly taken care of. This paper presents our effort to develop such a GPU-based MC code in the Compute Unified Device Architecture (CUDA) environment. The code is able to perform eigenvalue calculations for simple geometries on a multi-GPU system. The specifics of the algorithm design, including thread organization and memory management, are described in detail. The original CPU version of the code was tested on an Intel Xeon X5660 2.8 GHz CPU, and the adapted GPU version was tested on NVIDIA Tesla M2090 GPUs. Double-precision floating point format was used throughout the calculation. The results showed that speedups of 7.0 and 33.3 were obtained for a bare spherical core and a binary slab system, respectively. The speedup factor was further increased by a factor of ∼2 on a dual-GPU system. The upper limit of device-level parallelism was analyzed, and a possible method to enhance the thread-level parallelism was proposed. (authors)

  18. GPU-computing in econophysics and statistical physics

    Science.gov (United States)

    Preis, T.

    2011-03-01

    A recent trend in computer science and related fields is general purpose computing on graphics processing units (GPUs), which can yield impressive performance. With multiple cores connected by high memory bandwidth, today's GPUs offer resources for non-graphics parallel processing. This article provides a brief introduction to the field of GPU computing and includes examples. In particular, computationally expensive analyses employed in a financial market context are coded on a graphics card architecture, which leads to a significant reduction of computing time. In order to demonstrate the wide range of possible applications, a standard model in statistical physics - the Ising model - is ported to a graphics card architecture as well, resulting in large speedup values.
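
    A standard way to port the Ising model to a GPU is a checkerboard Metropolis sweep, sketched below; this is a generic illustration consistent with the description above, not the article's code, and it assumes pre-initialized cuRAND states.

    ```cuda
    #include <curand_kernel.h>

    // Checkerboard Metropolis sweep for the 2D Ising model (J = 1): all sites
    // of one sublattice ("color") can be updated in parallel because they only
    // interact with sites of the other color. Assumes rng[] was initialized
    // elsewhere with curand_init.
    __global__ void ising_sweep(int* spins, int L, float beta, int color,
                                curandState* rng)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= L || y >= L || ((x + y) & 1) != color) return;

        int id   = y * L + x;
        int s    = spins[id];
        int nbrs = spins[y * L + (x + 1) % L] + spins[y * L + (x + L - 1) % L]
                 + spins[((y + 1) % L) * L + x] + spins[((y + L - 1) % L) * L + x];
        float dE = 2.0f * s * nbrs;  // energy change if s flips
        if (dE <= 0.0f || curand_uniform(&rng[id]) < expf(-beta * dE))
            spins[id] = -s;
    }
    ```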

  19. High performance image acquisition and processing architecture for fast plant system controllers based on FPGA and GPU

    International Nuclear Information System (INIS)

    Nieto, J.; Sanz, D.; Guillén, P.; Esquembri, S.; Arcas, G. de; Ruiz, M.; Vega, J.; Castro, R.

    2016-01-01

    Highlights: • To test an image acquisition and processing system for Camera Link devices based on an FPGA, compliant with ITER fast controllers. • To move data acquired from the set NI1483-NIPXIe7966R directly to an NVIDIA GPU using NVIDIA GPUDirect RDMA technology. • To obtain a methodology to include GPU processing in ITER Fast Plant Controllers, using EPICS integration through Nominal Device Support (NDS). - Abstract: The two dominant technologies being used in real-time image processing are the Field Programmable Gate Array (FPGA) and the Graphical Processor Unit (GPU), due to their algorithm parallelization capabilities. But not much work has been done to standardize how these technologies can be integrated in data acquisition systems where control and supervisory requirements are in place, such as ITER (International Thermonuclear Experimental Reactor). This work proposes an architecture, and a development methodology, for developing image acquisition and processing systems based on FPGAs and GPUs compliant with ITER fast controller solutions. A use case based on a Camera Link device connected to an FPGA DAQ device (National Instruments FlexRIO technology) and an NVIDIA Tesla series GPU card has been developed and tested. The proposed architecture has been designed to optimize system performance by minimizing data transfer operations and CPU intervention thanks to the use of NVIDIA GPUDirect RDMA and DMA technologies. This allows moving the data directly between the different hardware elements (FPGA DAQ-GPU-CPU), avoiding CPU intervention and therefore the use of intermediate CPU memory buffers. A special effort has been made to provide a development methodology that, maintaining the highest possible abstraction from the low-level implementation details, allows obtaining solutions that conform to CODAC Core System standards by providing EPICS and Nominal Device Support.

  20. High performance image acquisition and processing architecture for fast plant system controllers based on FPGA and GPU

    Energy Technology Data Exchange (ETDEWEB)

    Nieto, J., E-mail: jnieto@sec.upm.es [Grupo de Investigación en Instrumentación y Acústica Aplicada, Universidad Politécnica de Madrid, Crta. Valencia Km-7, Madrid 28031 (Spain); Sanz, D.; Guillén, P.; Esquembri, S.; Arcas, G. de; Ruiz, M. [Grupo de Investigación en Instrumentación y Acústica Aplicada, Universidad Politécnica de Madrid, Crta. Valencia Km-7, Madrid 28031 (Spain); Vega, J.; Castro, R. [Asociación EURATOM/CIEMAT para Fusión, Madrid (Spain)

    2016-11-15

    Highlights: • To test an image acquisition and processing system for Camera Link devices based on an FPGA, compliant with ITER fast controllers. • To move data acquired from the set NI1483-NIPXIe7966R directly to an NVIDIA GPU using NVIDIA GPUDirect RDMA technology. • To obtain a methodology to include GPU processing in ITER Fast Plant Controllers, using EPICS integration through Nominal Device Support (NDS). - Abstract: The two dominant technologies being used in real-time image processing are the Field Programmable Gate Array (FPGA) and the Graphical Processor Unit (GPU), due to their algorithm parallelization capabilities. But not much work has been done to standardize how these technologies can be integrated in data acquisition systems where control and supervisory requirements are in place, such as ITER (International Thermonuclear Experimental Reactor). This work proposes an architecture, and a development methodology, for developing image acquisition and processing systems based on FPGAs and GPUs compliant with ITER fast controller solutions. A use case based on a Camera Link device connected to an FPGA DAQ device (National Instruments FlexRIO technology) and an NVIDIA Tesla series GPU card has been developed and tested. The proposed architecture has been designed to optimize system performance by minimizing data transfer operations and CPU intervention thanks to the use of NVIDIA GPUDirect RDMA and DMA technologies. This allows moving the data directly between the different hardware elements (FPGA DAQ-GPU-CPU), avoiding CPU intervention and therefore the use of intermediate CPU memory buffers. A special effort has been made to provide a development methodology that, maintaining the highest possible abstraction from the low-level implementation details, allows obtaining solutions that conform to CODAC Core System standards by providing EPICS and Nominal Device Support.

  1. Gr-GDHP: A New Architecture for Globalized Dual Heuristic Dynamic Programming.

    Science.gov (United States)

    Zhong, Xiangnan; Ni, Zhen; He, Haibo

    2017-10-01

    The goal representation globalized dual heuristic dynamic programming (Gr-GDHP) method is proposed in this paper. A goal neural network is integrated into the traditional GDHP method, providing an internal reinforcement signal and its derivatives to help the control and learning process. From the proposed architecture, it is shown that the obtained internal reinforcement signal and its derivatives can adjust themselves online over time, rather than being a fixed or predefined function as in the literature. Furthermore, the obtained derivatives can directly contribute to the objective function of the critic network, whose learning process is thus simplified. Numerical simulation studies are carried out to show the performance of the proposed Gr-GDHP method and to compare the results with other existing adaptive dynamic programming designs. We also investigate this method on a ball-and-beam balancing system. Statistical simulation results are presented for both the Gr-GDHP and the GDHP methods to demonstrate the improved learning and control performance.

  2. Adaptive Optics Simulation for the World's Largest Telescope on Multicore Architectures with Multiple GPUs

    KAUST Repository

    Ltaief, Hatem; Gratadour, Damien; Charara, Ali; Gendron, Eric

    2016-01-01

    We present a high performance comprehensive implementation of a multi-object adaptive optics (MOAO) simulation on multicore architectures with hardware accelerators in the context of computational astronomy. This implementation will be used

  3. Compute-unified device architecture implementation of a block-matching algorithm for multiple graphical processing unit cards.

    Science.gov (United States)

    Massanes, Francesc; Cadennes, Marie; Brankov, Jovan G

    2011-07-01

    In this paper we describe and evaluate a fast implementation of a classical block matching motion estimation algorithm for multiple Graphical Processing Units (GPUs) using the Compute Unified Device Architecture (CUDA) computing engine. The implemented block matching algorithm (BMA) uses the summed absolute difference (SAD) error criterion and full grid search (FS) for finding the optimal block displacement. In this evaluation we compared the execution times of GPU and CPU implementations for images of various sizes, using integer and non-integer search grids. The results show that use of a GPU card can shorten computation time by a factor of 200 for an integer search grid and 1000 for a non-integer search grid. The additional speedup for the non-integer search grid comes from the fact that the GPU has built-in hardware for image interpolation. Further, when using multiple GPU cards, the presented evaluation shows the importance of the data splitting method across multiple cards, but an almost linear speedup with the number of cards is achievable. In addition we compared the execution time of the proposed FS GPU implementation with two existing, highly optimized non-full grid search CPU-based motion estimation methods, namely the implementation of the Pyramidal Lucas Kanade Optical Flow algorithm in OpenCV and the Simplified Unsymmetrical multi-Hexagon search in the H.264/AVC standard. In these comparisons, the FS GPU implementation still showed modest improvement even though its computational complexity is substantially higher than that of the non-FS CPU implementations. We also demonstrated that for an image sequence of 720×480 pixels in resolution, commonly used in video surveillance, the proposed GPU implementation is sufficiently fast for real-time motion estimation at 30 frames per second using two NVIDIA C1060 Tesla GPU cards.
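
    The core of such a full-search SAD kernel might look like the following sketch (integer grid only, image-boundary checks elided; names and layout are illustrative, not the paper's implementation):

    ```cuda
    // One thread evaluates one candidate displacement (dx, dy) for a BxB
    // block at (bx, by); the motion vector is the argmin over sadOut.
    // Assumes the block and the whole search window lie inside the image.
    __global__ void sad_full_search(const unsigned char* ref,
                                    const unsigned char* cur,
                                    int width, int bx, int by, int B,
                                    int range, unsigned int* sadOut)
    {
        int dx = (int)(blockIdx.x * blockDim.x + threadIdx.x) - range;
        int dy = (int)(blockIdx.y * blockDim.y + threadIdx.y) - range;
        if (dx > range || dy > range) return;

        unsigned int sad = 0;
        for (int y = 0; y < B; ++y)
            for (int x = 0; x < B; ++x)
                sad += abs((int)cur[(by + y) * width + bx + x] -
                           (int)ref[(by + dy + y) * width + bx + dx + x]);
        sadOut[(dy + range) * (2 * range + 1) + (dx + range)] = sad;
    }
    ```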

  4. Dual-core Itanium Processor

    CERN Multimedia

    2006-01-01

    Intel’s first dual-core Itanium processor, code-named "Montecito", is a major release of Intel's Itanium 2 Processor Family, which implements the Intel Itanium architecture on a dual-core processor with two cores per die (integrated circuit). Itanium 2 is much more powerful than its predecessor, with lower power consumption and thermal dissipation.

  5. Streaming Multiframe Deconvolutions on GPUs

    Science.gov (United States)

    Lee, M. A.; Budavári, T.

    2015-09-01

    Atmospheric turbulence distorts all ground-based observations, which is especially detrimental to faint detections. The point spread function (PSF) defining this blur is unknown for each exposure and varies significantly over time, making image analysis difficult. Lucky imaging and traditional co-adding throw away much of the information. We developed blind deconvolution algorithms that can simultaneously obtain robust solutions for the background image and all the PSFs. This is done in a streaming setting, which makes it practical for large numbers of big images. We implemented a new tool that runs on GPUs and achieves exceptional running times that can scale to the new time-domain surveys. Our code can quickly and effectively recover high-resolution images exceeding the quality of traditional co-adds. We demonstrate the power of the method on the repeated exposures in the Sloan Digital Sky Survey's Stripe 82.

  6. Control the Morphologies and the Pore Architectures of Mesoporous Silicas through a Dual-Templating Approach

    International Nuclear Information System (INIS)

    Wang, H.; Chen, H.; Xu, Z.; Wang, S.; Li, B.; Li, Y.

    2012-01-01

    Mesoporous silica nanospheres were prepared using a chiral cationic low-molecular-weight amphiphile and organic solvents such as toluene, cyclohexane, and carbon tetrachloride through a dual-templating approach. X-ray diffraction, nitrogen sorption, field emission scanning electron microscopy, and transmission electron microscopy techniques have been used to characterize the mesoporous silicas. The volume ratio of toluene to water plays an important role in controlling the morphologies and the pore architectures of the mesoporous silicas. It was also found that mesoporous silica nanoflakes can be prepared by adding tetrahydrofuran to the reaction mixtures.

  7. Evaluation of Selected Resource Allocation and Scheduling Methods in Heterogeneous Many-Core Processors and Graphics Processing Units

    Directory of Open Access Journals (Sweden)

    Ciznicki Milosz

    2014-12-01

    Full Text Available Heterogeneous many-core computing resources are increasingly popular among users due to their improved performance over homogeneous systems. Many developers have realized that heterogeneous systems, e.g. a combination of a shared-memory multi-core CPU machine with massively parallel Graphics Processing Units (GPUs), can provide significant performance opportunities to a wide range of applications. However, the best overall performance can only be achieved if application tasks are efficiently assigned to different types of processor units in time, taking into account their specific resource requirements. Additionally, one should note that available heterogeneous resources have been designed as general-purpose units, however, with many built-in features accelerating specific application operations. In other words, the same algorithm or application functionality can be implemented as a different task for a CPU or a GPU. Nevertheless, from the perspective of various evaluation criteria, e.g. the total execution time or energy consumption, we may observe completely different results. Therefore, as tasks can be scheduled and managed in many alternative ways on both many-core CPUs and GPUs, and consequently have a huge impact on the overall performance of the computing resources, there is a need for new and improved resource management techniques. In this paper we discuss results achieved during experimental performance studies of selected task scheduling methods in heterogeneous computing systems. Additionally, we present a new architecture for a resource allocation and task scheduling library which provides a generic application programming interface at the operating system level for improving scheduling policies, taking into account the diversity of tasks and the characteristics of heterogeneous computing resources.

  8. Balancing the dual responsibilities of business unit controllers: field and survey evidence

    NARCIS (Netherlands)

    Maas, V.S.; Matejka, M.

    2009-01-01

    We examine how business unit (BU) controllers balance their dual roles of providing information for both local decision-making (local responsibility) and corporate control (functional responsibility). The existing literature suggests that organizations can improve the quality of financial reporting

  9. A high performance GPU implementation of Surface Energy Balance System (SEBS) based on CUDA-C

    NARCIS (Netherlands)

    Abouali, Mohammad; Timmermans, J.; Castillo, Jose E.; Su, Zhongbo

    2013-01-01

    This paper introduces a new implementation of the Surface Energy Balance System (SEBS) algorithm harnessing the many cores available on Graphics Processing Units (GPUs). This new implementation uses Compute Unified Device Architecture C (CUDA-C) programming model and is designed to be executed on a

  10. A parallel approximate string matching under Levenshtein distance on graphics processing units using warp-shuffle operations.

    Directory of Open Access Journals (Sweden)

    ThienLuan Ho

    Full Text Available Approximate string matching with k-differences has a number of practical applications, ranging from pattern recognition to computational biology. This paper proposes an efficient memory-access algorithm for parallel approximate string matching with k-differences on Graphics Processing Units (GPUs). In the proposed algorithm, all threads in the same GPU warp share data using warp-shuffle operations instead of accessing the shared memory. Moreover, we implement the proposed algorithm by exploiting the memory structure of GPUs to optimize its performance. Experimental results for real DNA packages revealed that the proposed algorithm and its implementation achieved speedups of up to 122.64 and 1.53 times over the sequential CPU algorithm and a previous parallel approximate string matching algorithm on GPUs, respectively.
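
    The warp-shuffle idea itself can be shown in a few lines; the toy kernel below (an illustration, not the paper's matcher) has each thread read its left neighbour's value directly from registers, the kind of dependency that appears in Levenshtein-style recurrences.

    ```cuda
    // Threads in a warp exchange values through registers instead of staging
    // them in shared memory. Here each thread obtains the value held by the
    // lane to its left; lane 0 simply receives its own value back.
    __global__ void shuffle_neighbor(const int* in, int* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int v = (i < n) ? in[i] : 0;
        int left = __shfl_up_sync(0xffffffffu, v, 1);  // value from lane - 1
        if (i < n) out[i] = left;
    }
    ```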

  11. Maraghe Observatory and an Effort towards Retrieval of Architectural Design of Astronomical Units

    Directory of Open Access Journals (Sweden)

    Javad Shekari Niri

    2015-03-01

    Full Text Available Maraghe observatory was built by engineers such as Moayiededdin Orozi under the supervision of Khaje Nasireddin Tousi in the 7th century AH. The most significant feature of Maraghe observatory is that architecture was employed on this site to achieve astronomical purposes. Astronomers preferred such observatory buildings because they are superior to wooden and metal instruments with respect to accuracy, absence of size limitations, etc. The architectural design and function of the astronomical units of the Maraghe observatory site remained unclear until recent years, even after the discovery of its foundations in the course of explorations before the Islamic Revolution. After conducting the required studies and investigations, the author found significant clues and, after precise comparisons, succeeded in recovering the original design and function of some astronomical units of this international center. Based on these findings, these astronomical structures can reliably be rebuilt. This research showed that not every circular or polygonal building can be considered an observatory; for example, the form and function of cemetery structures are completely different from astronomical ones. This research also yielded valuable results concerning the stone architectural structures present on the Maraghe observatory hill. In addition, claims that the astronomical units of Maraghe observatory were invented by non-Iranian scientists are rejected, and the rights of Iranian scientists are rationally defended in this regard.

  12. N-Doped Dual Carbon-Confined 3D Architecture rGO/Fe3O4/AC Nanocomposite for High-Performance Lithium-Ion Batteries.

    Science.gov (United States)

    Ding, Ranran; Zhang, Jie; Qi, Jie; Li, Zhenhua; Wang, Chengyang; Chen, Mingming

    2018-04-25

    To address the issues of low electrical conductivity, sluggish lithiation kinetics, and dramatic volume variation in Fe3O4 anodes for lithium-ion batteries, a dual carbon-confined three-dimensional (3D) nanocomposite architecture was synthesized by an electrostatically assisted self-assembly strategy. In the constructed architecture, the ultrafine Fe3O4 subunits (∼10 nm) self-organize into nanospheres (NSs) that are fully coated by amorphous carbon (AC), forming core-shell Fe3O4/AC NSs. Further encapsulation by reduced graphene oxide (rGO) layers yields the dual carbon-confined 3D architecture rGO/Fe3O4/AC. Such a structure restrains adverse reactions with the electrolyte, improves the electronic conductivity, and buffers the mechanical stress of the entire electrode, thus delivering excellent long-term cycling stability (99.4% capacity retention after 465 cycles relative to the second cycle at 5 A g−1). Kinetic analysis reveals that a dual lithium storage mechanism, combining a diffusion reaction mechanism and a surface capacitive behavior mechanism, coexists in the composites. Consequently, the resulting rGO/Fe3O4/AC nanocomposite delivers a high reversible capacity (835.8 mA h g−1 for 300 cycles at 1 A g−1), as well as remarkable rate capability (436.7 mA h g−1 at 10 A g−1).

  13. Suppression of surface-originated gate lag by a dual-channel AlN/GaN high electron mobility transistor architecture

    Science.gov (United States)

    Deen, David A.; Storm, David F.; Scott Katzer, D.; Bass, R.; Meyer, David J.

    2016-08-01

    A dual-channel AlN/GaN high electron mobility transistor (HEMT) architecture is demonstrated that leverages ultra-thin epitaxial layers to suppress surface-related gate lag. Two high-density two-dimensional electron gas (2DEG) channels are utilized in an AlN/GaN/AlN/GaN heterostructure wherein the top 2DEG serves as a quasi-equipotential that screens potential fluctuations resulting from distributed surface and interface states. The bottom channel serves as the transistor's modulated channel. Dual-channel AlN/GaN heterostructures were grown by molecular beam epitaxy on free-standing hydride vapor phase epitaxy GaN substrates. HEMTs fabricated with 300 nm long recessed gates demonstrated a gate lag ratio (GLR) of 0.88 with no degradation in drain current after being bias-stressed in subthreshold. These structures additionally achieved small-signal metrics ft/fmax of 27/46 GHz. These performance results are contrasted with the non-recessed gate dual-channel HEMT with a GLR of 0.74 and 82 mA/mm current collapse with ft/fmax of 48/60 GHz.

  14. Three-dimensional discrete ordinates reactor assembly calculations on GPUs

    Energy Technology Data Exchange (ETDEWEB)

    Evans, Thomas M [ORNL; Joubert, Wayne [ORNL; Hamilton, Steven P [ORNL; Johnson, Seth R [ORNL; Turner, John A [ORNL; Davidson, Gregory G [ORNL; Pandya, Tara M [ORNL

    2015-01-01

    In this paper we describe and demonstrate a discrete ordinates sweep algorithm on GPUs. This sweep algorithm is nested within a multilevel communication-based decomposition based on energy. We demonstrate the effectiveness of this algorithm on detailed three-dimensional critical experiments and PWR lattice problems. For these problems we show improvement factors of 4-6× over conventional communication-based, CPU-only sweeps. These sweep kernel speedups resulted in a factor of 2 total time-to-solution improvement.

  15. A Communication Architecture for an Advanced Extravehicular Mobile Unit

    Science.gov (United States)

    Ivancic, William D.; Sands, Obed S.; Bakula, Casey J.; Oldham, Daniel R.; Wright, Ted; Bradish, Martin A.; Klebau, Joseph M.

    2014-01-01

    This document describes the communication architecture for the Power, Avionics and Software (PAS) 1.0 subsystem for the Advanced Extravehicular Mobility Unit (AEMU). The following systems are described in detail: Caution Warning and Control System, Informatics, Storage, Video, Audio, Communication, and Monitoring Test and Validation. This document also provides some background as well as the purpose and goals of the PAS subsystem being developed at Glenn Research Center (GRC).

  16. Accelerating the explicitly restarted Arnoldi method with GPUs using an auto-tuned matrix vector product

    International Nuclear Information System (INIS)

    Dubois, J.; Calvin, Ch.; Dubois, J.; Petiton, S.

    2011-01-01

    This paper presents a parallelized hybrid single-vector Arnoldi algorithm for computing approximations to eigenpairs of a nonsymmetric matrix. We are interested in the use of accelerators and multi-core units to speed up the Arnoldi process. The main goal is to propose a parallel version of the Arnoldi solver, which can efficiently use multiple multi-core processors or multiple graphics processing units (GPUs) in a mixed coarse- and fine-grain fashion. In the proposed algorithms, this is achieved by auto-tuning the matrix-vector product before starting the Arnoldi eigensolver, as well as by reorganizing the data and global communications so that communication time is reduced. The execution time, performance, and scalability are assessed with well-known dense and sparse test matrices on multiple Nehalem processors, GT200 NVIDIA Tesla, and next-generation Fermi Tesla GPUs. With one processor, we see a performance speedup of 2 to 3x when using all the physical cores, and a total speedup of 2 to 8x when adding a GPU to this multi-core unit, and hence a speedup of 4 to 24x compared to the sequential solver. (authors)
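
    The building block being auto-tuned is a sparse matrix-vector product; a scalar CSR kernel of the kind such a tuner would parameterize (threads per row, blocking, etc.) is sketched below as an illustration, not the authors' implementation.

    ```cuda
    // Scalar CSR sparse matrix-vector product y = A * x: one thread per row.
    // A tuner would choose among variants of this kernel (vectorized rows,
    // different block sizes) based on the matrix structure.
    __global__ void spmv_csr(int nRows, const int* rowPtr, const int* colIdx,
                             const double* vals, const double* x, double* y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= nRows) return;
        double sum = 0.0;
        for (int k = rowPtr[row]; k < rowPtr[row + 1]; ++k)
            sum += vals[k] * x[colIdx[k]];
        y[row] = sum;
    }
    ```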

  17. Partial wave analysis using graphics processing units

    Energy Technology Data Exchange (ETDEWEB)

    Berger, Niklaus; Liu Beijiang; Wang Jike, E-mail: nberger@ihep.ac.c [Institute of High Energy Physics, Chinese Academy of Sciences, 19B Yuquan Lu, Shijingshan, 100049 Beijing (China)

    2010-04-01

    Partial wave analysis is an important tool for determining resonance properties in hadron spectroscopy. For large data samples, however, the un-binned likelihood fits employed are computationally very expensive. At the Beijing Spectrometer (BES) III experiment, an increase in statistics compared to earlier experiments of up to two orders of magnitude is expected. In order to allow for a timely analysis of these datasets, additional computing power with short turnover times has to be made available. It turns out that graphics processing units (GPUs), originally developed for 3D computer games, have an architecture of massively parallel single-instruction multiple-data floating-point units that is almost ideally suited to the algorithms employed in partial wave analysis. We have implemented a framework for tensor manipulation and partial wave fits called GPUPWA. The user writes a program in pure C++ whilst the GPUPWA classes handle computations on the GPU, memory transfers, caching and other technical details. In conjunction with a recent graphics processor, the framework provides a speed-up of the partial wave fit by more than two orders of magnitude compared to legacy FORTRAN code.

  18. Graphics processing units in bioinformatics, computational biology and systems biology.

    Science.gov (United States)

    Nobile, Marco S; Cazzaniga, Paolo; Tangherloni, Andrea; Besozzi, Daniela

    2017-09-01

    Several studies in Bioinformatics, Computational Biology and Systems Biology rely on the definition of physico-chemical or mathematical models of biological systems at different scales and levels of complexity, ranging from the interaction of atoms in single molecules up to genome-wide interaction networks. Traditional computational methods and software tools developed in these research fields share a common trait: they can be computationally demanding on Central Processing Units (CPUs), therefore limiting their applicability in many circumstances. To overcome this issue, general-purpose Graphics Processing Units (GPUs) are gaining increasing attention from the scientific community, as they can considerably reduce the running time required by standard CPU-based software, and allow more intensive investigations of biological systems. In this review, we present a collection of GPU tools recently developed to perform computational analyses in life science disciplines, emphasizing the advantages and the drawbacks in the use of these parallel architectures. The complete list of GPU-powered tools here reviewed is available at http://bit.ly/gputools. © The Author 2016. Published by Oxford University Press.

  19. A flexible algorithm for calculating pair interactions on SIMD architectures

    Science.gov (United States)

    Páll, Szilárd; Hess, Berk

    2013-12-01

    Calculating interactions or correlations between pairs of particles is typically the most time-consuming task in particle simulation or correlation analysis. Straightforward implementations using a double loop over particle pairs have traditionally worked well, especially since compilers usually do a good job of unrolling the inner loop. In order to reach high performance on modern CPU and accelerator architectures, single-instruction multiple-data (SIMD) parallelization has become essential. Avoiding memory bottlenecks is also increasingly important and requires reducing the ratio of memory to arithmetic operations. Moreover, when pairs only interact within a certain cut-off distance, good SIMD utilization can only be achieved by reordering input and output data, which quickly becomes a limiting factor. Here we present an algorithm for SIMD parallelization based on grouping a fixed number of particles, e.g. 2, 4, or 8, into spatial clusters. Calculating all interactions between particles in a pair of such clusters improves data reuse compared to the traditional scheme and results in a more efficient SIMD parallelization. Adjusting the cluster size allows the algorithm to map to SIMD units of various widths. This flexibility not only enables fast and efficient implementation on current CPUs and accelerator architectures like GPUs or Intel MIC, but it also makes the algorithm future-proof. We present the algorithm with an application to molecular dynamics simulations, where we can also make use of the effective buffering the method introduces.
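
    Translated into CUDA terms, the cluster-pair idea might look like the device routine below (a sketch with a placeholder force law; the names and the 4-particle cluster size are illustrative assumptions):

    ```cuda
    #include <cuda_runtime.h>

    #define CL 4  // particles per spatial cluster (2, 4, or 8 in the paper)

    // All CL x CL interactions between an i-cluster and a j-cluster are
    // computed together, so each loaded position is reused CL times and the
    // work maps naturally onto SIMD lanes. The 1/r^3-style force is a
    // placeholder, not a physical interaction model.
    __device__ void cluster_pair(const float3* xi, const float3* xj,
                                 float3* fi, float eps)
    {
        for (int a = 0; a < CL; ++a)          // i-cluster particles
            for (int b = 0; b < CL; ++b) {    // j-cluster particles
                float dx = xi[a].x - xj[b].x;
                float dy = xi[a].y - xj[b].y;
                float dz = xi[a].z - xj[b].z;
                float r2 = dx * dx + dy * dy + dz * dz + eps;  // eps avoids 0-div
                float inv = rsqrtf(r2);
                float s = inv * inv * inv;    // placeholder force magnitude
                fi[a].x += s * dx; fi[a].y += s * dy; fi[a].z += s * dz;
            }
    }
    ```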

  20. SIRFING: Sparse Image Reconstruction For INterferometry using GPUs

    Science.gov (United States)

    Cranmer, Miles; Garsden, Hugh; Mitchell, Daniel A.; Greenhill, Lincoln

    2018-01-01

    We present a deconvolution code for radio interferometric imaging based on the compressed sensing algorithms in Garsden et al. (2015). Being computationally intensive, compressed sensing is ripe for parallelization over GPUs. Our compressed sensing implementation generates images using wavelets, and we have ported the underlying wavelet library to CUDA, targeting the spline filter reconstruction part of the algorithm. The speedup achieved is almost an order of magnitude. The code is modular but is also being integrated into the calibration and imaging pipeline in use by the LEDA project at the Long Wavelength Array (LWA) as well as by the Murchison Widefield Array (MWA).

  1. Area-delay trade-offs of texture decompressors for a graphics processing unit

    Science.gov (United States)

    Novoa Súñer, Emilio; Ituero, Pablo; López-Vallejo, Marisa

    2011-05-01

    Graphics Processing Units have become a booster for the microelectronics industry. However, due to intellectual property issues, there is a serious lack of information on implementation details of the hardware architecture that is behind GPUs. For instance, the way texture is handled and decompressed in a GPU to reduce bandwidth usage has never been dealt with in depth from a hardware point of view. This work presents a comparative study of the hardware implementation of different texture decompression algorithms for both conventional (PCs and video game consoles) and mobile platforms. Circuit synthesis is performed targeting both a reconfigurable hardware platform and a 90 nm standard cell library. Area-delay trade-offs have been extensively analyzed, which allows us to compare the complexity of the decompressors and thus determine the suitability of the algorithms for systems with limited hardware resources.

  2. 78 FR 12369 - United States Government Policy for Institutional Oversight of Life Sciences Dual Use Research of...

    Science.gov (United States)

    2013-02-22

    ... Oversight of Life Sciences Dual Use Research of Concern AGENCY: Office of Science and Technology Policy... comments on the proposed United States Government Policy for Institutional Oversight of Life Sciences Dual... requirements for certain categories of life sciences research at institutions that accept Federal funding for...

  3. Suppression of surface-originated gate lag by a dual-channel AlN/GaN high electron mobility transistor architecture

    International Nuclear Information System (INIS)

    Deen, David A.; Storm, David F.; Scott Katzer, D.; Bass, R.; Meyer, David J.

    2016-01-01

    A dual-channel AlN/GaN high electron mobility transistor (HEMT) architecture is demonstrated that leverages ultra-thin epitaxial layers to suppress surface-related gate lag. Two high-density two-dimensional electron gas (2DEG) channels are utilized in an AlN/GaN/AlN/GaN heterostructure wherein the top 2DEG serves as a quasi-equipotential that screens potential fluctuations resulting from distributed surface and interface states. The bottom channel serves as the transistor's modulated channel. Dual-channel AlN/GaN heterostructures were grown by molecular beam epitaxy on free-standing hydride vapor phase epitaxy GaN substrates. HEMTs fabricated with 300 nm long recessed gates demonstrated a gate lag ratio (GLR) of 0.88 with no degradation in drain current after being bias-stressed in subthreshold. These structures additionally achieved small-signal metrics ft/fmax of 27/46 GHz. These performance results are contrasted with the non-recessed gate dual-channel HEMT with a GLR of 0.74 and 82 mA/mm current collapse with ft/fmax of 48/60 GHz.

  4. Suppression of surface-originated gate lag by a dual-channel AlN/GaN high electron mobility transistor architecture

    Energy Technology Data Exchange (ETDEWEB)

    Deen, David A., E-mail: david.deen@alumni.nd.edu; Storm, David F.; Scott Katzer, D.; Bass, R.; Meyer, David J. [Naval Research Laboratory, Electronics Science and Technology Division, Washington, DC 20375 (United States)

    2016-08-08

    A dual-channel AlN/GaN high electron mobility transistor (HEMT) architecture is demonstrated that leverages ultra-thin epitaxial layers to suppress surface-related gate lag. Two high-density two-dimensional electron gas (2DEG) channels are utilized in an AlN/GaN/AlN/GaN heterostructure wherein the top 2DEG serves as a quasi-equipotential that screens potential fluctuations resulting from distributed surface and interface states. The bottom channel serves as the transistor's modulated channel. Dual-channel AlN/GaN heterostructures were grown by molecular beam epitaxy on free-standing hydride vapor phase epitaxy GaN substrates. HEMTs fabricated with 300 nm long recessed gates demonstrated a gate lag ratio (GLR) of 0.88 with no degradation in drain current after being bias-stressed in subthreshold. These structures additionally achieved small-signal metrics ft/fmax of 27/46 GHz. These performance results are contrasted with the non-recessed gate dual-channel HEMT with a GLR of 0.74 and 82 mA/mm current collapse with ft/fmax of 48/60 GHz.

  5. GPU-FS-kNN: a software tool for fast and scalable kNN computation using GPUs.

    Directory of Open Access Journals (Sweden)

    Ahmed Shamsul Arefin

    Full Text Available BACKGROUND: The analysis of biological networks has become a major challenge due to the recent development of high-throughput techniques that are rapidly producing very large data sets. The exploding volumes of biological data call for extreme computational power and special computing facilities (i.e. super-computers). An inexpensive solution, such as General Purpose computation based on Graphics Processing Units (GPGPU), can be adapted to tackle this challenge, but the limitation of the device's internal memory can pose a new problem of scalability. Efficient data and computational parallelism with partitioning is required to provide a fast and scalable solution to this problem. RESULTS: We propose an efficient parallel formulation of the k-Nearest Neighbour (kNN) search problem, which is a popular method for classifying objects in several fields of research, such as pattern recognition, machine learning and bioinformatics. Although very simple and straightforward, the kNN search degrades dramatically in performance for large data sets, since the task is computationally intensive. The proposed approach is not only fast but also scalable to large-scale instances. Based on our approach, we implemented a software tool GPU-FS-kNN (GPU-based Fast and Scalable k-Nearest Neighbour) for CUDA-enabled GPUs. The basic approach is simple and adaptable to other available GPU architectures. We observed speed-ups of 50-60 times compared with the CPU implementation on a well-known breast microarray study and its associated data sets. CONCLUSION: Our GPU-based Fast and Scalable k-Nearest Neighbour search technique (GPU-FS-kNN) provides a significant performance improvement for nearest-neighbour computation in large-scale networks. Source code and the software tool are available under the GNU Public License (GPL) at https://sourceforge.net/p/gpufsknn/.

  6. CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units

    Directory of Open Access Journals (Sweden)

    Maskell Douglas L

    2009-05-01

    Full Text Available Abstract Background The Smith-Waterman algorithm is one of the most widely used tools for searching biological sequence databases due to its high sensitivity. Unfortunately, the Smith-Waterman algorithm is computationally demanding, which is further compounded by the exponential growth of sequence databases. The recent emergence of many-core architectures, and their associated programming interfaces, provides an opportunity to accelerate sequence database searches using commonly available and inexpensive hardware. Findings Our CUDASW++ implementation (benchmarked on a single-GPU NVIDIA GeForce GTX 280 graphics card and a dual-GPU GeForce GTX 295 graphics card) provides a significant performance improvement compared to other publicly available implementations, such as SWPS3, CBESW, SW-CUDA, and NCBI-BLAST. CUDASW++ supports query sequences of length up to 59K. For query sequences ranging in length from 144 to 5,478 in Swiss-Prot release 56.6, the single-GPU version achieves an average performance of 9.509 GCUPS with a lowest performance of 9.039 GCUPS and a highest performance of 9.660 GCUPS, and the dual-GPU version achieves an average performance of 14.484 GCUPS with a lowest performance of 10.660 GCUPS and a highest performance of 16.087 GCUPS. Conclusion CUDASW++ is publicly available open-source software. It provides a significant performance improvement for Smith-Waterman-based protein sequence database searches by fully exploiting the compute capability of commonly used CUDA-enabled low-cost GPUs.
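
    At the heart of any such implementation is the Smith-Waterman cell recurrence; the device function below sketches it with linear gap penalties as an illustration (CUDASW++ itself uses affine gaps and extensive memory-layout optimizations on top of this).

    ```cuda
    // One cell of the Smith-Waterman dynamic-programming matrix with linear
    // gap penalties: the local-alignment floor at zero distinguishes it from
    // global (Needleman-Wunsch) alignment.
    __device__ int sw_cell(int diag, int up, int left,
                           char a, char b, int match, int mismatch, int gap)
    {
        int score = diag + (a == b ? match : mismatch);  // substitution
        score = max(score, up   - gap);                  // gap in one sequence
        score = max(score, left - gap);                  // gap in the other
        return max(score, 0);                            // local alignment floor
    }
    ```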

  7. Batched Triangular Dense Linear Algebra Kernels for Very Small Matrix Sizes on GPUs

    KAUST Repository

    Charara, Ali; Keyes, David E.; Ltaief, Hatem

    2017-01-01

    Batched dense linear algebra kernels are becoming ubiquitous in scientific applications, ranging from tensor contractions in deep learning to data compression in hierarchical low-rank matrix approximation. Within a single API call, these kernels are capable of simultaneously launching up to thousands of similar matrix computations, removing the expensive overhead of multiple API calls while increasing the occupancy of the underlying hardware. A challenge is that for the existing hardware landscape (x86, GPUs, etc.), only a subset of the required batched operations is implemented by the vendors, with limited support for very small problem sizes. We describe the design and performance of a new class of batched triangular dense linear algebra kernels on very small data sizes using single and multiple GPUs. By deploying two-sided recursive formulations, stressing the register usage, maintaining data locality, reducing thread synchronization and fusing successive kernel calls, the new batched kernels outperform existing state-of-the-art implementations.
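
    For contrast with the recursive formulations described above, the sketch below calls the vendor-provided batched triangular solve in cuBLAS; it illustrates the batched-API pattern only, since the paper's point is precisely that such vendor calls underperform at very small sizes.

    ```cuda
    #include <cublas_v2.h>

    // Vendor batched triangular solve: for each i, solve L_i * X_i = B_i with
    // L_i lower triangular, overwriting B_i with X_i. Sizes are placeholders.
    void batched_trsm(cublasHandle_t handle, const float* const* dA,
                      float* const* dB, int m, int n, int batch)
    {
        const float one = 1.0f;
        cublasStrsmBatched(handle, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_LOWER,
                           CUBLAS_OP_N, CUBLAS_DIAG_NON_UNIT,
                           m, n, &one, dA, m, dB, m, batch);
    }
    ```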

  8. Batched Triangular Dense Linear Algebra Kernels for Very Small Matrix Sizes on GPUs

    KAUST Repository

    Charara, Ali

    2017-03-06

    Batched dense linear algebra kernels are becoming ubiquitous in scientific applications, ranging from tensor contractions in deep learning to data compression in hierarchical low-rank matrix approximation. Within a single API call, these kernels are capable of simultaneously launching up to thousands of similar matrix computations, removing the expensive overhead of multiple API calls while increasing the occupancy of the underlying hardware. A challenge is that for the existing hardware landscape (x86, GPUs, etc.), only a subset of the required batched operations is implemented by the vendors, with limited support for very small problem sizes. We describe the design and performance of a new class of batched triangular dense linear algebra kernels on very small data sizes using single and multiple GPUs. By deploying two-sided recursive formulations, stressing the register usage, maintaining data locality, reducing thread synchronization and fusing successive kernel calls, the new batched kernels outperform existing state-of-the-art implementations.

  9. Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics

    Energy Technology Data Exchange (ETDEWEB)

    Ronald Babich, Michael Clark, Balint Joo

    2010-11-01

    Graphics Processing Units (GPUs) are having a transformational effect on numerical lattice quantum chromodynamics (LQCD) calculations of importance in nuclear and particle physics. The QUDA library provides a package of mixed precision sparse matrix linear solvers for LQCD applications, supporting single GPUs based on NVIDIA's Compute Unified Device Architecture (CUDA). This library, interfaced to the QDP++/Chroma framework for LQCD calculations, is currently in production use on the "9g" cluster at the Jefferson Laboratory, enabling unprecedented price/performance for a range of problems in LQCD. Nevertheless, memory constraints on current GPU devices limit the problem sizes that can be tackled. In this contribution we describe the parallelization of the QUDA library onto multiple GPUs using MPI, including strategies for the overlapping of communication and computation. We report on both weak and strong scaling for up to 32 GPUs interconnected by InfiniBand, on which we sustain in excess of 4 Tflops.
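
    The communication/computation overlap might be structured as in the host-side sketch below; the kernels and buffers are placeholders, not QUDA's actual interfaces, and the halo exchange is reduced to a single face for brevity.

    ```cuda
    #include <mpi.h>
    #include <cuda_runtime.h>

    __global__ void interior_kernel(float* f, int n) { /* bulk-site update elided */ }
    __global__ void boundary_kernel(float* f, int n) { /* face-site update elided */ }

    // Interior sites are updated in one stream while a halo face travels
    // over MPI in another, hiding communication behind computation.
    void overlapped_step(float* dField, float* dFace, float* hSend, float* hRecv,
                         size_t faceBytes, int n, int up, int down,
                         cudaStream_t compute, cudaStream_t comm)
    {
        interior_kernel<<<256, 256, 0, compute>>>(dField, n);  // overlaps comms
        cudaMemcpyAsync(hSend, dFace, faceBytes, cudaMemcpyDeviceToHost, comm);
        cudaStreamSynchronize(comm);
        MPI_Sendrecv(hSend, (int)faceBytes, MPI_BYTE, up,   0,
                     hRecv, (int)faceBytes, MPI_BYTE, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        cudaMemcpyAsync(dFace, hRecv, faceBytes, cudaMemcpyHostToDevice, comm);
        boundary_kernel<<<256, 256, 0, comm>>>(dField, n);     // ghosts now valid
        cudaDeviceSynchronize();
    }
    ```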

  10. Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics

    International Nuclear Information System (INIS)

    Babich, Ronald; Clark, Michael; Joo, Balint

    2010-01-01

    Graphics Processing Units (GPUs) are having a transformational effect on numerical lattice quantum chromodynamics (LQCD) calculations of importance in nuclear and particle physics. The QUDA library provides a package of mixed precision sparse matrix linear solvers for LQCD applications, supporting single GPUs based on NVIDIA's Compute Unified Device Architecture (CUDA). This library, interfaced to the QDP++/Chroma framework for LQCD calculations, is currently in production use on the '9g' cluster at the Jefferson Laboratory, enabling unprecedented price/performance for a range of problems in LQCD. Nevertheless, memory constraints on current GPU devices limit the problem sizes that can be tackled. In this contribution we describe the parallelization of the QUDA library onto multiple GPUs using MPI, including strategies for the overlapping of communication and computation. We report on both weak and strong scaling for up to 32 GPUs interconnected by InfiniBand, on which we sustain in excess of 4 Tflops.

  11. Enhanced static ground power unit based on flying capacitor based h-bridge hybrid active-neutral-point-clamped converter

    DEFF Research Database (Denmark)

    Abarzadeh, Mostafa; Madadi Kojabadi, Hossein; Deng, Fujin

    2016-01-01

    Static power converters have various applications, such as static ground power units (GPUs) for airplanes. This study proposes a new configuration of a static GPU based on a novel nine-level flying capacitor h-bridge active-neutral-point-clamped (FCHB_ANPC) converter. The main advantages of the p...

  12. Architectural improvements and technological enhancements for the APEnet+ interconnect system

    International Nuclear Information System (INIS)

    Ammendola, R.; Biagioni, A.; Frezza, O.; Lonardo, A.; Cicero, F. Lo; Martinelli, M.; Paolucci, P.S.; Pastorelli, E.; Simula, F.; Tosoratto, L.; Vicini, P.; Rossetti, D.

    2015-01-01

    The APEnet+ board delivers a point-to-point, low-latency, 3D torus network interface card. In this paper we describe the latest generation of the APEnet NIC, APEnet v5, integrated in a PCIe Gen3 board based on a state-of-the-art, 28 nm Altera Stratix V FPGA. The NIC features a network architecture designed following the Remote DMA paradigm and tailored to tightly bind the computing power of modern GPUs to the communication fabric. For the APEnet v5 board we show characterizing figures such as the achieved bandwidth and BER, obtained by exploiting new high-performance ALTERA transceivers and PCIe Gen3 compliance.

  13. Jet browser model accelerated by GPUs

    Directory of Open Access Journals (Sweden)

    Forster Richárd

    2016-12-01

    Full Text Available Over the last decades, experimental particle physics has developed thanks, among other things, to the growing capacity of computers. This has made it possible to probe the structure of matter down to the level of the quark-gluon plasma in the strong interaction, and experimental evidence has supported the theory by measuring the predicted results. Since its inception, researchers have been interested in track reconstruction. We studied the jet browser model, which was developed for a 4π calorimeter. This method works on the measurement data set, which contains the coordinates of interaction points in the detector space, and it allows the trajectory reconstruction of the final-state particles to be examined. We keep the total energy at constant values, satisfying the Gauss law. Using GPUs, the evaluation of the model can be drastically accelerated, as we were able to achieve up to a 223-fold speedup compared to a CPU-based parallel implementation.

  14. A Strategy for Automatic Performance Tuning of Stencil Computations on GPUs

    Directory of Open Access Journals (Sweden)

    Joseph D. Garvey

    2018-01-01

    Full Text Available We propose and evaluate a novel strategy for tuning the performance of a class of stencil computations on Graphics Processing Units. The strategy uses a machine learning model to predict the optimal way to load data from memory followed by a heuristic that divides other optimizations into groups and exhaustively explores one group at a time. We use a set of 104 synthetic OpenCL stencil benchmarks that are representative of many real stencil computations. We first demonstrate the need for auto-tuning by showing that the optimization space is sufficiently complex that simple approaches to determining a high-performing configuration fail. We then demonstrate the effectiveness of our approach on NVIDIA and AMD GPUs. Relative to a random sampling of the space, we find configurations that are 12%/32% faster on the NVIDIA/AMD platform in 71% and 4% less time, respectively. Relative to an expert search, we achieve 5% and 9% better performance on the two platforms in 89% and 76% less time. We also evaluate our strategy for different stencil computational intensities, varying array sizes and shapes, and in combination with expert search.
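
    A minimal 5-point stencil of the kind the auto-tuner targets is sketched below in CUDA (the paper's benchmarks are OpenCL, but the structure is the same); the tunable choices, such as block shape, shared-memory staging, and unrolling, are exactly the optimization space being searched.

    ```cuda
    // Naive 5-point 2D stencil: each thread updates one interior grid point
    // from its four neighbors. Tuning decides how this loop is blocked,
    // how data is loaded, and how much work each thread performs.
    __global__ void stencil5(const float* in, float* out, int nx, int ny)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x <= 0 || y <= 0 || x >= nx - 1 || y >= ny - 1) return;
        out[y * nx + x] = 0.2f * (in[y * nx + x]
                                + in[y * nx + x - 1] + in[y * nx + x + 1]
                                + in[(y - 1) * nx + x] + in[(y + 1) * nx + x]);
    }
    ```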

  15. Mosaic: An Application-Transparent Hardware-Software Cooperative Memory Manager for GPUs

    OpenAIRE

    Ausavarungnirun, Rachata; Landgraf, Joshua; Miller, Vance; Ghose, Saugata; Gandhi, Jayneel; Rossbach, Christopher J.; Mutlu, Onur

    2018-01-01

    Modern GPUs face a trade-off on how the page size used for memory management affects address translation and demand paging. Support for multiple page sizes can help relax the page size trade-off so that address translation and demand paging optimizations work together synergistically. However, existing page coalescing and splintering policies require costly base page migrations that undermine the benefits multiple page sizes provide. In this paper, we observe that GPGPU applications present a...

  16. PuReMD-GPU: A reactive molecular dynamics simulation package for GPUs

    International Nuclear Information System (INIS)

    Kylasa, S.B.; Aktulga, H.M.; Grama, A.Y.

    2014-01-01

    We present an efficient and highly accurate GP-GPU implementation of our community code, PuReMD, for reactive molecular dynamics simulations using the ReaxFF force field. PuReMD and its incorporation into LAMMPS (Reax/C) are used by a large number of research groups worldwide for simulating diverse systems, ranging from biomembranes to explosives (RDX), at an atomistic level of detail. The sub-femtosecond time-steps associated with ReaxFF strongly motivate significant improvements to per-timestep simulation time through effective use of GPUs. This paper presents, in detail, the design and implementation of PuReMD-GPU, which enables ReaxFF simulations on GPUs, as well as various performance optimization techniques we developed to obtain high performance on state-of-the-art hardware. Comprehensive experiments on model systems (bulk water and amorphous silica) are presented to quantify the performance improvements achieved by PuReMD-GPU and to verify its accuracy. In particular, our experiments show up to 16× improvement in runtime compared to our highly optimized CPU-only single-core ReaxFF implementation. PuReMD-GPU is a unique production code, and is currently available on request from the authors.

  17. PuReMD-GPU: A reactive molecular dynamics simulation package for GPUs

    Energy Technology Data Exchange (ETDEWEB)

    Kylasa, S.B., E-mail: skylasa@purdue.edu [Department of Elec. and Comp. Eng., Purdue University, West Lafayette, IN 47907 (United States); Aktulga, H.M., E-mail: hmaktulga@lbl.gov [Lawrence Berkeley National Laboratory, 1 Cyclotron Rd, MS 50F-1650, Berkeley, CA 94720 (United States); Grama, A.Y., E-mail: ayg@cs.purdue.edu [Department of Computer Science, Purdue University, West Lafayette, IN 47907 (United States)

    2014-09-01

    We present an efficient and highly accurate GP-GPU implementation of our community code, PuReMD, for reactive molecular dynamics simulations using the ReaxFF force field. PuReMD and its incorporation into LAMMPS (Reax/C) are used by a large number of research groups worldwide for simulating diverse systems, ranging from biomembranes to explosives (RDX), at an atomistic level of detail. The sub-femtosecond time-steps associated with ReaxFF strongly motivate significant improvements to per-timestep simulation time through effective use of GPUs. This paper presents, in detail, the design and implementation of PuReMD-GPU, which enables ReaxFF simulations on GPUs, as well as various performance optimization techniques we developed to obtain high performance on state-of-the-art hardware. Comprehensive experiments on model systems (bulk water and amorphous silica) are presented to quantify the performance improvements achieved by PuReMD-GPU and to verify its accuracy. In particular, our experiments show up to 16× improvement in runtime compared to our highly optimized CPU-only single-core ReaxFF implementation. PuReMD-GPU is a unique production code, and is currently available on request from the authors.

  18. Solution to PDEs using radial basis function finite-differences (RBF-FD) on multiple GPUs

    International Nuclear Information System (INIS)

    Bollig, Evan F.; Flyer, Natasha; Erlebacher, Gordon

    2012-01-01

    This paper presents parallelization strategies for the radial basis function-finite difference (RBF-FD) method. As a generalized finite-differencing scheme, the RBF-FD method functions without the need for underlying meshes to structure nodes. It offers high-order accurate approximation and scales as O(N) per time step, with N being the total number of nodes. To our knowledge, this is the first implementation of the RBF-FD method to leverage GPU accelerators for the solution of PDEs. Additionally, this implementation is the first to span both multiple CPUs and multiple GPUs. OpenCL kernels target the GPUs, and inter-processor communication and synchronization are managed by the Message Passing Interface (MPI). We verify our implementation of the RBF-FD method with two hyperbolic PDEs on the sphere, and demonstrate up to 9x speedup on a commodity GPU with unoptimized kernel implementations. On a high-performance cluster, the method achieves up to 7x speedup for the maximum problem size of 27,556 nodes.

  19. A framework for dense triangular matrix kernels on various manycore architectures

    KAUST Repository

    Charara, Ali

    2017-06-06

    We present a new high-performance framework for dense triangular Basic Linear Algebra Subroutines (BLAS) kernels, i.e., triangular matrix-matrix multiplication (TRMM) and triangular solve (TRSM), on various manycore architectures. This is an extension of a previous work on a single GPU by the same authors, presented at the EuroPar'16 conference, in which we demonstrated the effectiveness of recursive formulations in enhancing the performance of these kernels. In this paper, the performance of triangular BLAS kernels on a single GPU is further enhanced by implementing customized in-place CUDA kernels for TRMM and TRSM, which are called at the bottom of the recursion. In addition, a multi-GPU implementation of TRMM and TRSM is proposed and we show an almost linear performance scaling, as the number of GPUs increases. Finally, the algorithmic recursive formulation of these triangular BLAS kernels is in fact oblivious to the targeted hardware architecture. We, therefore, port these recursive kernels to homogeneous x86 hardware architectures by relying on the vendor optimized BLAS implementations. Results reported on various hardware architectures highlight a significant performance improvement against state-of-the-art implementations. These new kernels are freely available in the KAUST BLAS (KBLAS) open-source library at https://github.com/ecrc/kblas.
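
    The recursive formulation can be sketched for a lower-triangular solve: the problem splits into a half-size solve, a large matrix product, and another half-size solve, with the base case standing in for the customized GPU kernel. A minimal NumPy sketch under those assumptions, not the KBLAS code itself:

    import numpy as np

    def trsm_lower(L, B, cutoff=64):
        """Solve L @ X = B for X, with L lower triangular (recursive form).

        Below the cutoff the solve falls back to a library routine; in the
        paper's framework this base case is a customized in-place GPU kernel.
        """
        n = L.shape[0]
        if n <= cutoff:
            return np.linalg.solve(L, B)   # stand-in for the optimized kernel
        m = n // 2
        X1 = trsm_lower(L[:m, :m], B[:m], cutoff)
        # Update the trailing right-hand side with a large GEMM-like product,
        # which is where recursive formulations recover most performance.
        X2 = trsm_lower(L[m:, m:], B[m:] - L[m:, :m] @ X1, cutoff)
        return np.vstack([X1, X2])

    # quick correctness check against a direct solve
    rng = np.random.default_rng(0)
    L = np.tril(rng.standard_normal((256, 256))) + 256 * np.eye(256)
    B = rng.standard_normal((256, 8))
    assert np.allclose(trsm_lower(L, B), np.linalg.solve(L, B))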

  20. A framework for dense triangular matrix kernels on various manycore architectures

    KAUST Repository

    Charara, Ali; Keyes, David E.; Ltaief, Hatem

    2017-01-01

    We present a new high-performance framework for dense triangular Basic Linear Algebra Subroutines (BLAS) kernels, i.e., triangular matrix-matrix multiplication (TRMM) and triangular solve (TRSM), on various manycore architectures. This is an extension of a previous work on a single GPU by the same authors, presented at the EuroPar'16 conference, in which we demonstrated the effectiveness of recursive formulations in enhancing the performance of these kernels. In this paper, the performance of triangular BLAS kernels on a single GPU is further enhanced by implementing customized in-place CUDA kernels for TRMM and TRSM, which are called at the bottom of the recursion. In addition, a multi-GPU implementation of TRMM and TRSM is proposed and we show an almost linear performance scaling, as the number of GPUs increases. Finally, the algorithmic recursive formulation of these triangular BLAS kernels is in fact oblivious to the targeted hardware architecture. We, therefore, port these recursive kernels to homogeneous x86 hardware architectures by relying on the vendor optimized BLAS implementations. Results reported on various hardware architectures highlight a significant performance improvement against state-of-the-art implementations. These new kernels are freely available in the KAUST BLAS (KBLAS) open-source library at https://github.com/ecrc/kblas.

  1. Proposal for Dual Pressurized Light Water Reactor Unit Producing 2000 MWe

    International Nuclear Information System (INIS)

    Kang, Kyoung Min; Noh, Sang Woo; Suh, Kune Yull

    2009-01-01

    The Dual Unit Optimizer 2000 MWe (DUO2000) is put forward as a new design concept for large-power nuclear plants to cope with the economic and safety challenges facing the 21st-century green and sustainable energy industry. DUO2000 houses two nuclear steam supply systems (NSSSs) of the Optimized Power Reactor 1000 MWe (OPR1000)-like pressurized water reactor (PWR) in a single containment, so as to double the capacity of the plant. The idea behind DUO may as well be extended to combining any number of NSSSs of PWRs or pressurized heavy water reactors (PHWRs), or even boiling water reactors (BWRs). Once proven in water reactors, the technology may even be expanded to gas-cooled, liquid-metal-cooled, and molten-salt-cooled reactors. With its in-vessel retention external reactor vessel cooling (IVR-ERVC) as the severe accident management strategy, DUO can not only put the single most contentious PWR safety issue to an end, but also pave the way to very promising large power capacity while dispensing with the huge redesign cost for Generation III+ nuclear systems. Five prototypes are presented for the DUO2000, and their respective advantages and drawbacks are considered. The strengths include, but are not necessarily limited to, reducing the cost of construction by decreasing the number of containment buildings from two to one, minimizing the cost of NSSS and control systems by sharing between the dual units, and lessening the maintenance cost by uniting the NSSSs, just to name a few. The latent threats are discussed as well

  2. Computational performance of a smoothed particle hydrodynamics simulation for shared-memory parallel computing

    Science.gov (United States)

    Nishiura, Daisuke; Furuichi, Mikito; Sakaguchi, Hide

    2015-09-01

    The computational performance of a smoothed particle hydrodynamics (SPH) simulation is investigated for three types of current shared-memory parallel computer devices: many integrated core (MIC) processors, graphics processing units (GPUs), and multi-core CPUs. We are especially interested in efficient shared-memory allocation methods for each chipset, because the efficient data access patterns differ between compute unified device architecture (CUDA) programming for GPUs and OpenMP programming for MIC processors and multi-core CPUs. We first introduce several parallel implementation techniques for the SPH code, and then examine these on our target computer architectures to determine the most effective algorithms for each processor unit. In addition, we evaluate the effective computing performance and power efficiency of the SPH simulation on each architecture, as these are critical metrics for overall performance in a multi-device environment. In our benchmark test, the GPU is found to produce the best arithmetic performance as a standalone device unit, and gives the most efficient power consumption. The multi-core CPU obtains the most effective computing performance. The computational speed of the MIC processor on Xeon Phi approached that of two Xeon CPUs. This indicates that using MICs is an attractive choice for existing SPH codes on multi-core CPUs parallelized by OpenMP, as it gains computational acceleration without the need for significant changes to the source code.

  3. Massive parallelization of a 3D finite difference electromagnetic forward solution using domain decomposition methods on multiple CUDA enabled GPUs

    Science.gov (United States)

    Schultz, A.

    2010-12-01

    3D forward solvers lie at the core of inverse formulations used to image the variation of electrical conductivity within the Earth's interior. This property is associated with variations in temperature, composition, phase, presence of volatiles, and in specific settings, the presence of groundwater, geothermal resources, oil/gas or minerals. The high cost of 3D solutions has been a stumbling block to wider adoption of 3D methods. Parallel algorithms for modeling frequency domain 3D EM problems have not achieved wide scale adoption, with emphasis on fairly coarse grained parallelism using MPI and similar approaches. The communications bandwidth as well as the latency required to send and receive network communication packets is a limiting factor in implementing fine grained parallel strategies, inhibiting wide adoption of these algorithms. Leading Graphics Processor Unit (GPU) companies now produce GPUs with hundreds of GPU processor cores per die. The footprint, in silicon, of the GPU's restricted instruction set is much smaller than the general purpose instruction set required of a CPU. Consequently, the density of processor cores on a GPU can be much greater than on a CPU. GPUs also have local memory, registers and high speed communication with host CPUs, usually through PCIe type interconnects. The extremely low cost and high computational power of GPUs provides the EM geophysics community with an opportunity to achieve fine grained (i.e. massive) parallelization of codes on low cost hardware. The current generation of GPUs (e.g. NVidia Fermi) provides 3 billion transistors per chip die, with nearly 500 processor cores and up to 6 GB of fast (DDR5) GPU memory. This latest generation of GPU supports fast hardware double precision (64 bit) floating point operations of the type required for frequency domain EM forward solutions. Each Fermi GPU board can sustain nearly 1 TFLOP in double precision, and multiple boards can be installed in the host computer system. We

  4. GPUs for fast pattern matching in the RICH of the NA62 experiment

    International Nuclear Information System (INIS)

    Lamanna, Gianluca; Collazuol, Gianmaria; Sozzi, Marco

    2011-01-01

    In rare-decay experiments an effective online selection is a fundamental part of the data acquisition system (DAQ), in order to reduce both the quantity of data written on tape and the bandwidth requirements of the DAQ system. A multilevel architecture is commonly used to achieve a higher reduction factor, exploiting dedicated custom hardware and flexible software in standard computers. In this paper we discuss the possibility of using commercial video card processors (GPUs) to build a fast and effective trigger system, at both the hardware and software level. The computing power of GPUs allows the design of a real-time system in which trigger decisions are taken directly in the video processor with a defined maximum latency. This allows building the lowest trigger levels on standard off-the-shelf PCs with CPU and GPU (instead of the commonly adopted solutions based on custom electronics with FPGAs or ASICs) with enhanced, high-performance computation capabilities, resulting in high rejection power, high efficiency and simpler low-level triggers. The ongoing work presented here shows the results achieved for fast pattern matching in the RICH detector of the NA62 experiment at CERN, which aims at measuring the branching ratio of the ultra-rare decay K+ → π+νν̄ and is considered as the use case, although the versatility and customizability of this approach easily allow exporting the concept to different contexts. In particular, the application is related to particle identification in the RICH detector of the NA62 experiment, where the rate of events to be analyzed will be around 10 MHz. The results obtained in lab tests are very encouraging for moving towards a working prototype. Due to the use of off-the-shelf technology, in continuous development for other purposes (video games, image editing, ...), the architecture described could easily be exported to other experiments to build powerful, flexible and fully customizable trigger systems.

  5. GPUs for fast pattern matching in the RICH of the NA62 experiment

    Energy Technology Data Exchange (ETDEWEB)

    Lamanna, Gianluca, E-mail: gianluca.lamanna@cern.ch [CERN, 1211 Geneve 23 (Switzerland); Collazuol, Gianmaria, E-mail: gianmaria.collazuol@cern.ch [INFN Pisa, Largo Pontecorvo 3, 56127 Pisa (Italy); Sozzi, Marco, E-mail: marco.sozzi@cern.ch [University and INFN Pisa, Largo Pontecorvo 3, 56127 Pisa (Italy)

    2011-05-21

    In rare-decay experiments an effective online selection is a fundamental part of the data acquisition system (DAQ), in order to reduce both the quantity of data written on tape and the bandwidth requirements of the DAQ system. A multilevel architecture is commonly used to achieve a higher reduction factor, exploiting dedicated custom hardware and flexible software in standard computers. In this paper we discuss the possibility of using commercial video card processors (GPUs) to build a fast and effective trigger system, at both the hardware and software level. The computing power of GPUs allows the design of a real-time system in which trigger decisions are taken directly in the video processor with a defined maximum latency. This allows building the lowest trigger levels on standard off-the-shelf PCs with CPU and GPU (instead of the commonly adopted solutions based on custom electronics with FPGAs or ASICs) with enhanced, high-performance computation capabilities, resulting in high rejection power, high efficiency and simpler low-level triggers. The ongoing work presented here shows the results achieved for fast pattern matching in the RICH detector of the NA62 experiment at CERN, which aims at measuring the branching ratio of the ultra-rare decay K+ → π+νν̄ and is considered as the use case, although the versatility and customizability of this approach easily allow exporting the concept to different contexts. In particular, the application is related to particle identification in the RICH detector of the NA62 experiment, where the rate of events to be analyzed will be around 10 MHz. The results obtained in lab tests are very encouraging for moving towards a working prototype. Due to the use of off-the-shelf technology, in continuous development for other purposes (video games, image editing, ...), the architecture described could easily be exported to other experiments to build powerful, flexible and fully customizable trigger systems.

  6. TernaryNet: faster deep model inference without GPUs for medical 3D segmentation using sparse and binary convolutions.

    Science.gov (United States)

    Heinrich, Mattias P; Blendowski, Max; Oktay, Ozan

    2018-05-30

    Deep convolutional neural networks (DCNN) are currently ubiquitous in medical imaging. While their versatility and high-quality results for common image analysis tasks, including segmentation, localisation and prediction, are astonishing, the large representational power comes at the cost of highly demanding computational effort. This limits their practical applications for image-guided interventions and diagnostic (point-of-care) support using mobile devices without graphics processing units (GPU). We propose a new scheme that approximates both trainable weights and neural activations in deep networks by ternary values and tackles the open question of backpropagation when dealing with non-differentiable functions. Our solution enables the removal of the expensive floating-point matrix multiplications throughout any convolutional neural network and replaces them by energy- and time-preserving binary operators and population counts. We evaluate our approach for the segmentation of the pancreas in CT. Here, our ternary approximation within a fully convolutional network leads to more than 90% memory reductions and high accuracy (without any post-processing) with a Dice overlap of 71.0% that comes close to the one obtained when using networks with high-precision weights and activations. We further provide a concept for sub-second inference without GPUs and demonstrate significant improvements in comparison with binary quantisation and without our proposed ternary hyperbolic tangent continuation. We present a key enabling technique for highly efficient DCNN inference without GPUs that will help to bring the advances of deep learning to practical clinical applications. It also has great promise for improving accuracies in large-scale medical data retrieval.
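
    The replacement of floating-point multiply-accumulate by binary operators and population counts can be illustrated for ternary vectors encoded as (positive, negative) bit masks; the encoding and the threshold below are illustrative assumptions, not the paper's exact quantisation rule.

    def ternarize(values, threshold=0.05):
        """Map floats to {-1, 0, +1}, encoded as (positive, negative) masks.
        The threshold is an illustrative assumption, not the paper's rule."""
        pos = neg = 0
        for i, v in enumerate(values):
            if v > threshold:
                pos |= 1 << i
            elif v < -threshold:
                neg |= 1 << i
        return pos, neg

    def ternary_dot(w, a):
        """Dot product of two ternary vectors using only bitwise operators
        and population counts instead of floating-point multiply-adds."""
        wp, wn = w
        ap, an = a
        agree = (wp & ap) | (wn & an)      # element products equal to +1
        clash = (wp & an) | (wn & ap)      # element products equal to -1
        return bin(agree).count("1") - bin(clash).count("1")

    w = ternarize([0.3, -0.2, 0.0, 0.7])   # ternary: +1, -1, 0, +1
    a = ternarize([0.5, 0.1, -0.4, -0.6])  # ternary: +1, +1, -1, -1
    print(ternary_dot(w, a))               # (+1) + (-1) + 0 + (-1) = -1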

  7. Dual-scale topology optoelectronic processor.

    Science.gov (United States)

    Marsden, G C; Krishnamoorthy, A V; Esener, S C; Lee, S H

    1991-12-15

    The dual-scale topology optoelectronic processor (D-STOP) is a parallel optoelectronic architecture for matrix algebraic processing. The architecture can be used for matrix-vector multiplication and two types of vector outer product. The computations are performed electronically, which allows multiplication and summation concepts in linear algebra to be generalized to various nonlinear or symbolic operations. This generalization permits the application of D-STOP to many computational problems. The architecture uses a minimum number of optical transmitters, which thereby reduces fabrication requirements while maintaining area-efficient electronics. The necessary optical interconnections are space invariant, minimizing space-bandwidth requirements.

  8. A dual-unit pressure sensor for on-chip self-compensation of zero-point temperature drift

    International Nuclear Information System (INIS)

    Wang, Jiachou; Li, Xinxin

    2014-01-01

    A novel dual-unit piezoresistive pressure sensor, consisting of a sensing unit and a dummy unit, is proposed and developed for on-chip self-compensation of zero-point temperature drift. With an MIS (microholes inter-etch and sealing) process implemented only from the front side of single (1 1 1) silicon wafers, a pressure-sensitive unit and an identically structured, pressure-insensitive dummy unit are compactly integrated on-chip to eliminate the zero-point temperature drift induced by unbalance factors, through mutual compensation between the two units. Besides, both units are physically suspended from the silicon substrate to further suppress packaging-stress-induced temperature drift. A simultaneously processed ventilation hole-channel structure is connected to the pressure reference cavity of the dummy unit to make it insensitive to the detected pressure. In spite of the additional dummy unit, the sensor chip dimensions are still as small as 1.2 mm × 1.2 mm × 0.4 mm. The proposed dual-unit sensor is fabricated and tested, with a measured sensitivity of 0.104 mV kPa−1 3.3 V−1, nonlinearity of less than 0.08% FSO and overall accuracy error of ±0.18% FSO. Without using any extra compensation method, the sensor features an ultra-low temperature coefficient of offset (TCO) of 0.002% °C−1 FSO, which is much better than the performance of conventional pressure sensors. The highly stable and small-sized sensors are promising for low-cost production and applications. (paper)

  9. CUDA/GPU Technology : Parallel Programming For High Performance Scientific Computing

    OpenAIRE

    YUHENDRA; KUZE, Hiroaki; JOSAPHAT, Tetuko Sri Sumantyo

    2009-01-01

    Graphics processing units (GPUs), originally designed for computer video cards, have emerged as the most powerful chip in a high-performance workstation. In terms of high-performance computation capabilities, GPUs deliver far more powerful performance than conventional CPUs by means of parallel processing. In 2007, the birth of the Compute Unified Device Architecture (CUDA) and CUDA-enabled GPUs by NVIDIA Corporation brought a revolution in general-purpose GPU a...

  10. Efficient parallel implementation of active appearance model fitting algorithm on GPU.

    Science.gov (United States)

    Wang, Jinwei; Ma, Xirong; Zhu, Yuanping; Sun, Jizhou

    2014-01-01

    The active appearance model (AAM) is one of the most powerful model-based object detection and tracking methods and has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs), which feature a many-core, fine-grained parallel architecture, provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine-grained parallelism, in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experimental results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures.

  11. The architecture design of a 2mW 18-bit high speed weight voltage type DAC based on dual weight resistance chain

    Science.gov (United States)

    Qixing, Chen; Qiyu, Luo

    2013-03-01

    At present, the architecture of a digital-to-analog converter (DAC) is in essence based on the weight current, and the average value of its D/A signal current increases geometrically as the number of digital signal bits increases, reaching 2^(n-1) times its least weight current. But for a dual weight resistance chain type DAC, which uses the weight voltage manner for D/A conversion, the D/A signal current is fixed to the chain current I_cha; it is only of order 1/2^(n-1) of the average signal current of the weight current type DAC. Its principle is: n pairs of dual weight resistances form a resistance chain, which ensures the constancy of the chain current; if digital signals control the total weight resistance from the output point to the zero-potential point, they directly control the total weight voltage at the output point, so that the digital signals directly turn into a sum of weight voltage signals. Thus the following goals are realized: (1) the total current is less than 200 μA; (2) the total power consumption is less than 2 mW; (3) an 18-bit conversion can be realized by adopting a multi-grade structure; (4) the chip area is one order of magnitude smaller than that of the subsection current-steering type DAC; (5) the error depends only on the error of the unit resistance, so it is smaller than the error of the subsection current-steering type DAC; (6) the conversion time is only one switch on/off action time, so its speed is not lower than that of present DACs.

  12. The architecture design of a 2mW 18-bit high speed weight voltage type DAC based on dual weight resistance chain

    International Nuclear Information System (INIS)

    Chen Qixing; Luo Qiyu

    2013-01-01

    At present, the architecture of a digital-to-analog converter (DAC) is in essence based on the weight current, and the average value of its D/A signal current increases geometrically as the number of digital signal bits increases, reaching 2^(n-1) times its least weight current. But for a dual weight resistance chain type DAC, which uses the weight voltage manner for D/A conversion, the D/A signal current is fixed to the chain current I_cha; it is only of order 1/2^(n-1) of the average signal current of the weight current type DAC. Its principle is: n pairs of dual weight resistances form a resistance chain, which ensures the constancy of the chain current; if digital signals control the total weight resistance from the output point to the zero-potential point, they directly control the total weight voltage at the output point, so that the digital signals directly turn into a sum of weight voltage signals. Thus the following goals are realized: (1) the total current is less than 200 μA; (2) the total power consumption is less than 2 mW; (3) an 18-bit conversion can be realized by adopting a multi-grade structure; (4) the chip area is one order of magnitude smaller than that of the subsection current-steering type DAC; (5) the error depends only on the error of the unit resistance, so it is smaller than the error of the subsection current-steering type DAC; (6) the conversion time is only one switch on/off action time, so its speed is not lower than that of present DACs. (semiconductor integrated circuits)
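
    The weight-voltage principle can be illustrated with an idealized numerical model: the digital code selects binary-weighted resistances in the chain, and the constant chain current converts the selected total resistance into the output voltage. Component values below are assumptions for illustration, not the paper's sizing, and switch resistance is ignored.

    def dac_output(code, bits=18, r_unit=1.0, i_cha=200e-6):
        """Idealized weight-voltage DAC: bit i of the code switches in a
        resistance of 2^i * r_unit, and V_out = I_cha * R_total(code)."""
        r_total = sum(((code >> i) & 1) * (2 ** i) * r_unit
                      for i in range(bits))
        return i_cha * r_total

    print(dac_output(1))               # one LSB step: I_cha * R_unit
    print(dac_output((1 << 18) - 1))   # full scale: I_cha * (2^18 - 1) * R_unit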

  13. Porting of the transfer-matrix method for multilayer thin-film computations on graphics processing units

    Science.gov (United States)

    Limmer, Steffen; Fey, Dietmar

    2013-07-01

    Thin-film computations are often a time-consuming task during optical design. An efficient way to accelerate these computations with the help of graphics processing units (GPUs) is described. It turned out that significant speed-ups can be achieved. We investigate the circumstances under which the best speed-up values can be expected. Therefore we compare different GPUs among themselves and with a modern CPU. Furthermore, the effect of thickness modulation on the speed-up and the runtime behavior depending on the input data is examined.
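
    The transfer-matrix method itself reduces to a product of 2x2 characteristic matrices, one per layer. The following normal-incidence sketch is a textbook formulation in NumPy, not the authors' GPU port; the layer values in the example are illustrative.

    import numpy as np

    def reflectance(n_layers, d_layers, n_in, n_sub, wavelength):
        """Normal-incidence reflectance of a thin-film stack via the
        transfer-matrix method: one 2x2 characteristic matrix per layer."""
        M = np.eye(2, dtype=complex)
        for n, d in zip(n_layers, d_layers):
            delta = 2 * np.pi * n * d / wavelength   # phase thickness
            M = M @ np.array([[np.cos(delta), 1j * np.sin(delta) / n],
                              [1j * n * np.sin(delta), np.cos(delta)]])
        B, C = M @ np.array([1.0, n_sub])
        r = (n_in * B - C) / (n_in * B + C)
        return abs(r) ** 2

    # quarter-wave MgF2-like layer on glass at 550 nm (illustrative values);
    # the result is ~1.3%, the familiar single-layer AR-coating reflectance
    print(reflectance([1.38], [550 / (4 * 1.38)], 1.0, 1.52, 550.0))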

  14. Auto‐tuning of level 1 and level 2 BLAS for GPUs

    DEFF Research Database (Denmark)

    Sørensen, Hans Henrik Brandenborg

    2013-01-01

    ...The target hardware is the most recent Nvidia (Santa Clara, CA, USA) Tesla 20-series (Fermi architecture), which is designed from the ground up for high-performance computing. We show that it is essentially a matter of fully utilizing the fine-grained parallelism of the many-core graphics processing unit...

  15. Architecture of vagal motor units controlling striated muscle of esophagus: peripheral elements patterning peristalsis?

    Science.gov (United States)

    Powley, Terry L; Mittal, Ravinder K; Baronowsky, Elizabeth A; Hudson, Cherie N; Martin, Felecia N; McAdams, Jennifer L; Mason, Jacqueline K; Phillips, Robert J

    2013-12-01

    Little is known about the architecture of the vagal motor units that control esophageal striated muscle, in spite of the fact that these units are necessary, and responsible, for peristalsis. The present experiment was designed to characterize the motor neuron projection fields and terminal arbors forming esophageal motor units. Nucleus ambiguus compact formation neurons of the rat were labeled by bilateral intracranial injections of the anterograde tracer dextran biotin. After tracer transport, thoracic and abdominal esophagi were removed and prepared as whole mounts of muscle wall without mucosa or submucosa. Labeled terminal arbors of individual vagal motor neurons (n=78) in the esophageal wall were inventoried, digitized and analyzed morphometrically. The size of individual vagal motor units innervating striated muscle, throughout thoracic and abdominal esophagus, averaged 52 endplates per motor neuron, a value indicative of fine motor control. A majority (77%) of the motor terminal arbors also issued one or more collateral branches that contacted neurons, including nitric oxide synthase-positive neurons, of local myenteric ganglia. Individual motor neuron terminal arbors co-innervated, or supplied endplates in tandem to, both longitudinal and circular muscle fibers in roughly similar proportions (i.e., two endplates to longitudinal for every three endplates to circular fibers). Both the observation that vagal motor unit collaterals project to myenteric ganglia and the fact that individual motor units co-innervate longitudinal and circular muscle layers are consistent with the hypothesis that elements contributing to peristaltic programming inhere, or are "hardwired," in the peripheral architecture of esophageal motor units. © 2013.

  16. Graphics Processing Unit-Enhanced Genetic Algorithms for Solving the Temporal Dynamics of Gene Regulatory Networks.

    Science.gov (United States)

    García-Calvo, Raúl; Guisado, J L; Diaz-Del-Rio, Fernando; Córdoba, Antonio; Jiménez-Morales, Francisco

    2018-01-01

    Understanding the regulation of gene expression is one of the key problems in current biology. A promising method for that purpose is the determination of the temporal dynamics between known initial and ending network states, by using simple acting rules. The huge number of rule combinations and the inherently nonlinear nature of the problem make genetic algorithms an excellent candidate for finding optimal solutions. As this is a computationally intensive problem that needs long runtimes in conventional architectures for realistic network sizes, it is fundamental to accelerate this task. In this article, we study how to develop efficient parallel implementations of this method for the fine-grained parallel architecture of graphics processing units (GPUs) using the compute unified device architecture (CUDA) platform. An exhaustive and methodical study of various parallel genetic algorithm schemes (master-slave, island, cellular, and hybrid models) and various individual selection methods (roulette, elitist) is carried out for this problem. Several procedures that optimize the use of the GPU's resources are presented. We conclude that the implementation that produces better results (both from the performance and the genetic algorithm fitness perspectives) is simulating a few thousand individuals grouped in a few islands using elitist selection. This model combines two key factors for discovering the best solutions: finding good individuals in a small number of generations, and introducing genetic diversity via relatively frequent and numerous migrations. As a result, we have even found the optimal solution for the analyzed gene regulatory network (GRN). In addition, a comparative study of the performance obtained by the different parallel implementations on GPU versus a sequential application on CPU is carried out. In our tests, a multifold speedup was obtained for our optimized parallel implementation of the method on a medium-class GPU over an equivalent
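
    An island model with elitist selection and periodic migration can be sketched as follows; population sizes, rates, and the toy objective are illustrative assumptions rather than the article's tuned settings, and a real GPU version would map islands and individuals onto CUDA blocks and threads.

    import random

    def island_ga(fitness, n_islands=4, pop=50, genes=32, gens=200,
                  migrate_every=10, n_migrants=2, mut=0.02):
        """Minimal island-model genetic algorithm with elitist selection
        and ring-topology migration; all sizes/rates are illustrative."""
        islands = [[[random.randint(0, 1) for _ in range(genes)]
                    for _ in range(pop)] for _ in range(n_islands)]
        for gen in range(gens):
            for isl in islands:
                isl.sort(key=fitness, reverse=True)
                elite = isl[: pop // 2]                 # elitist selection
                children = []
                while len(elite) + len(children) < pop:
                    a, b = random.sample(elite, 2)
                    cut = random.randrange(1, genes)
                    child = a[:cut] + b[cut:]           # one-point crossover
                    children.append([g ^ (random.random() < mut)
                                     for g in child])   # bit-flip mutation
                isl[:] = elite + children
            if gen % migrate_every == 0:                # ring migration
                for i, isl in enumerate(islands):
                    dest = islands[(i + 1) % n_islands]
                    dest[-n_migrants:] = [list(x) for x in isl[:n_migrants]]
        return max((max(isl, key=fitness) for isl in islands), key=fitness)

    best = island_ga(fitness=sum)   # toy objective: maximize number of 1-bits
    print(sum(best))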

  17. Optimizing strassen matrix multiply on GPUs

    KAUST Repository

    ul Hasan Khan, Ayaz; Al-Mouhamed, Mayez; Fatayer, Allam

    2015-01-01

    © 2015 IEEE. Many-core systems are basically designed for applications having large data parallelism. Strassen Matrix Multiply (MM) can be formulated as a depth-first (DFS) traversal of a recursion tree where all cores work in parallel on computing each of the NxN sub-matrices, which reduces storage at the cost of large data motion to gather and aggregate the results. We propose Strassen and Winograd algorithms (S-MM and W-MM) based on three optimizations: a set of basic algebra functions to reduce overhead, invoking an efficient library (CUBLAS 5.5), and parameter tuning of parametric kernels to improve resource occupancy. On GPUs, W-MM and S-MM with one recursion level outperform the CUBLAS 5.5 library, running up to twice as fast for large arrays satisfying N>=2048 and N>=3072, respectively. Compared to the NVIDIA SDK library, S-MM and W-MM achieved a speedup between 20x and 80x for the above arrays. The proposed approach can be used to enhance the performance of the CUBLAS and MKL libraries.
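
    One recursion level of Strassen replaces eight half-size products with seven; in the paper's setting those seven products would be library GEMM calls (CUBLAS), for which np.matmul stands in below. A short sketch assuming even matrix dimensions:

    import numpy as np

    def strassen_one_level(A, B):
        """One level of Strassen's algorithm: 7 half-size products instead
        of 8, at the cost of extra additions and temporary storage."""
        n = A.shape[0] // 2
        A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
        B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
        M1 = (A11 + A22) @ (B11 + B22)
        M2 = (A21 + A22) @ B11
        M3 = A11 @ (B12 - B22)
        M4 = A22 @ (B21 - B11)
        M5 = (A11 + A12) @ B22
        M6 = (A21 - A11) @ (B11 + B12)
        M7 = (A12 - A22) @ (B21 + B22)
        return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                         [M2 + M4, M1 - M2 + M3 + M6]])

    A = np.random.rand(256, 256); B = np.random.rand(256, 256)
    assert np.allclose(strassen_one_level(A, B), A @ B)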

  18. Optimizing strassen matrix multiply on GPUs

    KAUST Repository

    ul Hasan Khan, Ayaz

    2015-06-01

    © 2015 IEEE. Many-core systems are basically designed for applications having large data parallelism. Strassen Matrix Multiply (MM) can be formulated as a depth-first (DFS) traversal of a recursion tree where all cores work in parallel on computing each of the NxN sub-matrices, which reduces storage at the cost of large data motion to gather and aggregate the results. We propose Strassen and Winograd algorithms (S-MM and W-MM) based on three optimizations: a set of basic algebra functions to reduce overhead, invoking an efficient library (CUBLAS 5.5), and parameter tuning of parametric kernels to improve resource occupancy. On GPUs, W-MM and S-MM with one recursion level outperform the CUBLAS 5.5 library, running up to twice as fast for large arrays satisfying N>=2048 and N>=3072, respectively. Compared to the NVIDIA SDK library, S-MM and W-MM achieved a speedup between 20x and 80x for the above arrays. The proposed approach can be used to enhance the performance of the CUBLAS and MKL libraries.

  19. Dual-Stack Single-Radio Communication Architecture for UAV Acting As a Mobile Node to Collect Data in WSNs.

    Science.gov (United States)

    Sayyed, Ali; de Araújo, Gustavo Medeiros; Bodanese, João Paulo; Becker, Leandro Buss

    2015-09-16

    The use of mobile nodes to collect data in a Wireless Sensor Network (WSN) has gained special attention over the last years. Some researchers explore the use of Unmanned Aerial Vehicles (UAVs) as mobile nodes for such data-collection purposes. Analyzing these works, it is apparent that mobile nodes used in such scenarios are typically equipped with at least two different radio interfaces. The present work presents a Dual-Stack Single-Radio Communication Architecture (DSSRCA), which allows a UAV to communicate in a bidirectional manner with a WSN and a Sink node. The proposed architecture was specifically designed to support different network QoS requirements, such as best-effort and more reliable communications, attending both UAV-to-WSN and UAV-to-Sink communication needs. DSSRCA was implemented and tested on a real UAV, as detailed in this paper. This paper also includes a simulation analysis that addresses bandwidth consumption in an environmental monitoring application scenario. It includes an analysis of the data gathering rate that can be achieved considering different UAV flight speeds. Obtained results show the viability of using a single radio transmitter for collecting data from the WSN and forwarding such data to the Sink node.

  20. Real-time autocorrelator for fluorescence correlation spectroscopy based on graphical-processor-unit architecture: method, implementation, and comparative studies

    Science.gov (United States)

    Laracuente, Nicholas; Grossman, Carl

    2013-03-01

    We developed an algorithm and software to calculate autocorrelation functions from real-time photon-counting data using the fast, parallel capabilities of graphical processor units (GPUs). Recent developments in hardware and software have allowed for general-purpose computing with inexpensive GPU hardware. These devices are better suited to emulating hardware autocorrelators than traditional CPU-based software applications, as they emphasize parallel throughput over sequential speed. Incoming data are binned in a standard multi-tau scheme with configurable points-per-bin size and are mapped into a GPU memory pattern to reduce time-expensive memory accesses. Applications include dynamic light scattering (DLS) and fluorescence correlation spectroscopy (FCS) experiments. We ran the software on a 64-core graphics PCI card in a computer with a 3.2 GHz Intel i5 CPU running Linux. FCS measurements were made on Alexa-546 and Texas Red dyes in a standard buffer (PBS). Software correlations were compared to hardware correlator measurements on the same signals. Supported by HHMI and Swarthmore College.
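
    The standard multi-tau scheme computes short linear lags at full resolution and then coarsens the signal by pairwise averaging before re-correlating, yielding quasi-logarithmic lag spacing. A minimal CPU-side sketch, not the GPU implementation; the points-per-bin and level count below are illustrative choices.

    import numpy as np

    def multi_tau(signal, m=16, levels=5):
        """Unnormalized multi-tau autocorrelation: m linear lags at each
        level, with the signal coarsened by a factor of 2 between levels."""
        taus, acf = [], []
        x = np.asarray(signal, dtype=float)
        for level in range(levels):
            start = 1 if level == 0 else m // 2 + 1
            for k in range(start, m + 1):
                if k >= len(x):
                    break
                taus.append(k * 2 ** level)       # lag in base samples
                acf.append(np.mean(x[:-k] * x[k:]))
            # coarsen by averaging adjacent pairs of bins
            x = 0.5 * (x[0::2][: len(x) // 2] + x[1::2])
        return np.array(taus), np.array(acf)

    rng = np.random.default_rng(1)
    t, g = multi_tau(rng.poisson(5, 1 << 16))   # synthetic photon counts
    print(t[:5], g[:5])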

  1. Could the peristaltic transition zone be caused by non-uniform esophageal muscle fiber architecture? A simulation study.

    Science.gov (United States)

    Kou, W; Pandolfino, J E; Kahrilas, P J; Patankar, N A

    2017-06-01

    Based on a fully coupled computational model of esophageal transport, we analyzed how varied esophageal muscle fiber architecture and/or dual contraction waves (CWs) affect bolus transport. Specifically, we studied the luminal pressure profile in those cases to better understand possible origins of the peristaltic transition zone. Two groups of studies were conducted using a computational model. The first studied esophageal transport with circumferential-longitudinal fiber architecture, helical fiber architecture and various combinations of the two. In the second group, cases with dual CWs and varied muscle fiber architecture were simulated. Overall transport characteristics were examined and the space-time profiles of luminal pressure were plotted and compared. Helical muscle fiber architecture featured reduced circumferential wall stress, greater esophageal distensibility, and greater axial shortening. Non-uniform fiber architecture featured a peristaltic pressure trough between two high-pressure segments. The distal pressure segment showed greater amplitude than the proximal segment, consistent with experimental data. Dual CWs also featured a pressure trough between two high-pressure segments. However, the minimum pressure in the region of overlap was much lower, and the amplitudes of the two high-pressure segments were similar. The efficacy of esophageal transport is greatly affected by muscle fiber architecture. The peristaltic transition zone may be attributable to non-uniform architecture of muscle fibers along the length of the esophagus and/or dual CWs. The difference in amplitude between the proximal and distal pressure segments may be attributable to non-uniform muscle fiber architecture. © 2017 John Wiley & Sons Ltd.

  2. A 1.5 GFLOPS Reciprocal Unit for Computer Graphics

    DEFF Research Database (Denmark)

    Nannarelli, Alberto; Rasmussen, Morten Sleth; Stuart, Matthias Bo

    2006-01-01

    The reciprocal operation 1/d is a frequent operation performed in graphics processors (GPUs). In this work, we present the design of a radix-16 reciprocal unit based on an algorithm combining the traditional digit-by-digit algorithm and the approximation of the reciprocal by one Newton-Raphson iteration.
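
    The refinement step is the classic Newton-Raphson recurrence x1 = x0(2 - d*x0), which roughly doubles the number of correct bits of an initial estimate. A sketch with an assumed table-based first approximation, not the radix-16 datapath itself:

    def reciprocal(d, table_bits=8):
        """Approximate 1/d for d in [1, 2): a small lookup table supplies an
        initial estimate (standing in for the digit-by-digit stage) and one
        Newton-Raphson step refines it. Table size is an assumed choice."""
        assert 1.0 <= d < 2.0
        index = int((d - 1.0) * (1 << table_bits))            # truncate d
        x0 = 1.0 / (1.0 + (index + 0.5) / (1 << table_bits))  # midpoint entry
        return x0 * (2.0 - d * x0)                            # one NR iteration

    d = 1.2345
    print(reciprocal(d), 1 / d)   # agree to several decimal digits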

  3. Coupling SIMD and SIMT architectures to boost performance of a phylogeny-aware alignment kernel

    Directory of Open Access Journals (Sweden)

    Alachiotis Nikolaos

    2012-08-01

    Full Text Available Abstract Background Aligning short DNA reads to a reference sequence alignment is a prerequisite for detecting their biological origin and analyzing them in a phylogenetic context. With the PaPaRa tool we introduced a dedicated dynamic programming algorithm for simultaneously aligning short reads to reference alignments and corresponding evolutionary reference trees. The algorithm aligns short reads to phylogenetic profiles that correspond to the branches of such a reference tree. The algorithm needs to perform an immense number of pairwise alignments. Therefore, we explore vector intrinsics and GPUs to accelerate the PaPaRa alignment kernel. Results We optimized and parallelized PaPaRa on CPUs and GPUs. Via SSE 4.1 SIMD (Single Instruction, Multiple Data intrinsics for x86 SIMD architectures and multi-threading, we obtained a 9-fold acceleration on a single core as well as linear speedups with respect to the number of cores. The peak CPU performance amounts to 18.1 GCUPS (Giga Cell Updates per Second using all four physical cores on an Intel i7 2600 CPU running at 3.4 GHz. The average CPU performance (averaged over all test runs is 12.33 GCUPS. We also used OpenCL to execute PaPaRa on a GPU SIMT (Single Instruction, Multiple Threads architecture. A NVIDIA GeForce 560 GPU delivered peak and average performance of 22.1 and 18.4 GCUPS respectively. Finally, we combined the SIMD and SIMT implementations into a hybrid CPU-GPU system that achieved an accumulated peak performance of 33.8 GCUPS. Conclusions This accelerated version of PaPaRa (available at http://www.exelixis-lab.org/software.html provides a significant performance improvement that allows for analyzing larger datasets in less time. We observe that state-of-the-art SIMD and SIMT architectures deliver comparable performance for this dynamic programming kernel when the “competing programmer approach” is deployed. Finally, we show that overall performance can be substantially increased

  4. Monte Carlo methods for neutron transport on graphics processing units using Cuda - 015

    International Nuclear Information System (INIS)

    Nelson, A.G.; Ivanov, K.N.

    2010-01-01

    This work examined the feasibility of utilizing Graphics Processing Units (GPUs) to accelerate Monte Carlo neutron transport simulations. First, a clean-sheet MC code was written in C++ for an x86 CPU and later ported to run on GPUs using NVIDIA's CUDA programming language. After further optimization, the GPU ran 21 times faster than the CPU code when using single-precision floating point math. This can be further increased with no additional effort if accuracy is sacrificed for speed: using a compiler flag, the speedup was increased to 22x. Further, if double-precision floating point math is desired for neutron tracking through the geometry, a speedup of 11x was obtained. The GPUs have proven to be useful in this study, but the current generation does have limitations: the maximum memory currently available on a single GPU is only 4 GB; the GPU RAM does not provide error-checking and correction; and the optimization required for large speedups can lead to confusing code. (authors)
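
    The per-history loop that such codes port to CUDA can be illustrated with a toy 1-D slab problem; the cross sections, geometry, and physics below are made-up illustrative values, not the study's code.

    import math, random

    def slab_transmission(histories=100_000, thickness=2.0,
                          sigma_t=1.0, absorb_prob=0.3):
        """Toy history-based Monte Carlo: 1-D slab, isotropic scattering.
        Each history is followed from birth to absorption, leakage, or
        transmission; a real code tracks energy, 3-D geometry, and tallies."""
        transmitted = 0
        for _ in range(histories):
            x, mu = 0.0, 1.0                  # position, direction cosine
            while True:
                # sample a free flight from the exponential distribution
                x += mu * (-math.log(1.0 - random.random()) / sigma_t)
                if x >= thickness:
                    transmitted += 1
                    break
                if x < 0.0 or random.random() < absorb_prob:
                    break                      # leaked back or absorbed
                mu = 2.0 * random.random() - 1.0   # isotropic re-scatter
        return transmitted / histories

    print(slab_transmission())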

  5. A high throughput data acquisition and processing model for applications based on GPUs

    International Nuclear Information System (INIS)

    Nieto, J.; Arcas, G. de; Ruiz, M.; Castro, R.; Vega, J.; Guillen, P.

    2015-01-01

    Highlights: • Implementation of a direct communication path between a data acquisition NI FlexRIO device and an NVIDIA GPU device. • Customization of a Linux kernel open driver (NI FlexRIO) and a C API interface to work with NVIDIA RDMA for GPUDirect. • Performance evaluation with respect to the traditional model that uses the CPU for buffer data allocation. - Abstract: There is an increasing interest in the use of GPU technologies for real-time analysis in fusion devices. The availability of high-bandwidth interfaces has made them a very cost-effective alternative, not only for high-volume data analysis or simulation, and commercial products are available for some areas of interest. However, from the point of view of their application in real-time scenarios, there are still some issues under analysis, such as the possibility of improving the data throughput inside a discrete system consisting of data acquisition (DAQ) devices and GPUs. This paper addresses the possibility of using peer-to-peer data communication between DAQ devices and GPUs sharing the same PCI Express bus to implement continuous real-time acquisition and processing systems in which data transfers require minimum CPU intervention. This technology eliminates unnecessary system memory copies and lowers CPU overhead, avoiding a bottleneck when the system uses the main system memory.

  6. A high throughput data acquisition and processing model for applications based on GPUs

    Energy Technology Data Exchange (ETDEWEB)

    Nieto, J., E-mail: jnieto@sec.upm.es [Instrumentation and Applied Acoustic Research Group, Technical University of Madrid (UPM), Madrid (Spain); Arcas, G. de; Ruiz, M. [Instrumentation and Applied Acoustic Research Group, Technical University of Madrid (UPM), Madrid (Spain); Castro, R.; Vega, J. [Data acquisition Group EURATOM/CIEMAT Association for Fusion, Madrid (Spain); Guillen, P. [Instrumentation and Applied Acoustic Research Group, Technical University of Madrid (UPM), Madrid (Spain)

    2015-10-15

    Highlights: • Implementation of a direct communication path between a data acquisition NI FlexRIO device and an NVIDIA GPU device. • Customization of a Linux kernel open driver (NI FlexRIO) and a C API interface to work with NVIDIA RDMA for GPUDirect. • Performance evaluation with respect to the traditional model that uses the CPU for buffer data allocation. - Abstract: There is an increasing interest in the use of GPU technologies for real-time analysis in fusion devices. The availability of high-bandwidth interfaces has made them a very cost-effective alternative, not only for high-volume data analysis or simulation, and commercial products are available for some areas of interest. However, from the point of view of their application in real-time scenarios, there are still some issues under analysis, such as the possibility of improving the data throughput inside a discrete system consisting of data acquisition (DAQ) devices and GPUs. This paper addresses the possibility of using peer-to-peer data communication between DAQ devices and GPUs sharing the same PCI Express bus to implement continuous real-time acquisition and processing systems in which data transfers require minimum CPU intervention. This technology eliminates unnecessary system memory copies and lowers CPU overhead, avoiding a bottleneck when the system uses the main system memory.

  7. Statistical significance estimation of a signal within the GooFit framework on GPUs

    Directory of Open Access Journals (Sweden)

    Cristella Leonardo

    2017-01-01

    Full Text Available In order to test the computing capabilities of GPUs with respect to traditional CPU cores, a high-statistics toy Monte Carlo technique has been implemented in both the ROOT/RooFit and GooFit frameworks with the purpose of estimating the statistical significance of the structure observed by CMS close to the kinematical boundary of the J/ψϕ invariant mass in the three-body decay B+ → J/ψϕK+. GooFit is an open data-analysis tool under development that interfaces ROOT/RooFit to the CUDA platform on nVidia GPUs. The optimized GooFit application running on GPUs hosted by servers in the Bari Tier2 provides striking speed-up performance with respect to the RooFit application parallelized on multiple CPUs by means of the PROOF-Lite tool. The considerable resulting speed-up, evident when comparing concurrent GooFit processes allowed by the CUDA Multi Process Service with a RooFit/PROOF-Lite process with multiple CPU workers, is presented and discussed in detail. By means of GooFit it has also been possible to explore the behaviour of a likelihood ratio test statistic in different situations in which the Wilks theorem may or may not apply because its regularity conditions are not satisfied.

  8. Fault-tolerant architecture: Evaluation methodology

    International Nuclear Information System (INIS)

    Battle, R.E.; Kisner, R.A.

    1992-08-01

    The design and reliability of four fault-tolerant architectures that may be used in nuclear power plant control systems were evaluated. Two architectures are variations of triple-modular-redundant (TMR) systems, and two are variations of dual redundant systems. The evaluation includes a review of methods of implementing fault-tolerant control, the importance of automatic recovery from failures, methods of self-testing diagnostics, block diagrams of typical fault-tolerant controllers, review of fault-tolerant controllers operating in nuclear power plants, and fault tree reliability analyses of fault-tolerant systems

  9. GRAPHICS PROCESSING UNITS: MORE THAN THE PATHWAY TO REALISTIC VIDEO-GAMES

    Directory of Open Access Journals (Sweden)

    CARLOS TRUJILLO

    2011-01-01

    Full Text Available The large video-game market has driven rapid progress in hardware and software aimed at achieving ever more realistic gaming environments. Among these developments are graphics processing units (GPUs), whose purpose is to free the central processing unit (CPU) from the elaborate computations that give "life" to video games. To achieve this, GPUs are equipped with multiple processing cores operating in parallel, which makes it possible to use them for tasks far more diverse than video-game development. This article presents a brief description of the features of the compute unified device architecture (CUDA™), a parallel computing architecture for GPUs. An application of this architecture to the numerical reconstruction of holograms is presented, for which an 11X speedup is reported with respect to the performance achieved on a CPU.

  10. Implementing Non Power-of-Two FFTs on Coarse-Grain Reconfigurable Architectures

    NARCIS (Netherlands)

    Rivaton, Arnaud; Quevremont, Jérôme; Zhang, Q.; Wolkotte, P.T.; Smit, Gerardus Johannes Maria; Nurmi, J.; Takala, J.; Hamalainen, T.D.

    2005-01-01

    To improve the power figures of a dual ARM9 RISC core architecture targeting low-power digital broadcasting applications, the addition of a coarse-grain architecture is considered. This paper introduces two of these structures: PACT's XPP technology and the Montium, developed by the University of Twente.

  11. Coherent laser radar with dual-frequency Doppler estimation and interferometric range detection

    NARCIS (Netherlands)

    Onori, D.; Scotti, F.; Laghezza, F.; Scaffardi, M.; Bogoni, A.

    2016-01-01

    The concept of a coherent interferometric dual frequency laser radar, that measures both the target range and velocity, is presented and experimentally demonstrated. The innovative architecture combines the dual frequency lidar concept, allowing a precise and robust Doppler estimation, with the

  12. A comparative study of history-based versus vectorized Monte Carlo methods in the GPU/CUDA environment for a simple neutron eigenvalue problem

    International Nuclear Information System (INIS)

    Liu, T.; Du, X.; Ji, W.; Xu, G.; Brown, F.B.

    2013-01-01

    For nuclear reactor analysis such as the neutron eigenvalue calculations, the time consuming Monte Carlo (MC) simulations can be accelerated by using graphics processing units (GPUs). However, traditional MC methods are often history-based, and their performance on GPUs is affected significantly by the thread divergence problem. In this paper we describe the development of a newly designed event-based vectorized MC algorithm for solving the neutron eigenvalue problem. The code was implemented using NVIDIA's Compute Unified Device Architecture (CUDA), and tested on a NVIDIA Tesla M2090 GPU card. We found that although the vectorized MC algorithm greatly reduces the occurrence of thread divergence thus enhancing the warp execution efficiency, the overall simulation speed is roughly ten times slower than the history-based MC code on GPUs. Profiling results suggest that the slow speed is probably due to the memory access latency caused by the large amount of global memory transactions. Possible solutions to improve the code efficiency are discussed. (authors)
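
    The event-based restructuring can be sketched by advancing all still-active particles through one event at a time with boolean masks, instead of branching per history; the toy slab physics below is illustrative and not the authors' eigenvalue solver.

    import numpy as np

    def slab_transmission_vectorized(n=100_000, thickness=2.0,
                                     sigma_t=1.0, absorb_prob=0.3, seed=0):
        """Event-based/vectorized toy slab problem: every iteration applies
        one event (fly, then absorb-or-scatter) to all active particles at
        once via masks, the restructuring that reduces thread divergence."""
        rng = np.random.default_rng(seed)
        x = np.zeros(n); mu = np.ones(n)
        active = np.ones(n, dtype=bool)
        transmitted = 0
        while active.any():
            idx = np.flatnonzero(active)
            # event 1: fly an exponentially distributed path length
            x[idx] += mu[idx] * rng.exponential(1.0 / sigma_t, idx.size)
            out = x[idx] >= thickness
            transmitted += int(out.sum())
            # event 2: retire transmitted/leaked/absorbed particles
            dead = out | (x[idx] < 0.0) | (rng.random(idx.size) < absorb_prob)
            active[idx[dead]] = False
            live = idx[~dead]
            mu[live] = rng.uniform(-1.0, 1.0, live.size)  # isotropic re-scatter
        return transmitted / n

    print(slab_transmission_vectorized())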

  13. GPU-accelerated 3-D model-based tracking

    International Nuclear Information System (INIS)

    Brown, J Anthony; Capson, David W

    2010-01-01

    Model-based approaches to tracking the pose of a 3-D object in video are effective but computationally demanding. While statistical estimation techniques, such as the particle filter, are often employed to minimize the search space, real-time performance remains unachievable on current generation CPUs. Recent advances in graphics processing units (GPUs) have brought massively parallel computational power to the desktop environment and powerful developer tools, such as NVIDIA Compute Unified Device Architecture (CUDA), have provided programmers with a mechanism to exploit it. NVIDIA GPUs' single-instruction multiple-thread (SIMT) programming model is well-suited to many computer vision tasks, particularly model-based tracking, which requires several hundred 3-D model poses to be dynamically configured, rendered, and evaluated against each frame in the video sequence. Using 6 degree-of-freedom (DOF) rigid hand tracking as an example application, this work harnesses consumer-grade GPUs to achieve real-time, 3-D model-based, markerless object tracking in monocular video.

  14. A comparative study of history-based versus vectorized Monte Carlo methods in the GPU/CUDA environment for a simple neutron eigenvalue problem

    Science.gov (United States)

    Liu, Tianyu; Du, Xining; Ji, Wei; Xu, X. George; Brown, Forrest B.

    2014-06-01

    For nuclear reactor analysis such as the neutron eigenvalue calculations, the time consuming Monte Carlo (MC) simulations can be accelerated by using graphics processing units (GPUs). However, traditional MC methods are often history-based, and their performance on GPUs is affected significantly by the thread divergence problem. In this paper we describe the development of a newly designed event-based vectorized MC algorithm for solving the neutron eigenvalue problem. The code was implemented using NVIDIA's Compute Unified Device Architecture (CUDA), and tested on a NVIDIA Tesla M2090 GPU card. We found that although the vectorized MC algorithm greatly reduces the occurrence of thread divergence thus enhancing the warp execution efficiency, the overall simulation speed is roughly ten times slower than the history-based MC code on GPUs. Profiling results suggest that the slow speed is probably due to the memory access latency caused by the large amount of global memory transactions. Possible solutions to improve the code efficiency are discussed.

15. A Study of the Effects of Architectural Marketing Capabilities on Marketing Unit Performance Based on Morgan et al. Case: Past Industry in Tehran

    OpenAIRE

    Mohammad Reza Dalvi; Robabe Seifi

    2014-01-01

Over time, combinations of knowledge and skills develop into architectural marketing capabilities. These architectural marketing capabilities have been identified as one of the important ways firms can achieve a competitive advantage. The following research tests the effects of architectural marketing capabilities on marketing unit performance. Based on a survey, a structural equation model was developed to test our hypotheses. The study develops a structural model linking arch...

  16. A Parallel Algebraic Multigrid Solver on Graphics Processing Units

    KAUST Repository

    Haase, Gundolf

    2010-01-01

The paper presents a multi-GPU implementation of the preconditioned conjugate gradient algorithm with an algebraic multigrid preconditioner (PCG-AMG) for an elliptic model problem on a 3D unstructured grid. An efficient parallel sparse matrix-vector multiplication scheme underlying the PCG-AMG algorithm is presented for the many-core GPU architecture. A performance comparison of the parallel solver shows that a single Nvidia Tesla C1060 GPU board delivers the performance of a sixteen-node InfiniBand cluster, and a multi-GPU configuration with eight GPUs is about 100 times faster than a typical server CPU core. © 2010 Springer-Verlag.
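
    The sparse matrix-vector product underlying PCG-AMG is the kernel that must be organized carefully on a many-core GPU. As a baseline illustration (not the paper's optimized scheme), a CSR-format kernel with one thread per row:

        // y = A*x for a CSR matrix: rowPtr has nRows+1 entries, colIdx/val
        // hold the nonzeros. One thread per row is the simplest mapping;
        // production schemes often assign a warp per row to coalesce loads.
        __global__ void spmvCsr(int nRows, const int* rowPtr, const int* colIdx,
                                const double* val, const double* x, double* y)
        {
            int row = blockIdx.x * blockDim.x + threadIdx.x;
            if (row >= nRows) return;
            double sum = 0.0;
            for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
                sum += val[j] * x[colIdx[j]];  // irregular gather from x
            y[row] = sum;
        }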

  17. Dual-Stack Single-Radio Communication Architecture for UAV Acting As a Mobile Node to Collect Data in WSNs

    Directory of Open Access Journals (Sweden)

    Ali Sayyed

    2015-09-01

Full Text Available The use of mobile nodes to collect data in a Wireless Sensor Network (WSN) has gained special attention over the last years. Some researchers explore the use of Unmanned Aerial Vehicles (UAVs) as mobile nodes for such data-collection purposes. Analyzing these works, it is apparent that mobile nodes used in such scenarios are typically equipped with at least two different radio interfaces. The present work presents a Dual-Stack Single-Radio Communication Architecture (DSSRCA), which allows a UAV to communicate bidirectionally with both a WSN and a Sink node. The proposed architecture was specifically designed to support different network QoS requirements, such as best-effort and more reliable communications, attending to both UAV-to-WSN and UAV-to-Sink communication needs. DSSRCA was implemented and tested on a real UAV, as detailed in this paper. The paper also includes a simulation analysis that addresses bandwidth consumption in an environmental monitoring application scenario, including an analysis of the data gathering rate that can be achieved at different UAV flight speeds. The obtained results show the viability of using a single radio transmitter for collecting data from the WSN and forwarding it to the Sink node.

  18. Software architecture for a multi-purpose real-time control unit for research purposes

    Science.gov (United States)

    Epple, S.; Jung, R.; Jalba, K.; Nasui, V.

    2017-05-01

A new, freely programmable, scalable control system for academic research purposes was developed. The intention was to have a control unit capable of handling multiple PT1000 temperature sensors at reasonable accuracy and over a reasonable temperature range, as well as digital input signals, while providing powerful output signals. To take full advantage of the system, control loops are run in real time. The whole eight-bit system, with very limited memory, runs independently of a personal computer. The two on-board RS232 connectors allow further units or other equipment to be connected, as required, in real time. This paper describes the software architecture for the third prototype, which now provides stable measurements and an improvement in accuracy compared to the previous designs. As a test case, a thermal solar system to produce hot tap water and assist heating in a single-family house was implemented. The solar fluid pump was power-controlled, and several temperatures at different points in the hydraulic system were measured and used in the control algorithms. The software architecture proved suitable for testing several different control strategies and their corresponding algorithms for the thermal solar system.

  19. Efficient Parallel Implementation of Active Appearance Model Fitting Algorithm on GPU

    Directory of Open Access Journals (Sweden)

    Jinwei Wang

    2014-01-01

Full Text Available The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods and has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs), which feature a many-core, fine-grained parallel architecture, provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine-grained parallelism, in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the Compute Unified Device Architecture (CUDA) on NVIDIA's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experiment results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures.
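
    The fine-grained mapping described above, one thread per texture pixel, can be sketched as follows. The nearest-neighbour sampling and all names are illustrative assumptions, not the authors' implementation:

        // One thread per texture pixel: sample the input image at the
        // shape-warped coordinate and subtract the model texture, producing
        // the error image that drives the AAM parameter update.
        __global__ void aamErrorImage(const float* image, int imgW,
                                      const float* warpX, const float* warpY,
                                      const float* modelTexture,
                                      float* error, int npix)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= npix) return;
            int px = (int)warpX[i];            // nearest-neighbour sample
            int py = (int)warpY[i];
            error[i] = image[py * imgW + px] - modelTexture[i];
        }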

  20. Graphics Processors in HEP Low-Level Trigger Systems

    International Nuclear Information System (INIS)

    Ammendola, Roberto; Biagioni, Andrea; Chiozzi, Stefano; Ramusino, Angelo Cotta; Cretaro, Paolo; Lorenzo, Stefano Di; Fantechi, Riccardo; Fiorini, Massimiliano; Frezza, Ottorino; Lamanna, Gianluca; Cicero, Francesca Lo; Lonardo, Alessandro; Martinelli, Michele; Neri, Ilaria; Paolucci, Pier Stanislao; Pastorelli, Elena; Piandani, Roberto; Pontisso, Luca; Rossetti, Davide; Simula, Francesco; Sozzi, Marco; Vicini, Piero

    2016-01-01

Usage of Graphics Processing Units (GPUs) in so-called general-purpose computing is emerging as an effective approach in several fields of science, although so far applications have employed GPUs typically for offline computations. Taking into account the steady performance increase of GPU architectures in terms of computing power and I/O capacity, real-time applications of these devices can thrive in high-energy physics data acquisition and trigger systems. We will examine the use of online parallel computing on GPUs for the synchronous low-level trigger, focusing on tests performed on the trigger system of the CERN NA62 experiment. To successfully integrate GPUs in such an online environment, the latencies of all components need analysing, networking being the most critical. To keep it under control, we envisioned NaNet, an FPGA-based PCIe Network Interface Card (NIC) enabling GPUDirect connection. Furthermore, it is assessed how specific trigger algorithms can be parallelized and thus benefit from a GPU implementation in terms of increased execution speed. Such improvements are particularly relevant for the foreseen Large Hadron Collider (LHC) luminosity upgrade, where highly selective algorithms will be essential to maintain sustainable trigger rates with very high pileup.

  1. Towards A New Opportunistic IoT Network Architecture for Wildlife Monitoring System

    NARCIS (Netherlands)

    Ayele, Eyuel Debebe; Meratnia, Nirvana; Havinga, Paul J.M.

In this paper we introduce an opportunistic dual-radio IoT network architecture for wildlife monitoring systems (WMS). Since data processing consumes less energy than transmitting the raw data, the proposed architecture leverages opportunistic mobile networks in a fixed LPWAN IoT network.

  2. Parallelized Kalman-Filter-Based Reconstruction of Particle Tracks on Many-Core Processors and GPUs

    Science.gov (United States)

    Cerati, Giuseppe; Elmer, Peter; Krutelyov, Slava; Lantz, Steven; Lefebvre, Matthieu; Masciovecchio, Mario; McDermott, Kevin; Riley, Daniel; Tadel, Matevž; Wittich, Peter; Würthwein, Frank; Yagil, Avi

    2017-08-01

For over a decade now, physical and energy constraints have limited clock speed improvements in commodity microprocessors. Instead, chipmakers have been pushed into producing lower-power, multi-core processors such as Graphical Processing Units (GPUs), ARM CPUs, and Intel MICs. Broad-based efforts from manufacturers and developers have been devoted to making these processors user-friendly enough to perform general computations. However, extracting performance from a larger number of cores, as well as specialized vector or SIMD units, requires special care in algorithm design and code optimization. One of the most computationally challenging problems in high-energy particle experiments is finding and fitting the charged-particle tracks during event reconstruction. This is expected to become by far the dominant problem at the High-Luminosity Large Hadron Collider (HL-LHC), for example. Today the most common track finding methods are those based on the Kalman filter. Experience with Kalman techniques on real tracking detector systems has shown that they are robust and provide high physics performance. This is why they are currently in use at the LHC, both in the trigger and offline. Previously we reported on the significant parallel speedups that resulted from our investigations to adapt Kalman filters to track fitting and track building on Intel Xeon and Xeon Phi. Here, we discuss our progress toward the understanding of these processors and the new developments to port the Kalman filter to NVIDIA GPUs.
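
    The parallelism being exploited is across track candidates: each candidate's state can be updated independently. The toy kernel below makes that structure explicit with a one-dimensional Kalman update per thread; real track fitting carries 5- or 6-dimensional states with full covariance matrices, and all names here are assumptions.

        // One thread per track candidate. The scalar update below is the
        // 1-D Kalman filter; the per-track independence, not the algebra,
        // is what maps onto the GPU.
        __global__ void kalmanUpdate1D(float* state, float* var,
                                       const float* meas, float measVar,
                                       int nTracks)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= nTracks) return;
            float k = var[i] / (var[i] + measVar); // Kalman gain
            state[i] += k * (meas[i] - state[i]);  // state update
            var[i]   *= (1.0f - k);                // variance update
        }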

  3. Parallelized Kalman-Filter-Based Reconstruction of Particle Tracks on Many-Core Processors and GPUs

    Directory of Open Access Journals (Sweden)

    Cerati Giuseppe

    2017-01-01

Full Text Available For over a decade now, physical and energy constraints have limited clock speed improvements in commodity microprocessors. Instead, chipmakers have been pushed into producing lower-power, multi-core processors such as Graphical Processing Units (GPUs), ARM CPUs, and Intel MICs. Broad-based efforts from manufacturers and developers have been devoted to making these processors user-friendly enough to perform general computations. However, extracting performance from a larger number of cores, as well as specialized vector or SIMD units, requires special care in algorithm design and code optimization. One of the most computationally challenging problems in high-energy particle experiments is finding and fitting the charged-particle tracks during event reconstruction. This is expected to become by far the dominant problem at the High-Luminosity Large Hadron Collider (HL-LHC), for example. Today the most common track finding methods are those based on the Kalman filter. Experience with Kalman techniques on real tracking detector systems has shown that they are robust and provide high physics performance. This is why they are currently in use at the LHC, both in the trigger and offline. Previously we reported on the significant parallel speedups that resulted from our investigations to adapt Kalman filters to track fitting and track building on Intel Xeon and Xeon Phi. Here, we discuss our progress toward the understanding of these processors and the new developments to port the Kalman filter to NVIDIA GPUs.

  4. Parallelized Kalman-Filter-Based Reconstruction of Particle Tracks on Many-Core Processors and GPUs

    Energy Technology Data Exchange (ETDEWEB)

    Cerati, Giuseppe [Fermilab; Elmer, Peter [Princeton U.; Krutelyov, Slava [UC, San Diego; Lantz, Steven [Cornell U.; Lefebvre, Matthieu [Princeton U.; Masciovecchio, Mario [UC, San Diego; McDermott, Kevin [Cornell U.; Riley, Daniel [Cornell U., LNS; Tadel, Matevž [UC, San Diego; Wittich, Peter [Cornell U.; Würthwein, Frank [UC, San Diego; Yagil, Avi [UC, San Diego

    2017-01-01

For over a decade now, physical and energy constraints have limited clock speed improvements in commodity microprocessors. Instead, chipmakers have been pushed into producing lower-power, multi-core processors such as Graphical Processing Units (GPUs), ARM CPUs, and Intel MICs. Broad-based efforts from manufacturers and developers have been devoted to making these processors user-friendly enough to perform general computations. However, extracting performance from a larger number of cores, as well as specialized vector or SIMD units, requires special care in algorithm design and code optimization. One of the most computationally challenging problems in high-energy particle experiments is finding and fitting the charged-particle tracks during event reconstruction. This is expected to become by far the dominant problem at the High-Luminosity Large Hadron Collider (HL-LHC), for example. Today the most common track finding methods are those based on the Kalman filter. Experience with Kalman techniques on real tracking detector systems has shown that they are robust and provide high physics performance. This is why they are currently in use at the LHC, both in the trigger and offline. Previously we reported on the significant parallel speedups that resulted from our investigations to adapt Kalman filters to track fitting and track building on Intel Xeon and Xeon Phi. Here, we discuss our progress toward the understanding of these processors and the new developments to port the Kalman filter to NVIDIA GPUs.

  5. CHOLLA: A NEW MASSIVELY PARALLEL HYDRODYNAMICS CODE FOR ASTROPHYSICAL SIMULATION

    International Nuclear Information System (INIS)

    Schneider, Evan E.; Robertson, Brant E.

    2015-01-01

We present Computational Hydrodynamics On ParaLLel Architectures (Cholla), a new three-dimensional hydrodynamics code that harnesses the power of graphics processing units (GPUs) to accelerate astrophysical simulations. Cholla models the Euler equations on a static mesh using state-of-the-art techniques, including the unsplit Corner Transport Upwind algorithm, a variety of exact and approximate Riemann solvers, and multiple spatial reconstruction techniques including the piecewise parabolic method (PPM). Using GPUs, Cholla evolves the fluid properties of thousands of cells simultaneously and can update over 10 million cells per GPU-second while using an exact Riemann solver and PPM reconstruction. Owing to the massively parallel architecture of GPUs and the design of the Cholla code, astrophysical simulations with physically interesting grid resolutions (≳256³) can easily be computed on a single device. We use the Message Passing Interface library to extend calculations onto multiple devices and demonstrate nearly ideal scaling beyond 64 GPUs. A suite of test problems highlights the physical accuracy of our modeling and provides a useful comparison to other codes. We then use Cholla to simulate the interaction of a shock wave with a gas cloud in the interstellar medium, showing that the evolution of the cloud is highly dependent on its density structure. We reconcile the computed mixing time of a turbulent cloud with a realistic density distribution destroyed by a strong shock with the existing analytic theory for spherical cloud destruction by describing the system in terms of its median gas density.

  6. CHOLLA: A NEW MASSIVELY PARALLEL HYDRODYNAMICS CODE FOR ASTROPHYSICAL SIMULATION

    Energy Technology Data Exchange (ETDEWEB)

    Schneider, Evan E.; Robertson, Brant E. [Steward Observatory, University of Arizona, 933 North Cherry Avenue, Tucson, AZ 85721 (United States)

    2015-04-15

We present Computational Hydrodynamics On ParaLLel Architectures (Cholla), a new three-dimensional hydrodynamics code that harnesses the power of graphics processing units (GPUs) to accelerate astrophysical simulations. Cholla models the Euler equations on a static mesh using state-of-the-art techniques, including the unsplit Corner Transport Upwind algorithm, a variety of exact and approximate Riemann solvers, and multiple spatial reconstruction techniques including the piecewise parabolic method (PPM). Using GPUs, Cholla evolves the fluid properties of thousands of cells simultaneously and can update over 10 million cells per GPU-second while using an exact Riemann solver and PPM reconstruction. Owing to the massively parallel architecture of GPUs and the design of the Cholla code, astrophysical simulations with physically interesting grid resolutions (≳256³) can easily be computed on a single device. We use the Message Passing Interface library to extend calculations onto multiple devices and demonstrate nearly ideal scaling beyond 64 GPUs. A suite of test problems highlights the physical accuracy of our modeling and provides a useful comparison to other codes. We then use Cholla to simulate the interaction of a shock wave with a gas cloud in the interstellar medium, showing that the evolution of the cloud is highly dependent on its density structure. We reconcile the computed mixing time of a turbulent cloud with a realistic density distribution destroyed by a strong shock with the existing analytic theory for spherical cloud destruction by describing the system in terms of its median gas density.

  7. Energy- and cost-efficient lattice-QCD computations using graphics processing units

    Energy Technology Data Exchange (ETDEWEB)

    Bach, Matthias

    2014-07-01

Quarks and gluons are the building blocks of all hadronic matter, like protons and neutrons. Their interaction is described by Quantum Chromodynamics (QCD), a theory under test by large-scale experiments like the Large Hadron Collider (LHC) at CERN and, in the future, the Facility for Antiproton and Ion Research (FAIR) at GSI. However, perturbative methods can only be applied to QCD at high energies. Studies from first principles are possible via a discretization onto a Euclidean space-time grid. This discretization of QCD is called Lattice QCD (LQCD) and is the only ab-initio option outside of the high-energy regime. LQCD is extremely compute and memory intensive. In particular, it is by definition always bandwidth limited. Thus, despite the complexity of LQCD applications, it led to the development of several specialized compute platforms and influenced the development of others. In recent years, however, General-Purpose computation on Graphics Processing Units (GPGPU) has come up as a new means for parallel computing. Contrary to machines traditionally used for LQCD, graphics processing units (GPUs) are a mass-market product. This promises advantages in both the pace at which higher-performing hardware becomes available and its price. CL2QCD is an OpenCL-based implementation of LQCD using Wilson fermions that was developed within this thesis. It operates on GPUs by all major vendors as well as on central processing units (CPUs). On the AMD Radeon HD 7970 it provides the fastest double-precision D kernel for a single GPU, achieving 120 GFLOPS. D, the most compute-intensive kernel in LQCD simulations, is commonly used to compare LQCD platforms. This performance is enabled by an in-depth analysis of optimization techniques for bandwidth-limited codes on GPUs. Further, analysis of the communication between GPU and CPU, as well as between multiple GPUs, enables high-performance Krylov space solvers and linear scaling to multiple GPUs within a single system. LQCD

  8. Energy- and cost-efficient lattice-QCD computations using graphics processing units

    International Nuclear Information System (INIS)

    Bach, Matthias

    2014-01-01

Quarks and gluons are the building blocks of all hadronic matter, like protons and neutrons. Their interaction is described by Quantum Chromodynamics (QCD), a theory under test by large-scale experiments like the Large Hadron Collider (LHC) at CERN and, in the future, the Facility for Antiproton and Ion Research (FAIR) at GSI. However, perturbative methods can only be applied to QCD at high energies. Studies from first principles are possible via a discretization onto a Euclidean space-time grid. This discretization of QCD is called Lattice QCD (LQCD) and is the only ab-initio option outside of the high-energy regime. LQCD is extremely compute and memory intensive. In particular, it is by definition always bandwidth limited. Thus, despite the complexity of LQCD applications, it led to the development of several specialized compute platforms and influenced the development of others. In recent years, however, General-Purpose computation on Graphics Processing Units (GPGPU) has come up as a new means for parallel computing. Contrary to machines traditionally used for LQCD, graphics processing units (GPUs) are a mass-market product. This promises advantages in both the pace at which higher-performing hardware becomes available and its price. CL2QCD is an OpenCL-based implementation of LQCD using Wilson fermions that was developed within this thesis. It operates on GPUs by all major vendors as well as on central processing units (CPUs). On the AMD Radeon HD 7970 it provides the fastest double-precision D kernel for a single GPU, achieving 120 GFLOPS. D, the most compute-intensive kernel in LQCD simulations, is commonly used to compare LQCD platforms. This performance is enabled by an in-depth analysis of optimization techniques for bandwidth-limited codes on GPUs. Further, analysis of the communication between GPU and CPU, as well as between multiple GPUs, enables high-performance Krylov space solvers and linear scaling to multiple GPUs within a single system. LQCD

  9. Fast analysis of molecular dynamics trajectories with graphics processing units-Radial distribution function histogramming

    International Nuclear Information System (INIS)

    Levine, Benjamin G.; Stone, John E.; Kohlmeyer, Axel

    2011-01-01

    The calculation of radial distribution functions (RDFs) from molecular dynamics trajectory data is a common and computationally expensive analysis task. The rate limiting step in the calculation of the RDF is building a histogram of the distance between atom pairs in each trajectory frame. Here we present an implementation of this histogramming scheme for multiple graphics processing units (GPUs). The algorithm features a tiling scheme to maximize the reuse of data at the fastest levels of the GPU's memory hierarchy and dynamic load balancing to allow high performance on heterogeneous configurations of GPUs. Several versions of the RDF algorithm are presented, utilizing the specific hardware features found on different generations of GPUs. We take advantage of larger shared memory and atomic memory operations available on state-of-the-art GPUs to accelerate the code significantly. The use of atomic memory operations allows the fast, limited-capacity on-chip memory to be used much more efficiently, resulting in a fivefold increase in performance compared to the version of the algorithm without atomic operations. The ultimate version of the algorithm running in parallel on four NVIDIA GeForce GTX 480 (Fermi) GPUs was found to be 92 times faster than a multithreaded implementation running on an Intel Xeon 5550 CPU. On this multi-GPU hardware, the RDF between two selections of 1,000,000 atoms each can be calculated in 26.9 s per frame. The multi-GPU RDF algorithms described here are implemented in VMD, a widely used and freely available software package for molecular dynamics visualization and analysis.
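
    The tiling-plus-atomics strategy credited above with the fivefold gain follows a standard pattern: each block accumulates a private histogram in fast shared memory with atomicAdd and merges it into the global histogram once. A hedged sketch, assuming the pair distances have already been computed into an array (the real code fuses distance computation and binning):

        // Per-block shared-memory histogram with a single global merge.
        __global__ void distanceHistogram(const float* dist, int n,
                                          unsigned int* globalHist,
                                          int nBins, float binWidth)
        {
            extern __shared__ unsigned int localHist[];
            for (int b = threadIdx.x; b < nBins; b += blockDim.x)
                localHist[b] = 0;                   // zero the block's bins
            __syncthreads();

            int stride = gridDim.x * blockDim.x;
            for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
                int bin = (int)(dist[i] / binWidth);
                if (bin < nBins)
                    atomicAdd(&localHist[bin], 1u); // fast on-chip atomic
            }
            __syncthreads();

            for (int b = threadIdx.x; b < nBins; b += blockDim.x)
                atomicAdd(&globalHist[b], localHist[b]); // one global merge
        }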

  10. Initial Assessment of Parallelization of Monte Carlo Calculation using Graphics Processing Units

    International Nuclear Information System (INIS)

    Choi, Sung Hoon; Joo, Han Gyu

    2009-01-01

Monte Carlo (MC) simulation is an effective tool for calculating neutron transport in complex geometry. However, because Monte Carlo simulates the behavior of each neutron one by one, it takes a very long computing time if enough neutrons are used for high precision of calculation. Accordingly, methods that reduce the computing time are required. A Monte Carlo code is well suited to parallel calculation, since it simulates the behavior of each neutron independently and parallel computation is therefore natural. The parallelization of Monte Carlo codes, however, was previously done using multiple CPUs. Driven by the global demand for high-quality 3D graphics, the Graphics Processing Unit (GPU) has developed into a highly parallel, multi-core processor. This parallel processing capability of GPUs becomes available to engineering computing once a suitable interface is provided. Recently, NVIDIA introduced CUDA™, a general-purpose parallel computing architecture. CUDA is a software environment that allows developers to manage the GPU using C/C++ or other languages. In this work, a GPU-based Monte Carlo code is developed and an initial assessment of its parallel performance is presented.

  11. Dual Smarandache Curves and Smarandache Ruled Surfaces

    OpenAIRE

    Tanju KAHRAMAN; Mehmet ÖNDER; H. Hüseyin UGURLU

    2013-01-01

In this paper, by considering the dual geodesic trihedron (dual Darboux frame), we define dual Smarandache curves lying fully on the dual unit sphere S^2 and corresponding to ruled surfaces. We obtain the relationships between the elements of curvature of a dual spherical curve (ruled surface) x(s) and its dual Smarandache curve (Smarandache ruled surface) x1(s), and we give an example of dual Smarandache curves of a dual spherical curve.

  12. Evaluation of the performance of combined cooling, heating, and power systems with dual power generation units

    International Nuclear Information System (INIS)

    Knizley, Alta A.; Mago, Pedro J.; Smith, Amanda D.

    2014-01-01

The benefits of using a combined cooling, heating, and power system with dual power generation units (D-CCHP) are examined in nine different U.S. locations. One power generation unit (PGU) is operated at base load while the other is operated following the electric load. The waste heat from both PGUs is used for heating and for cooling via an absorption chiller. The D-CCHP configuration is studied for a restaurant benchmark building, and its performance is quantified in terms of operational cost, primary energy consumption (PEC), and carbon dioxide emissions (CDE). Cost spark spread, PEC spark spread, and CDE spark spread are examined as performance indicators for the D-CCHP system. D-CCHP system performance correlates well with spark spreads, with higher spark spreads signifying greater savings through implementation of a D-CCHP system. A new parameter, thermal difference, is introduced to investigate the relative performance of a D-CCHP system compared to a dual-PGU combined heat and power system (D-CHP). Thermal difference, together with spark spread, can explain the variation in savings of a D-CCHP system over a D-CHP system for each location. The effect of carbon credits on operational cost savings with respect to the reference case is shown for selected locations. - Highlights: • We investigate benefits from using combined cooling, heating, and power systems. • A dual power generation unit configuration is considered for CCHP and CHP. • Spark spreads for cost, energy, and emissions correlate with potential savings. • The thermal difference parameter helps to explain variations in potential savings. • Carbon credits may increase cost savings where emissions savings are possible.

  13. Introduction to assembly of finite element methods on graphics processors

    International Nuclear Information System (INIS)

    Cecka, Cristopher; Lew, Adrian; Darve, Eric

    2010-01-01

Recently, graphics processing units (GPUs) have had great success in accelerating numerical computations. We present their application to computations on unstructured meshes such as those in finite element methods. Multiple approaches to assembling and solving sparse linear systems with NVIDIA GPUs and the Compute Unified Device Architecture (CUDA) are presented and discussed. Multiple strategies for efficient use of global, shared, and local memory, methods to achieve memory coalescing, and optimal choice of parameters are introduced. We find that with appropriate preprocessing and arrangement of support data, the GPU coprocessor achieves speedups of 30x or more in comparison to a well-optimized serial implementation on the CPU. We also find that the optimal assembly strategy depends on the order of polynomials used in the finite-element discretization.
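
    One of the assembly strategies such a study weighs is per-element scattering with atomic additions, which resolves the write conflicts between elements sharing a node without any preprocessing. A minimal sketch for a nodal right-hand-side vector over linear triangles (the names and the equal load split are illustrative assumptions):

        // One thread per element: scatter the element's load contribution
        // to its three nodes. atomicAdd serializes only the conflicting
        // writes from neighbouring elements.
        __global__ void assembleRhs(int nElems,
                                    const int* connectivity, // 3 node ids/elem
                                    const float* elemLoad, float* globalRhs)
        {
            int e = blockIdx.x * blockDim.x + threadIdx.x;
            if (e >= nElems) return;
            float contrib = elemLoad[e] / 3.0f;   // equal split per node
            for (int a = 0; a < 3; ++a)
                atomicAdd(&globalRhs[connectivity[3 * e + a]], contrib);
        }

    The alternative the abstract alludes to, preprocessed gather tables or colouring, avoids the atomics at the cost of support data; which wins depends on the polynomial order, as the authors note.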

  14. HONEI: A collection of libraries for numerical computations targeting multiple processor architectures

    Science.gov (United States)

    van Dyk, Danny; Geveler, Markus; Mallach, Sven; Ribbrock, Dirk; Göddeke, Dominik; Gutwenger, Carsten

    2009-12-01

We present HONEI, an open-source collection of libraries offering a hardware-oriented approach to numerical calculations. HONEI abstracts the hardware, and applications written on top of HONEI can be executed on a wide range of computer architectures such as CPUs, GPUs and the Cell processor. We demonstrate the flexibility and performance of our approach with two test applications, a Finite Element multigrid solver for the Poisson problem and a robust and fast simulation of shallow water waves. By linking against HONEI's libraries, we achieve a two-fold speedup over straightforward C++ code using HONEI's SSE backend, and additionally 3-4 and 4-16 times faster execution on the Cell and a GPU. A second important aspect of our approach is that the full performance capabilities of the hardware under consideration can be exploited by adding optimised application-specific operations to the HONEI libraries. HONEI provides all necessary infrastructure for development and evaluation of such kernels, significantly simplifying their development. Program summary: Program title: HONEI. Catalogue identifier: AEDW_v1_0. Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEDW_v1_0.html. Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland. Licensing provisions: GPLv2. No. of lines in distributed program, including test data, etc.: 216 180. No. of bytes in distributed program, including test data, etc.: 1 270 140. Distribution format: tar.gz. Programming language: C++. Computer: x86, x86_64, NVIDIA CUDA GPUs, Cell blades and PlayStation 3. Operating system: Linux. RAM: at least 500 MB free. Classification: 4.8, 4.3, 6.1. External routines: SSE: none; [1] for GPU, [2] for Cell backend. Nature of problem: Computational science in general and numerical simulation in particular have reached a turning point. The revolution developers are facing is not primarily driven by a change in (problem-specific) methodology, but rather by the fundamental paradigm shift of the

  15. An architecture for fault tolerant controllers

    DEFF Research Database (Denmark)

    Niemann, Hans Henrik; Stoustrup, Jakob

    2005-01-01

A general architecture for fault tolerant control is proposed. The architecture is based on the (primary) YJBK parameterization of all stabilizing compensators and uses the dual YJBK parameterization to quantify the performance of the fault tolerant system. The approach suggested can be applied ... degradation in the sense of guaranteed degraded performance. A number of fault diagnosis problems, fault tolerant control problems, and feedback control with fault rejection problems are formulated/considered, mainly from a fault modeling point of view. The method is illustrated on a servo example including ...

  16. A dual-mode proximity sensor with integrated capacitive and temperature sensing units

    International Nuclear Information System (INIS)

    Qiu, Shihua; Huang, Ying; He, Xiaoyue; Sun, Zhiguang; Liu, Ping; Liu, Caixia

    2015-01-01

The proximity sensor is one of the most important devices in the field of robot applications. It can accurately provide proximity information that helps robots interact safely with human beings and the external environment. In this paper, we propose and demonstrate a dual-mode proximity sensor composed of capacitive and resistive sensing units. We define the capacitive-type proximity sensing, which perceives the proximity information, as C-mode, and the resistive-type detection as R-mode. Graphene nanoplatelets (GNPs) were chosen as the R-mode sensing material because of their high performance. The dual-mode proximity sensor presents the following features: (1) the sensing distance of the dual-mode proximity sensor is enlarged compared with a single capacitive proximity sensor of the same geometrical pattern; (2) experiments have verified that the proposed sensor can sense the proximity information of different materials; (3) the proximity sensing capability of the sensor is improved by the two modes perceiving collaboratively. For a plastic block at a temperature of 60 °C, the R-mode perceives the proximity information when the distance d between the sensor and the object is 6.0–17.0 mm, the C-mode does so when the distance is 0–2.0 mm, and the two modes work together when the distance is 2.0–6.0 mm. These features indicate that our transducer is very valuable in skin-like sensing applications. (paper)

  17. Graphics processing units accelerated semiclassical initial value representation molecular dynamics

    Energy Technology Data Exchange (ETDEWEB)

    Tamascelli, Dario; Dambrosio, Francesco Saverio [Dipartimento di Fisica, Università degli Studi di Milano, via Celoria 16, 20133 Milano (Italy); Conte, Riccardo [Department of Chemistry and Cherry L. Emerson Center for Scientific Computation, Emory University, Atlanta, Georgia 30322 (United States); Ceotto, Michele, E-mail: michele.ceotto@unimi.it [Dipartimento di Chimica, Università degli Studi di Milano, via Golgi 19, 20133 Milano (Italy)

    2014-05-07

This paper presents a Graphics Processing Unit (GPU) implementation of the Semiclassical Initial Value Representation (SC-IVR) propagator for vibrational molecular spectroscopy calculations. The time-averaging formulation of the SC-IVR for power spectrum calculations is employed. Details of the GPU implementation of the semiclassical code are provided. Four molecules with an increasing number of atoms are considered, and the GPU-calculated vibrational frequencies perfectly match the benchmark values. The computational time scaling of two GPUs (NVIDIA Tesla C2075 and Kepler K20) versus two CPUs (Intel Core i5 and Intel Xeon E5-2687W), respectively, and the critical issues related to the GPU implementation are discussed. The resulting reduction in computational time and power consumption is significant, and semiclassical GPU calculations are shown to be environmentally friendly.

  18. Dual use of condoms with other contraceptive methods among adolescents and young women in the United States.

    Science.gov (United States)

    Tyler, Crystal P; Whiteman, Maura K; Kraft, Joan Marie; Zapata, Lauren B; Hillis, Susan D; Curtis, Kathryn M; Anderson, John; Pazol, Karen; Marchbanks, Polly A

    2014-02-01

    To estimate the prevalence of and factors associated with dual method use (i.e., condom with hormonal contraception or an intrauterine device) among adolescents and young women in the United States. We used 2006-2010 National Survey of Family Growth data from 2,093 unmarried females aged 15-24 years and at risk for unintended pregnancy. Using multivariable logistic regression, we estimated adjusted odds ratios (aORs) and 95% confidence intervals (CIs) to assess the associations between dual method use at last sex and sociodemographic, behavioral, reproductive history, and sexual behavior factors. At last sex, 20.7% of adolescents and young women used dual methods, 34.4% used condoms alone, 29.1% used hormonal contraception or an intrauterine device alone, and 15.8% used another method or no method. Factors associated with decreased odds of dual method use versus dual method nonuse included having a previous pregnancy (aOR = .44, 95% CI .27-.69), not having health insurance coverage over the past 12 months (aOR = .41, 95% CI .19-.91), and having sex prior to age 16 (aOR = .49, 95% CI .30-.78). The prevalence of dual method use is low among adolescents and young women. Adolescents and young women who may have a higher risk of pregnancy and sexually transmitted infections (e.g., those with a previous pregnancy) were less likely to use dual methods at last sex. Interventions are needed to increase the correct and consistent use of dual methods among adolescents and young women who may be at greater risk for unintended pregnancy and sexually transmitted infections. Published by Elsevier Inc.

  19. Embedded-Based Graphics Processing Unit Cluster Platform for Multiple Sequence Alignments

    Directory of Open Access Journals (Sweden)

    Jyh-Da Wei

    2017-08-01

Full Text Available High-end graphics processing units (GPUs), such as NVIDIA Tesla/Fermi/Kepler series cards with thousands of cores per chip, have been widely applied to high-performance computing fields over the past decade. These desktop GPU cards must be installed in personal computers or servers with desktop CPUs, and the cost and power consumption of constructing a GPU cluster platform are very high. In recent years, NVIDIA released an embedded board, called Jetson Tegra K1 (TK1), which contains 4 ARM Cortex-A15 CPUs and 192 Compute Unified Device Architecture cores (belonging to the Kepler GPU family). Jetson Tegra K1 has several advantages, such as low cost, low power consumption, and high applicability, and it has been applied in several specific applications. In our previous work, a bioinformatics platform with a single TK1 (STK platform) was constructed, and that work also showed that Web and mobile services can be implemented on the STK platform with a good cost-performance ratio by comparing the STK platform with desktop CPUs and GPUs. In this work, an embedded-based GPU cluster platform is constructed with multiple TK1s (MTK platform). Complex system installation and setup are necessary procedures at first. Then, 2 job assignment modes are designed for the MTK platform to provide services for users. Finally, ClustalW v2.0.11 and ClustalWtk were ported to the MTK platform. The experimental results showed that the speedup ratios achieved 5.5 and 4.8 times for ClustalW v2.0.11 and ClustalWtk, respectively, when comparing 6 TK1s with a single TK1. The MTK platform is proven to be useful for multiple sequence alignments.

  20. Embedded-Based Graphics Processing Unit Cluster Platform for Multiple Sequence Alignments.

    Science.gov (United States)

    Wei, Jyh-Da; Cheng, Hui-Jun; Lin, Chun-Yuan; Ye, Jin; Yeh, Kuan-Yu

    2017-01-01

High-end graphics processing units (GPUs), such as NVIDIA Tesla/Fermi/Kepler series cards with thousands of cores per chip, have been widely applied to high-performance computing fields over the past decade. These desktop GPU cards must be installed in personal computers or servers with desktop CPUs, and the cost and power consumption of constructing a GPU cluster platform are very high. In recent years, NVIDIA released an embedded board, called Jetson Tegra K1 (TK1), which contains 4 ARM Cortex-A15 CPUs and 192 Compute Unified Device Architecture cores (belonging to the Kepler GPU family). Jetson Tegra K1 has several advantages, such as low cost, low power consumption, and high applicability, and it has been applied in several specific applications. In our previous work, a bioinformatics platform with a single TK1 (STK platform) was constructed, and that work also showed that Web and mobile services can be implemented on the STK platform with a good cost-performance ratio by comparing the STK platform with desktop CPUs and GPUs. In this work, an embedded-based GPU cluster platform is constructed with multiple TK1s (MTK platform). Complex system installation and setup are necessary procedures at first. Then, 2 job assignment modes are designed for the MTK platform to provide services for users. Finally, ClustalW v2.0.11 and ClustalWtk were ported to the MTK platform. The experimental results showed that the speedup ratios achieved 5.5 and 4.8 times for ClustalW v2.0.11 and ClustalWtk, respectively, when comparing 6 TK1s with a single TK1. The MTK platform is proven to be useful for multiple sequence alignments.

  1. Edge-preserving image denoising via group coordinate descent on the GPU

    OpenAIRE

    McGaffin, Madison G.; Fessler, Jeffrey A.

    2015-01-01

    Image denoising is a fundamental operation in image processing, and its applications range from the direct (photographic enhancement) to the technical (as a subproblem in image reconstruction algorithms). In many applications, the number of pixels has continued to grow, while the serial execution speed of computational hardware has begun to stall. New image processing algorithms must exploit the power offered by massively parallel architectures like graphics processing units (GPUs). This pape...

  2. Enterprise architecture evaluation using architecture framework and UML stereotypes

    Directory of Open Access Journals (Sweden)

    Narges Shahi

    2014-08-01

Full Text Available There is an increasing need for enterprise architecture in numerous organizations with complicated systems and various processes, as support grows for information technology and for organizational units whose elements maintain complex relationships. Enterprise architecture is so effective that its non-use in organizations is regarded as an institutional inability to manage information technology efficiently. The enterprise architecture process generally consists of three phases: strategic programming of information technology, enterprise architecture programming, and enterprise architecture implementation. Each phase must be implemented sequentially, and a single flaw in any phase may result in a flaw in the whole architecture and, consequently, in extra costs and time. If a model is mapped for the issue and evaluated before enterprise architecture implementation in the second phase, possible flaws in the implementation process are prevented. In this study, the processes of enterprise architecture are illustrated through UML diagrams, and the architecture is evaluated in the programming phase by transforming the UML diagrams into Petri nets. The results indicate that the high costs of the implementation phase will be reduced.

  3. Solving Matrix Equations on Multi-Core and Many-Core Architectures

    Directory of Open Access Journals (Sweden)

    Peter Benner

    2013-11-01

Full Text Available We address the numerical solution of Lyapunov, algebraic and differential Riccati equations, via the matrix sign function, on platforms equipped with general-purpose multicore processors and, optionally, one or more graphics processing units (GPUs). In particular, we review the solvers for these equations, as well as the underlying methods, analyze their concurrency and scalability, and provide details on their parallel implementation. Our experimental results show that this class of hardware provides sufficient computational power to tackle large-scale problems, which only a few years ago would have required a cluster of computers.
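
    For orientation, the matrix sign function named above is typically evaluated with the Newton iteration, whose building blocks (dense inversion, addition, scaling) are exactly the operations GPUs accelerate well. A standard textbook formulation, not necessarily the authors' exact scaled variant:

        X_0 = A, \qquad
        X_{k+1} = \tfrac{1}{2}\left(X_k + X_k^{-1}\right), \qquad
        \operatorname{sign}(A) = \lim_{k \to \infty} X_k .

    The iteration converges quadratically provided A has no eigenvalues on the imaginary axis; the Lyapunov or Riccati solution is then recovered from blocks of the sign of an associated structured matrix.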

  4. Computation studies into architecture and energy transfer properties of photosynthetic units from filamentous anoxygenic phototrophs

    Energy Technology Data Exchange (ETDEWEB)

    Linnanto, Juha Matti [Institute of Physics, University of Tartu, Riia 142, 51014 Tartu (Estonia); Freiberg, Arvi [Institute of Physics, University of Tartu, Riia 142, 51014 Tartu, Estonia and Institute of Molecular and Cell Biology, University of Tartu, Riia 23, 51010 Tartu (Estonia)

    2014-10-06

We have used different computational methods to study the structural architecture and the light-harvesting and energy transfer properties of the photosynthetic unit of filamentous anoxygenic phototrophs. Due to the huge number of atoms in the photosynthetic unit, a combination of atomistic and coarse-grained methods was used for the electronic structure calculations. The calculations reveal that the light energy absorbed by the peripheral chlorosome antenna complex transfers efficiently via the baseplate and the core B808–866 antenna complexes to the reaction center complex, in general agreement with the present understanding of this complex system.

  5. CURING EFFICIENCY OF DUAL-CURE RESIN CEMENT UNDER ZIRCONIA WITH TWO DIFFERENT LIGHT CURING UNITS

    Directory of Open Access Journals (Sweden)

    Pınar GÜLTEKİN

    2015-04-01

Full Text Available Purpose: Adequate polymerization is a crucial factor in obtaining optimal physical properties and a satisfying clinical performance from composite resin materials. The aim of this study was to evaluate the polymerization efficiency of dual-cure resin cement cured with two different light curing units under zirconia structures of differing thicknesses. Materials and Methods: Four zirconia framework discs, 4 mm in diameter and 0.5 mm, 1 mm and 1.5 mm in thickness, were prepared using a computer-aided design system. One of the 0.5 mm-thick substructures was left mono-layered, whereas the others were layered with feldspathic porcelain of the same thickness, and ceramic samples of 4 different thicknesses (0.5, 1, 1.5 and 2.0 mm) were prepared. For each group (n=12), resin cement was light cured in polytetrafluoroethylene molds using a Light Emitting Diode (LED) or Quartz-Tungsten Halogen (QTH) light curing unit under each of the 4 zirconia-based discs (n=96). The depth of cure (in mm) and the Vickers Hardness Number (VHN) values were evaluated for each specimen. Results: The LED curing unit produced a greater depth of cure compared to QTH under ceramic discs of 0.5 and 1 mm thickness (p<0.05). At 100 μm and 300 μm depth, the LED unit produced significantly greater VHN values compared to the QTH unit (p<0.05). At 500 μm depth, the difference between the VHN values of the LED and QTH groups was not statistically significant. Conclusion: Light curing may not result in adequate resin cement polymerization under thick zirconia structures. LED light sources should be preferred over QTH for curing dual-cure resin cements, especially for those under thicker zirconia restorations.

  6. Design of Dual-Mode Local Oscillators Using CMOS Technology for Motion Detection Sensors.

    Science.gov (United States)

    Ha, Keum-Won; Lee, Jeong-Yun; Kim, Jeong-Geun; Baek, Donghyun

    2018-04-01

Recently, studies have been actively carried out to implement motion-detecting sensors by applying radar techniques. Doppler radar or frequency-modulated continuous wave (FMCW) radar is mainly used, but each type has drawbacks. In Doppler radar, no signal is detected when the movement stops; FMCW radar, in turn, cannot function when the detected object is near the sensor. Therefore, by implementing a single continuous wave (CW) radar operating in dual mode, the disadvantages of each mode can be compensated for. In this paper, a dual-mode local oscillator (LO) is proposed that makes a CW radar operate as either a Doppler or an FMCW radar. To make the dual-mode LO, a method that controls the division ratio of the phase-locked loop (PLL) is used. To support both radar modes easily, the proposed LO is implemented by adding a frequency sweep generator (FSG) block to a fractional-N PLL. The operation mode of the LO is determined according to whether this block is operating or not. Since most radar sensors are used in conjunction with microcontroller units (MCUs), the proposed architecture is capable of dual-mode operation by changing only the input control code. In addition, all components such as the VCO, LDO, and loop filter are integrated into the chip, so complexity and interface issues can be solved when implementing radar sensors. Thus, the proposed dual-mode LO is suitable as a radar sensor.

  7. Design of Dual-Mode Local Oscillators Using CMOS Technology for Motion Detection Sensors

    Directory of Open Access Journals (Sweden)

    Keum-Won Ha

    2018-04-01

Full Text Available Recently, studies have been actively carried out to implement motion-detecting sensors by applying radar techniques. Doppler radar or frequency-modulated continuous wave (FMCW) radar is mainly used, but each type has drawbacks. In Doppler radar, no signal is detected when the movement stops; FMCW radar, in turn, cannot function when the detected object is near the sensor. Therefore, by implementing a single continuous wave (CW) radar operating in dual mode, the disadvantages of each mode can be compensated for. In this paper, a dual-mode local oscillator (LO) is proposed that makes a CW radar operate as either a Doppler or an FMCW radar. To make the dual-mode LO, a method that controls the division ratio of the phase-locked loop (PLL) is used. To support both radar modes easily, the proposed LO is implemented by adding a frequency sweep generator (FSG) block to a fractional-N PLL. The operation mode of the LO is determined according to whether this block is operating or not. Since most radar sensors are used in conjunction with microcontroller units (MCUs), the proposed architecture is capable of dual-mode operation by changing only the input control code. In addition, all components such as the VCO, LDO, and loop filter are integrated into the chip, so complexity and interface issues can be solved when implementing radar sensors. Thus, the proposed dual-mode LO is suitable as a radar sensor.

  8. A Programming Model for Massive Data Parallelism with Data Dependencies

    International Nuclear Information System (INIS)

    Cui, Xiaohui; Mueller, Frank; Potok, Thomas E.; Zhang, Yongpeng

    2009-01-01

Accelerating processors can often be more cost- and energy-effective for a wide range of data-parallel computing problems than general-purpose processors. For graphics processing units (GPUs), this is particularly the case when program development is aided by environments such as NVIDIA's Compute Unified Device Architecture (CUDA), which dramatically reduces the gap between domain-specific architectures and general-purpose programming. Nonetheless, general-purpose GPU (GPGPU) programming remains subject to several restrictions. Most significantly, the separation of host (CPU) and accelerator (GPU) address spaces requires explicit management of GPU memory resources, especially for massive data parallelism that well exceeds the memory capacity of GPUs. One solution to this problem is to transfer data between the GPU and host memories frequently. In this work, we investigate another approach. We run massively data-parallel applications on GPU clusters. We further propose a programming model for massive data parallelism with data dependencies for this scenario. Experience from micro-benchmarks and real-world applications shows that our model provides not only ease of programming but also significant performance gains.
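
    The address-space separation called out above is what forces explicit staging in plain CUDA; the hedged host-side sketch below shows the boilerplate that such a programming model aims to hide (the doubling kernel and sizes are placeholders):

        #include <cuda_runtime.h>
        #include <cstdlib>

        __global__ void process(float* d, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) d[i] *= 2.0f;              // placeholder computation
        }

        int main() {
            const int n = 1 << 20;
            const size_t bytes = n * sizeof(float);

            float* h = (float*)malloc(bytes);     // host address space
            for (int i = 0; i < n; ++i) h[i] = 1.0f;

            float* d = nullptr;
            cudaMalloc(&d, bytes);                // separate device address space
            cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice); // explicit staging
            process<<<(n + 255) / 256, 256>>>(d, n);
            cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost); // explicit copy back

            cudaFree(d);
            free(h);
            return 0;
        }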

  9. TensorFlow: A system for large-scale machine learning

    OpenAIRE

    Abadi, Martín; Barham, Paul; Chen, Jianmin; Chen, Zhifeng; Davis, Andy; Dean, Jeffrey; Devin, Matthieu; Ghemawat, Sanjay; Irving, Geoffrey; Isard, Michael; Kudlur, Manjunath; Levenberg, Josh; Monga, Rajat; Moore, Sherry; Murray, Derek G.

    2016-01-01

    TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments. TensorFlow uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. It maps the nodes of a dataflow graph across many machines in a cluster, and within a machine across multiple computational devices, including multicore CPUs, general-purpose GPUs, and custom designed ASICs known as Tensor Processing Units (TPUs). This architecture gives flexib...

  10. A Generic High-performance GPU-based Library for PDE solvers

    DEFF Research Database (Denmark)

    Glimberg, Stefan Lemvig; Engsig-Karup, Allan Peter

... the privilege of high-performance parallel computing is now in principle accessible for many scientific users, no matter their economic resources. Though being highly effective units, GPUs and parallel architectures in general pose challenges for software developers seeking to utilize them efficiently. Sequential ... legacy codes are not always easily parallelized, and the time spent on conversion might not pay off in the end. We present a highly generic C++ library for fast assembly of partial differential equation (PDE) solvers, aiming at utilizing the computational resources of GPUs. The library requires a minimum ... of GPU computing knowledge, while still offering the possibility to customize user-specific solvers at kernel level if desired. Spatial differential operators are based on matrix-free flexible-order finite difference approximations. These matrix-free operators minimize both memory consumption and main memory access ...
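
    A matrix-free operator of the kind described stores no matrix entries at all: each thread applies the stencil weights on the fly. A hedged second-order 1-D sketch (the library's flexible-order operators generalize the weights and the boundary handling):

        // Matrix-free 1-D Laplacian: the stencil (1, -2, 1)/h^2 is applied
        // directly, so memory traffic is limited to the vectors u and out.
        __global__ void laplacian1D(const double* u, double* out,
                                    int n, double h)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i <= 0 || i >= n - 1) return;     // interior points only
            out[i] = (u[i - 1] - 2.0 * u[i] + u[i + 1]) / (h * h);
        }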

  11. Minimalism in architecture: Architecture as a language of its identity

    Directory of Open Access Journals (Sweden)

    Vasilski Dragana

    2012-01-01

Full Text Available Every architectural work is created on a principle that includes meaning, and the work is then read as an artifact of that particular meaning. The resources by which the meaning is primarily built, susceptible to transformation, as well as the routing of understanding (decoding) of messages carried by a work of architecture, are the subject of semiotics and communication theories, which have played a significant role for architecture and the architect. Minimalism in architecture, as a paradigm of XXI-century architecture, means searching for the essence located in the irreducible minimum. The inspired use of architectural units (archetypical elements), through the phantasm of simplicity, assumes the primary responsibility for providing the object's identity, because it participates in forming the language and therefore in its reading. Volume is formed by a clean language that builds the expression of fluid areas liberated from recharge needs. A reduced architectural language is appropriate to an age marked by electronic communications.

  12. State-of-the-art in Heterogeneous Computing

    Directory of Open Access Journals (Sweden)

    Andre R. Brodtkorb

    2010-01-01

Full Text Available Node-level heterogeneous architectures have become attractive during the last decade for several reasons: compared to traditional symmetric CPUs, they offer high peak performance and are energy and/or cost efficient. With the increase of fine-grained parallelism in high-performance computing, as well as the introduction of parallelism in workstations, there is an acute need for a good overview and understanding of these architectures. We give an overview of the state of the art in heterogeneous computing, focusing on three commonly found architectures: the Cell Broadband Engine Architecture, graphics processing units (GPUs), and field-programmable gate arrays (FPGAs). We present a review of hardware, available software tools, and an overview of state-of-the-art techniques and algorithms. Furthermore, we present a qualitative and quantitative comparison of the architectures, and give our view on the future of heterogeneous computing.

  13. Implementation of collisions on GPU architecture in the Vorpal code

    Science.gov (United States)

    Leddy, Jarrod; Averkin, Sergey; Cowan, Ben; Sides, Scott; Werner, Greg; Cary, John

    2017-10-01

The Vorpal code contains a variety of collision operators allowing for the simulation of plasmas containing multiple charge species interacting with neutrals, background gas, and EM fields. These existing algorithms have been improved and reimplemented to take advantage of the massive parallelization allowed by GPU architecture. The use of GPUs is most effective when algorithms are single-instruction multiple-data, so particle collisions are an ideal candidate for this parallelization technique due to their nature as a series of independent processes with the same underlying operation. This refactoring required data memory reorganization and careful consideration of device/host data allocation to minimize memory access and data communication per operation. Successful implementation has resulted in an order-of-magnitude increase in simulation speed for a test case involving multiple binary collisions using the null collision method. Work supported by DARPA under contract W31P4Q-16-C-0009.
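
    The null collision method mentioned above keeps per-particle work uniform by padding the real collision rate up to a constant majorant, so every thread executes the same sampling step and warps stay converged. A hedged sketch with illustrative names and a placeholder scattering update (the cuRAND states are assumed to have been initialized elsewhere):

        #include <curand_kernel.h>

        // One thread per particle. Each particle "collides" at the shared
        // majorant rate nuMax; with probability nuReal/nuMax the collision
        // is real, otherwise it is a null event and the state is unchanged.
        __global__ void nullCollide(float* vx, const float* nuReal,
                                    float nuMax, curandState* rng, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= n) return;
            float r = curand_uniform(&rng[i]);
            if (r < nuReal[i] / nuMax) {
                // real collision: placeholder update that rescales one
                // velocity component by a random factor in [-1, 1]
                vx[i] *= 2.0f * curand_uniform(&rng[i]) - 1.0f;
            }
            // else: null collision, particle unchanged by construction
        }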

  14. Micromagnetic simulations using Graphics Processing Units

    International Nuclear Information System (INIS)

    Lopez-Diaz, L; Aurelio, D; Torres, L; Martinez, E; Hernandez-Lopez, M A; Gomez, J; Alejos, O; Carpentieri, M; Finocchio, G; Consolo, G

    2012-01-01

    The methodology for adapting a standard micromagnetic code to run on graphics processing units (GPUs) and exploit the potential for parallel calculations of this platform is discussed. GPMagnet, a general purpose finite-difference GPU-based micromagnetic tool, is used as an example. Speed-up factors of two orders of magnitude can be achieved with GPMagnet with respect to a serial code. This allows for running extensive simulations, nearly inaccessible with a standard micromagnetic solver, at reasonable computational times. (topical review)

  15. Parallel Block Structured Adaptive Mesh Refinement on Graphics Processing Units

    Energy Technology Data Exchange (ETDEWEB)

    Beckingsale, D. A. [Atomic Weapons Establishment (AWE), Aldermaston (United Kingdom); Gaudin, W. P. [Atomic Weapons Establishment (AWE), Aldermaston (United Kingdom); Hornung, R. D. [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Gunney, B. T. [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Gamblin, T. [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Herdman, J. A. [Atomic Weapons Establishment (AWE), Aldermaston (United Kingdom); Jarvis, S. A. [Atomic Weapons Establishment (AWE), Aldermaston (United Kingdom)

    2014-11-17

    Block-structured adaptive mesh refinement is a technique that can be used when solving partial differential equations to reduce the number of zones necessary to achieve the required accuracy in areas of interest. These areas (shock fronts, material interfaces, etc.) are recursively covered with finer mesh patches that are grouped into a hierarchy of refinement levels. Despite the potential for large savings in computational requirements and memory usage without a corresponding reduction in accuracy, AMR adds overhead in managing the mesh hierarchy, adding complex communication and data movement requirements to a simulation. In this paper, we describe the design and implementation of a native GPU-based AMR library, including: the classes used to manage data on a mesh patch, the routines used for transferring data between GPUs on different nodes, and the data-parallel operators developed to coarsen and refine mesh data. We validate the performance and accuracy of our implementation using three test problems and two architectures: an eight-node cluster, and over four thousand nodes of Oak Ridge National Laboratory’s Titan supercomputer. Our GPU-based AMR hydrodynamics code performs up to 4.87× faster than the CPU-based implementation, and has been scaled to over four thousand GPUs using a combination of MPI and CUDA.
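
    As a rough illustration of the data-parallel coarsen and refine operators mentioned above, the sketch below uses a 2:1 refinement ratio with cell-centered averaging and piecewise-constant injection; these specific stencils are assumptions for illustration, not necessarily the ones used in the library.

        import numpy as np

        def coarsen(fine):
            """Average 2x2 groups of fine cells into one coarse cell (even sizes)."""
            return 0.25 * (fine[0::2, 0::2] + fine[1::2, 0::2]
                           + fine[0::2, 1::2] + fine[1::2, 1::2])

        def refine(coarse):
            """Piecewise-constant injection of coarse cells onto a finer patch."""
            return np.repeat(np.repeat(coarse, 2, axis=0), 2, axis=1)

        patch = np.random.rand(8, 8)
        # Injection followed by averaging reproduces the coarse data exactly.
        assert np.allclose(coarsen(refine(patch)), patch)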

  16. Current and Future Development of a Non-hydrostatic Unified Atmospheric Model (NUMA)

    Science.gov (United States)

    2010-09-09

    [Slide fragments] The model targets the following capabilities: (1) highly scalable on current and future computer architectures (exascale computing and beyond, and GPUs); (2) flexibility... Exascale computing: 10 of the Top 500 systems are already in the petascale range; GPU-based systems (e.g., Mare Nostrum) should also be watched. 2. Numerical...

  17. Architecture of security management unit for safe hosting of multiple agents

    Science.gov (United States)

    Gilmont, Tanguy; Legat, Jean-Didier; Quisquater, Jean-Jacques

    1999-04-01

    In such growing areas as remote applications in large public networks, electronic commerce, digital signature, intellectual property and copyright protection, and even operating system extensibility, the hardware security level offered by existing processors is insufficient. They lack protection mechanisms that prevent the user from tampering with critical data owned by those applications. Some devices are exceptions, but they have neither enough processing power nor enough memory to stand up to such applications (e.g. smart cards). This paper proposes an architecture of secure processor, in which the classical memory management unit is extended into a new security management unit. It allows ciphered code execution and ciphered data processing. An internal permanent memory can store cipher keys and critical data for several client agents simultaneously. The ordinary supervisor privilege scheme is replaced by a privilege inheritance mechanism that is better suited to operating system extensibility. The result is a secure processor that has hardware support for extensible multitask operating systems, and can be used for both general applications and critical applications needing strong protection. The security management unit and the internal permanent memory can be added to an existing CPU core without loss of performance, and do not require it to be modified.

  18. Accelerating large-scale protein structure alignments with graphics processing units

    Directory of Open Access Journals (Sweden)

    Pang Bin

    2012-02-01

    Full Text Available Abstract Background Large-scale protein structure alignment, an indispensable tool to structural bioinformatics, poses a tremendous challenge on computational resources. To ensure structure alignment accuracy and efficiency, efforts have been made to parallelize traditional alignment algorithms in grid environments. However, these solutions are costly and of limited accessibility. Others trade alignment quality for speedup by using high-level characteristics of structure fragments for structure comparisons. Findings We present ppsAlign, a parallel protein structure Alignment framework designed and optimized to exploit the parallelism of Graphics Processing Units (GPUs). As a general-purpose GPU platform, ppsAlign could take many concurrent methods, such as TM-align and Fr-TM-align, into the parallelized algorithm design. We evaluated ppsAlign on an NVIDIA Tesla C2050 GPU card, and compared it with existing software solutions running on an AMD dual-core CPU. We observed a 36-fold speedup over TM-align, a 65-fold speedup over Fr-TM-align, and a 40-fold speedup over MAMMOTH. Conclusions ppsAlign is a high-performance protein structure alignment tool designed to tackle the computational complexity issues from protein structural data. The solution presented in this paper allows large-scale structure comparisons to be performed using the massive parallel computing power of GPUs.

  19. Efficient particle-in-cell simulation of auroral plasma phenomena using a CUDA enabled graphics processing unit

    Science.gov (United States)

    Sewell, Stephen

    This thesis introduces a software framework that effectively utilizes low-cost commercially available Graphic Processing Units (GPUs) to simulate complex scientific plasma phenomena that are modeled using the Particle-In-Cell (PIC) paradigm. The software framework that was developed conforms to the Compute Unified Device Architecture (CUDA), a standard for general purpose graphic processing that was introduced by NVIDIA Corporation. This framework has been verified for correctness and applied to advance the state of understanding of the electromagnetic aspects of the development of the Aurora Borealis and Aurora Australis. For each phase of the PIC methodology, this research has identified one or more methods to exploit the problem's natural parallelism and effectively map it for execution on the graphic processing unit and its host processor. The sources of overhead that can reduce the effectiveness of parallelization for each of these methods have also been identified. One of the novel aspects of this research was the utilization of particle sorting during the grid interpolation phase. The final representation resulted in simulations that executed about 38 times faster than simulations that were run on a single-core general-purpose processing system. The scalability of this framework to larger problem sizes and future generation systems has also been investigated.
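
    The particle-sorting idea mentioned above can be illustrated with a minimal NumPy sketch of 1D cloud-in-cell charge deposition: particles are first ordered by cell index so that neighbouring threads touch neighbouring grid memory. The deposition scheme and names are illustrative assumptions, not the thesis code; on a GPU the argsort would become a radix sort and np.add.at would become atomic adds.

        import numpy as np

        def deposit_charge(x, q, nx, dx):
            """Sort particles by cell, then do 1D cloud-in-cell charge deposition."""
            cell = np.minimum((x / dx).astype(int), nx - 1)
            order = np.argsort(cell)          # on a GPU: a radix sort on the cell key
            x, q, cell = x[order], q[order], cell[order]
            w = x / dx - cell                 # fractional position inside the cell
            rho = np.zeros(nx + 1)
            np.add.at(rho, cell, q * (1.0 - w))   # scatter-add; atomicAdd on a GPU
            np.add.at(rho, cell + 1, q * w)
            return rho / dx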

  20. Assembly of finite element methods on graphics processors

    KAUST Repository

    Cecka, Cris

    2010-08-23

    Recently, graphics processing units (GPUs) have had great success in accelerating many numerical computations. We present their application to computations on unstructured meshes such as those in finite element methods. Multiple approaches in assembling and solving sparse linear systems with NVIDIA GPUs and the Compute Unified Device Architecture (CUDA) are created and analyzed. Multiple strategies for efficient use of global, shared, and local memory, methods to achieve memory coalescing, and optimal choice of parameters are introduced. We find that with appropriate preprocessing and arrangement of support data, the GPU coprocessor using single-precision arithmetic achieves speedups of 30 or more in comparison to a well optimized double-precision single core implementation. We also find that the optimal assembly strategy depends on the order of polynomials used in the finite element discretization. © 2010 John Wiley & Sons, Ltd.
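
    As a point of reference for the assembly strategies discussed above, the sketch below shows the basic pattern in NumPy, assuming linear triangular elements, a dense global matrix, and a user-supplied local stiffness routine (all illustrative, not the paper's implementation): each element produces a small dense matrix that is scatter-added into the global operator, which is exactly the step where coalescing and shared-memory strategies matter on a GPU.

        import numpy as np

        def assemble(nodes, elements, local_stiffness):
            """Assemble a global FEM matrix by scatter-adding element matrices.

            elements : (ne, 3) array of node indices of linear triangles
            local_stiffness : function mapping 3 node coordinates to a 3x3 matrix
            """
            K = np.zeros((len(nodes), len(nodes)))
            for conn in elements:                  # one thread (block) per element
                ke = local_stiffness(nodes[conn])  # small dense local computation
                # Scatter-add into the global operator; on a GPU this is where
                # atomics, coloring, or a gather-based strategy is required.
                K[np.ix_(conn, conn)] += ke
            return K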

  1. Molecular Monte Carlo Simulations Using Graphics Processing Units: To Waste Recycle or Not?

    Science.gov (United States)

    Kim, Jihan; Rodgers, Jocelyn M; Athènes, Manuel; Smit, Berend

    2011-10-11

    In the waste recycling Monte Carlo (WRMC) algorithm, (1) multiple trial states may be simultaneously generated and utilized during Monte Carlo moves to improve the statistical accuracy of the simulations, suggesting that such an algorithm may be well posed for implementation in parallel on graphics processing units (GPUs). In this paper, we implement two waste recycling Monte Carlo algorithms in CUDA (Compute Unified Device Architecture) using uniformly distributed random trial states and trial states based on displacement random-walk steps, and we test the methods on a methane-zeolite MFI framework system to evaluate their utility. We discuss the specific implementation details of the waste recycling GPU algorithm and compare the methods to other parallel algorithms optimized for the framework system. We analyze the relationship between the statistical accuracy of our simulations and the CUDA block size to determine the efficient allocation of the GPU hardware resources. We make comparisons between the GPU and the serial CPU Monte Carlo implementations to assess speedup over conventional microprocessors. Finally, we apply our optimized GPU algorithms to the important problem of determining free energy landscapes, in this case for molecular motion through the zeolite LTA.
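
    A minimal sketch of waste recycling with uniformly distributed trial states, for an illustrative 1D double-well potential rather than the paper's zeolite system: every trial state contributes to the estimator, weighted by its selection probability, instead of only the accepted one.

        import numpy as np

        rng = np.random.default_rng(1)
        beta, K, steps = 1.0, 32, 20000
        U = lambda x: (x ** 2 - 1.0) ** 2      # illustrative double-well potential

        x, acc = 0.0, 0.0
        for step in range(steps):
            trials = np.append(rng.uniform(-2.0, 2.0, K), x)  # K trials + current
            w = np.exp(-beta * U(trials))
            p = w / w.sum()
            # Waste recycling: all trial states enter the average, weighted by
            # their selection probabilities, rather than only the accepted one.
            acc += p @ trials ** 2
            x = rng.choice(trials, p=p)        # heat-bath choice of the next state

        print("<x^2> ~", acc / steps)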

  2. Batched QR and SVD Algorithms on GPUs with Applications in Hierarchical Matrix Compression

    KAUST Repository

    Halim Boukaram, Wajih

    2017-09-14

    We present high performance implementations of the QR and the singular value decomposition of a batch of small matrices hosted on the GPU with applications in the compression of hierarchical matrices. The one-sided Jacobi algorithm is used for its simplicity and inherent parallelism as a building block for the SVD of low rank blocks using randomized methods. We implement multiple kernels based on the level of the GPU memory hierarchy in which the matrices can reside and show substantial speedups against streamed cuSOLVER SVDs. The resulting batched routine is a key component of hierarchical matrix compression, opening up opportunities to perform H-matrix arithmetic efficiently on GPUs.
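
    For reference, the one-sided Jacobi algorithm named above is short enough to sketch in NumPy for a single small matrix; the batched GPU routine runs many instances of essentially this loop concurrently, one per thread block. The sweep count and tolerance here are illustrative.

        import numpy as np

        def jacobi_svd(A, sweeps=30, tol=1e-12):
            """One-sided Jacobi SVD of one small matrix: A ~ U @ np.diag(s) @ V.T."""
            U = A.astype(float).copy()
            n = U.shape[1]
            V = np.eye(n)
            for _ in range(sweeps):
                off = 0.0
                for i in range(n - 1):
                    for j in range(i + 1, n):
                        alpha = U[:, i] @ U[:, i]
                        beta = U[:, j] @ U[:, j]
                        gamma = U[:, i] @ U[:, j]
                        off = max(off, abs(gamma) / np.sqrt(alpha * beta))
                        if abs(gamma) < tol:
                            continue
                        # Jacobi rotation that orthogonalizes columns i and j.
                        zeta = (beta - alpha) / (2.0 * gamma)
                        sgn = 1.0 if zeta >= 0.0 else -1.0
                        t = sgn / (abs(zeta) + np.sqrt(1.0 + zeta * zeta))
                        c = 1.0 / np.sqrt(1.0 + t * t)
                        G = np.array([[c, c * t], [-c * t, c]])
                        U[:, [i, j]] = U[:, [i, j]] @ G
                        V[:, [i, j]] = V[:, [i, j]] @ G
                if off < tol:
                    break
            s = np.linalg.norm(U, axis=0)
            return U / s, s, V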

  3. Batched QR and SVD Algorithms on GPUs with Applications in Hierarchical Matrix Compression

    KAUST Repository

    Halim Boukaram, Wajih; Turkiyyah, George; Ltaief, Hatem; Keyes, David E.

    2017-01-01

    We present high performance implementations of the QR and the singular value decomposition of a batch of small matrices hosted on the GPU with applications in the compression of hierarchical matrices. The one-sided Jacobi algorithm is used for its simplicity and inherent parallelism as a building block for the SVD of low rank blocks using randomized methods. We implement multiple kernels based on the level of the GPU memory hierarchy in which the matrices can reside and show substantial speedups against streamed cuSOLVER SVDs. The resulting batched routine is a key component of hierarchical matrix compression, opening up opportunities to perform H-matrix arithmetic efficiently on GPUs.

  4. Real-time processing for full-range Fourier-domain optical-coherence tomography with zero-filling interpolation using multiple graphic processing units.

    Science.gov (United States)

    Watanabe, Yuuki; Maeno, Seiya; Aoshima, Kenji; Hasegawa, Haruyuki; Koseki, Hitoshi

    2010-09-01

    The real-time display of full-range, 2048 axial pixel × 1024 lateral pixel, Fourier-domain optical-coherence tomography (FD-OCT) images is demonstrated. The required speed was achieved by using dual graphic processing units (GPUs) with many stream processors to realize highly parallel processing. We used a zero-filling technique, including a forward Fourier transform, a zero padding to increase the axial data-array size to 8192, an inverse Fourier transform back to the spectral domain, a linear interpolation from wavelength to wavenumber, a lateral Hilbert transform to obtain the complex spectrum, a Fourier transform to obtain the axial profiles, and a log scaling. The data-transfer time of the frame grabber was 15.73 ms, and the processing time, which includes the data transfer between the GPU memory and the host computer, was 14.75 ms, for a total time shorter than the 36.70 ms frame-interval time using a line-scan CCD camera operated at 27.9 kHz. That is, our OCT system achieved a processed-image display rate of 27.23 frames/s.
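
    A rough NumPy sketch of the zero-filling pipeline enumerated above, for a single frame. The wavelength grid and the simple analytic-signal construction are assumptions for illustration; the real system implements each stage with GPU FFT kernels streamed across the two cards.

        import numpy as np

        def fdoct_frame(spectra, lam):
            """Zero-filling FD-OCT pipeline for one frame.

            spectra : (n_lines, 2048) raw spectra, uniformly sampled in
                      wavelength lam (lam increasing); n_lines assumed even.
            """
            n_lines, n_pix = spectra.shape
            n_fill = 8192
            # Steps 1-3: forward FFT, zero padding to 8192, inverse FFT back
            # to the (now oversampled) spectral domain.
            f = np.fft.fft(spectra, axis=1)
            pad = np.zeros((n_lines, n_fill), dtype=complex)
            pad[:, :n_pix // 2] = f[:, :n_pix // 2]
            pad[:, -n_pix // 2:] = f[:, -n_pix // 2:]
            dense = np.fft.ifft(pad, axis=1).real * (n_fill / n_pix)
            # Step 4: linear interpolation from wavelength to wavenumber.
            lam_dense = np.linspace(lam[0], lam[-1], n_fill)
            k_src = 2.0 * np.pi / lam_dense[::-1]          # increasing k grid
            k_dst = np.linspace(k_src[0], k_src[-1], n_fill)
            resampled = np.stack([np.interp(k_dst, k_src, row[::-1])
                                  for row in dense])
            # Step 5: lateral Hilbert transform -> complex spectrum (full range).
            F = np.fft.fft(resampled, axis=0)
            h = np.zeros(n_lines)
            h[0], h[n_lines // 2] = 1.0, 1.0
            h[1:n_lines // 2] = 2.0
            analytic = np.fft.ifft(F * h[:, None], axis=0)
            # Steps 6-7: axial FFT and log-scaled magnitude image.
            return 20.0 * np.log10(np.abs(np.fft.fft(analytic, axis=1)) + 1e-12)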

  5. MT-ADRES: Multithreading on Coarse-Grained Reconfigurable Architecture

    DEFF Research Database (Denmark)

    Wu, Kehuai; Kanstein, Andreas; Madsen, Jan

    2007-01-01

    The coarse-grained reconfigurable architecture ADRES (Architecture for Dynamically Reconfigurable Embedded Systems) and its compiler offer high instruction-level parallelism (ILP) to applications by means of a sparsely interconnected array of functional units and register files. As high-ILP architectures achieve only low parallelism when executing partially sequential code segments, which is also known as Amdahl's law, this paper proposes to extend ADRES to MT-ADRES (Multi-Threaded ADRES) to also exploit thread-level parallelism. On MT-ADRES architectures, the array can be partitioned in multiple...

  6. Computing OpenSURF on OpenCL and General Purpose GPU

    Directory of Open Access Journals (Sweden)

    Wanglong Yan

    2013-10-01

    Full Text Available The Speeded-Up Robust Feature (SURF) algorithm is widely used for image feature detecting and matching in the computer vision area. Open Computing Language (OpenCL) is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors. This paper introduces how to implement an open-sourced SURF program, namely OpenSURF, on a general purpose GPU with OpenCL, and discusses the optimizations in terms of the thread architectures and memory models in detail. Our final OpenCL implementation of OpenSURF is on average 37% and 64% faster than the OpenCV SURF v2.4.5 CUDA implementation on NVidia's GTX660 and GTX460SE GPUs, respectively. Our OpenCL program achieved real-time performance (>25 frames per second) for almost all the input images with different sizes from 320*240 to 1024*768 on NVidia's GTX660 GPU, NVidia's GTX460SE GPU and AMD's Radeon HD 6850 GPU. Our OpenCL approach on NVidia's GTX660 GPU is more than 22.8 times faster than its original CPU version on Intel's Dual-Core E5400 2.7G on average.

  7. Multi-Unit Initiating Event Analysis for a Single-Unit Internal Events Level 1 PSA

    Energy Technology Data Exchange (ETDEWEB)

    Kim, Dong San; Park, Jin Hee; Lim, Ho Gon [KAERI, Daejeon (Korea, Republic of)

    2016-05-15

    The Fukushima nuclear accident in 2011 highlighted the importance of considering the risks from multi-unit accidents at a site. The ASME/ANS probabilistic risk assessment (PRA) standard also includes some requirements related to multi-unit aspects, one of which (IE-B5) is as follows: 'For multi-unit sites with shared systems, DO NOT SUBSUME multi-unit initiating events if they impact mitigation capability [1].' However, the existing single-unit PSA models do not explicitly consider multi-unit initiating events and hence systems shared by multiple units (e.g., alternate AC diesel generator) are fully credited for the single unit, ignoring the need for the shared systems by other units at the same site [2]. This paper describes the results of the multi-unit initiating event (IE) analysis performed as a part of the at-power internal events Level 1 probabilistic safety assessment (PSA) for an OPR1000 single unit ('reference unit'). In this study, a multi-unit initiating event analysis for a single-unit PSA was performed, and using the results, a dual-unit LOOP initiating event was added to the existing PSA model for the reference unit (OPR1000 type). Event trees were developed for dual-unit LOOP and dual-unit SBO, which can be transferred from dual-unit LOOP. Moreover, CCF basic events for 5 diesel generators were modelled. In case of simultaneous SBO occurrences in both units, this study compared two different assumptions on the availability of the AAC D/G. As a result, when the dual-unit LOOP initiating event was added to the existing single-unit PSA model, the total CDF increased by 1-2% depending on the probability that the AAC D/G is available to a specific unit in case of simultaneous SBO in both units.

  8. Asynchronous Task-Based Polar Decomposition on Manycore Architectures

    KAUST Repository

    Sukkari, Dalal

    2016-10-25

    This paper introduces the first asynchronous, task-based implementation of the polar decomposition on manycore architectures. Based on a new formulation of the iterative QR dynamically-weighted Halley algorithm (QDWH) for the calculation of the polar decomposition, the proposed implementation replaces the original and hostile LU factorization for the condition number estimator by the more adequate QR factorization to enable software portability across various architectures. Relying on fine-grained computations, the novel task-based implementation is also capable of taking advantage of the identity structure of the matrix involved during the QDWH iterations, which decreases the overall algorithmic complexity. Furthermore, the artifactual synchronization points have been severely weakened compared to previous implementations, unveiling look-ahead opportunities for better hardware occupancy. The overall QDWH-based polar decomposition can then be represented as a directed acyclic graph (DAG), where nodes represent computational tasks and edges define the inter-task data dependencies. The StarPU dynamic runtime system is employed to traverse the DAG, to track the various data dependencies and to asynchronously schedule the computational tasks on the underlying hardware resources, resulting in an out-of-order task scheduling. Benchmarking experiments show significant improvements against existing state-of-the-art high performance implementations (i.e., Intel MKL and Elemental) for the polar decomposition on latest shared-memory vendors' systems (i.e., Intel Haswell/Broadwell/Knights Landing, NVIDIA K80/P100 GPUs and IBM Power8), while maintaining high numerical accuracy.
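
    The QR-based iteration at the heart of QDWH is compact. The sketch below shows it with the fixed Halley weights (a, b, c) = (3, 1, 3), which are the limiting values of QDWH's dynamically chosen weights; this is a simplified CPU illustration of the iteration, not the paper's task-based implementation.

        import numpy as np

        def polar_halley(A, iters=50, tol=1e-12):
            """Polar decomposition A = U @ H via QR-based Halley iterations.

            Fixed weights (a, b, c) = (3, 1, 3); QDWH instead recomputes the
            weights each step from a bound on the smallest singular value.
            """
            m, n = A.shape
            X = A / np.linalg.norm(A, 2)       # scale so singular values <= 1
            a, b, c = 3.0, 1.0, 3.0
            for _ in range(iters):
                Q, _ = np.linalg.qr(np.vstack((np.sqrt(c) * X, np.eye(n))))
                Q1, Q2 = Q[:m], Q[m:]
                X_next = (b / c) * X + (a - b / c) / np.sqrt(c) * (Q1 @ Q2.T)
                if np.linalg.norm(X_next - X) < tol:
                    X = X_next
                    break
                X = X_next
            H = X.T @ A
            return X, 0.5 * (H + H.T)          # symmetrize the Hermitian factor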

  9. Re-Architecture: Reality or Utopia?

    NARCIS (Netherlands)

    Pereira-Roders, A.R.; Post, J.M.; Erkelens, P.A.

    2008-01-01

    RE-ARCHITECTURE: lifespan rehabilitation of built heritage (2004-2007) is a doctoral research funded by the Foundation for Science and Technology, Portugal; and hosted by the Unit Architectural Design and Engineering, Eindhoven University of Technology, The Netherlands. This doctoral research is now

  10. A GPU-accelerated semi-implicit fractional step method for numerical solutions of incompressible Navier-Stokes equations

    Science.gov (United States)

    Ha, Sanghyun; Park, Junshin; You, Donghyun

    2017-11-01

    Utility of the computational power of modern Graphics Processing Units (GPUs) is elaborated for solutions of incompressible Navier-Stokes equations which are integrated using a semi-implicit fractional-step method. Due to its serial and bandwidth-bound nature, the present choice of numerical methods is considered to be a good candidate for evaluating the potential of GPUs for solving Navier-Stokes equations using non-explicit time integration. An efficient algorithm is presented for GPU acceleration of the Alternating Direction Implicit (ADI) and the Fourier-transform-based direct solution method used in the semi-implicit fractional-step method. OpenMP is employed for concurrent collection of turbulence statistics on a CPU while the Navier-Stokes equations are computed on a GPU. Extension to multiple NVIDIA GPUs is implemented using NVLink supported by the Pascal architecture. Performance of the present method is measured on multiple Tesla P100 GPUs and compared with a single-core Xeon E5-2650 v4 CPU in simulations of boundary-layer flow over a flat plate. Supported by the National Research Foundation of Korea (NRF) Grant funded by the Korea government (Ministry of Science, ICT and Future Planning NRF-2016R1E1A2A01939553, NRF-2014R1A2A1A11049599, and Ministry of Trade, Industry and Energy 201611101000230).
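
    At the core of an ADI sweep is a large batch of independent tridiagonal solves. Below is a minimal Thomas-algorithm sketch for one such system (illustrative; the paper's GPU algorithm is concerned precisely with executing many of these grid lines concurrently).

        import numpy as np

        def thomas(a, b, c, d):
            """Solve one tridiagonal system (Thomas algorithm).

            a : (n-1,) sub-diagonal, b : (n,) diagonal,
            c : (n-1,) super-diagonal, d : (n,) right-hand side.
            """
            n = len(b)
            cp, dp = np.empty(n - 1), np.empty(n)
            cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
            for i in range(1, n):
                m = b[i] - a[i - 1] * cp[i - 1]
                if i < n - 1:
                    cp[i] = c[i] / m
                dp[i] = (d[i] - a[i - 1] * dp[i - 1]) / m
            x = np.empty(n)
            x[-1] = dp[-1]
            for i in range(n - 2, -1, -1):
                x[i] = dp[i] - cp[i] * x[i + 1]
            return x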

  11. Boosting proactive learning with ILIAS&WebQuest: learning to parallelize algorithms with GPUs

    OpenAIRE

    Santamaría, J.; Espinilla, M.; Rivera, A. J.; Romero, S.

    2010-01-01

    Computer Architecture is a core second-cycle course in the Telecommunication Engineering degree (2004 curriculum) at the University of Jaén which, since the 2009/10 academic year, has used a proactive learning methodology to motivate students in their lab assignments. Specifically, the subject of algorithm parallelization has been taught using the GPUs of conventional graphics cards. In addition, ...

  12. Real-time animation of facial wrinkles exploiting modern GPUs

    OpenAIRE

    Clausius Duque Gonçalves Reis

    2010-01-01

    Abstract: Modeling and animating facial wrinkles have been challenging tasks, owing to the variety of conformations and subtleties of detail that wrinkles can exhibit. This work describes two methods for displaying wrinkles in real time using modern GPUs. Both methods are based on the use of GPU shaders and on a normal mapping approach to apply wrinkles to virtual models. The first method uses areas of influence described by texture maps to cal...

  13. Scaling Deep Learning Workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing

    Energy Technology Data Exchange (ETDEWEB)

    Gawande, Nitin A.; Landwehr, Joshua B.; Daily, Jeffrey A.; Tallent, Nathan R.; Vishnu, Abhinav; Kerbyson, Darren J.

    2017-07-03

    Deep Learning (DL) algorithms have become ubiquitous in data analytics. As a result, major computing vendors --- including NVIDIA, Intel, AMD and IBM --- have architectural road-maps influenced by DL workloads. Furthermore, several vendors have recently advertised new computing products as accelerating DL workloads. Unfortunately, it is difficult for data scientists to quantify the potential of these different products. This paper provides a performance and power analysis of important DL workloads on two major parallel architectures: NVIDIA DGX-1 (eight Pascal P100 GPUs interconnected with NVLink) and Intel Knights Landing (KNL) CPUs interconnected with Intel Omni-Path. Our evaluation consists of a cross section of convolutional neural net workloads: CifarNet, CaffeNet, AlexNet and GoogLeNet topologies using the Cifar10 and ImageNet datasets. The workloads are vendor-optimized for each architecture. Our analysis indicates that although GPUs provide the highest overall raw performance, the gap can close for some convolutional networks; and KNL can be competitive when considering performance/watt. Furthermore, NVLink is critical to GPU scaling.

  14. Efficient Support for Matrix Computations on Heterogeneous Multi-core and Multi-GPU Architectures

    Energy Technology Data Exchange (ETDEWEB)

    Dong, Fengguang [Univ. of Tennessee, Knoxville, TN (United States); Tomov, Stanimire [Univ. of Tennessee, Knoxville, TN (United States); Dongarra, Jack [Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)

    2011-06-01

    We present a new methodology for utilizing all CPU cores and all GPUs on a heterogeneous multicore and multi-GPU system to support matrix computations efficiently. Our approach is able to achieve the objectives of a high degree of parallelism, minimized synchronization, minimized communication, and load balancing. Our main idea is to treat the heterogeneous system as a distributed-memory machine, and to use a heterogeneous 1-D block cyclic distribution to allocate data to the host system and GPUs to minimize communication. We have designed heterogeneous algorithms with two different tile sizes (one for CPU cores and the other for GPUs) to cope with processor heterogeneity. We propose an auto-tuning method to determine the best tile sizes to attain both high performance and load balancing. We have also implemented a new runtime system and applied it to the Cholesky and QR factorizations. Our experiments on a compute node with two Intel Westmere hexa-core CPUs and three Nvidia Fermi GPUs demonstrate good weak scalability, strong scalability, load balance, and efficiency of our approach.

  15. John Hejduk's Pursuit of an Architectural Ethos

    DEFF Research Database (Denmark)

    Søberg, Martin

    2012-01-01

    Reflected, artistic practices and design-based research are drastically expanding fields within architectural academia. However, the interest in uniting theory and practice is not entirely new. Just a few decades ago, before a 'death of theory' was proclaimed, questions of architectural epistemology, of the language(s) of architecture, were indeed of profound interest to the discipline. This essay returns to and examines the investigatory practices of John Hejduk in an attempt to identify a poetic method asserting difference through repetition and primarily grounded in the medium...

  16. Dual Comb Unit High-g Accelerometer Based on CMOS-MEMS Technology

    Directory of Open Access Journals (Sweden)

    Mehrdad Mottaghi

    2009-04-01

    Full Text Available In this paper a capacitive-based high-g accelerometer with a superior level of sensitivity is presented. It takes advantage of a dual comb unit configuration and a surface micromachining fabrication process. All aspects of the mechanical design, such as the sensor structure, modal analysis, energy dissipation, dynamic response and stresses in the moving structure as well as the anchors, are described. The electrical circuit based on CMOS technology and its output signal are presented. Fabrication process and packaging are also discussed. The proposed sensor can endure impact loads up to 120,000 g (g = 9.81 m·s⁻²) and achieves 16.75 µV·g⁻¹ sensitivity with a 5 V bridge excitation voltage. The main resonant frequency of the structure is found to be 42.4 kHz. Intended applications of the suggested sensor include the military and aerospace industries as well as the field of impact engineering.

  17. GPU Accelerated Vector Median Filter

    Science.gov (United States)

    Aras, Rifat; Shen, Yuzhong

    2011-01-01

    Noise reduction is an important step for most image processing tasks. For three-channel color images, a widely used technique is the vector median filter, in which the color values of pixels are treated as 3-component vectors. Vector median filters are computationally expensive; for a window size of n x n, each of the n² vectors has to be compared with the other n² - 1 vectors in terms of distance. General purpose computation on graphics processing units (GPUs) is the paradigm of utilizing high-performance many-core GPU architectures for computation tasks that are normally handled by CPUs. In this work, NVIDIA's Compute Unified Device Architecture (CUDA) paradigm is used to accelerate vector median filtering, which has to the best of our knowledge never been done before. The performance of the GPU accelerated vector median filter is compared to that of the CPU and MPI-based versions for different image and window sizes. Initial findings of the study showed a 100x performance improvement of the vector median filter implementation on GPUs over CPU implementations, and further speed-up is expected after more extensive optimizations of the GPU algorithm.
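
    The per-pixel work described above can be sketched as follows: a naive NumPy version in which one loop iteration stands in for one GPU thread, with the window size and sum-of-distances selection as in the abstract (names are illustrative).

        import numpy as np

        def vector_median_filter(img, w=3):
            """Vector median filter for an HxWx3 image (one pixel per GPU thread)."""
            r = w // 2
            pad = np.pad(img.astype(float), ((r, r), (r, r), (0, 0)), mode='edge')
            out = np.empty(img.shape, dtype=float)
            for y in range(img.shape[0]):
                for x in range(img.shape[1]):
                    win = pad[y:y + w, x:x + w].reshape(-1, 3)  # w*w candidates
                    # Sum of distances from each vector to all the others.
                    d = np.linalg.norm(win[:, None, :] - win[None, :, :],
                                       axis=2).sum(axis=1)
                    out[y, x] = win[np.argmin(d)]  # vector with least total distance
            return out.astype(img.dtype)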

  18. Novel memory architecture for video signal processor

    Science.gov (United States)

    Hung, Jen-Sheng; Lin, Chia-Hsing; Jen, Chein-Wei

    1993-11-01

    An on-chip memory architecture for a video signal processor (VSP) is proposed. This memory structure is a two-level design for the different data localities in video applications. The upper level, Memory A, provides enough storage capacity to reduce the impact of the limited chip I/O bandwidth, and the lower level, Memory B, provides enough data parallelism and flexibility to meet the requirements of multiple reconfigurable pipeline function units in a single VSP chip. The needed memory size is decided by the memory usage analysis for video algorithms and the number of function units. Both levels of memory adopt a dual-port memory scheme to sustain simultaneous read and write operations. In particular, Memory B uses multiple one-read-one-write memory banks to emulate a real multiport memory. Therefore, one can change the configuration of Memory B to several sets of memories with variable read/write ports by adjusting the bus switches. The numbers of read ports and write ports in the proposed memory can then meet the requirements of the data flow patterns in different video coding algorithms. We have finished the design of a prototype memory using 1.2-micrometer SPDM SRAM technology and will fabricate it through TSMC in Taiwan.

  19. Fabrication of scalable tissue engineering scaffolds with dual-pore microarchitecture by combining 3D printing and particle leaching

    Energy Technology Data Exchange (ETDEWEB)

    Mohanty, Soumyaranjan; Sanger, Kuldeep; Heiskanen, Arto [DTU Nanotech, Department of Micro- and Nanotechnology, Technical University of Denmark, Ørsteds Plads, DK-2800 Kgs. Lyngby (Denmark); Trifol, Jon; Szabo, Peter [Danish Polymer Centre, Department of Chemical and Biochemical Engineering, Søltofts Plads, Building 229, DK-2800 Kgs. Lyngby (Denmark); Dufva, Marin; Emnéus, Jenny [DTU Nanotech, Department of Micro- and Nanotechnology, Technical University of Denmark, Ørsteds Plads, DK-2800 Kgs. Lyngby (Denmark); Wolff, Anders, E-mail: anders.wolff@nanotech.dtu.dk [DTU Nanotech, Department of Micro- and Nanotechnology, Technical University of Denmark, Ørsteds Plads, DK-2800 Kgs. Lyngby (Denmark)

    2016-04-01

    Limitations in controlling scaffold architecture using traditional fabrication techniques are a problem when constructing engineered tissues/organs. Recently, integration of two pore architectures to generate dual-pore scaffolds with tailored physical properties has attracted wide attention in the tissue engineering community. Such scaffolds feature primary structured pores which can efficiently enhance nutrient/oxygen supply to the surroundings, in combination with secondary random pores, which give high surface area for cell adhesion and proliferation. Here, we present a new technique to fabricate dual-pore scaffolds for various tissue engineering applications where 3D printing of a poly(vinyl alcohol) (PVA) mould is combined with a salt leaching process. In this technique the sacrificial PVA mould, determining the structured pore architecture, was filled with salt crystals to define the random pore regions of the scaffold. After crosslinking the cast polymer the combined PVA-salt mould was dissolved in water. The technique has advantages over previously reported ones, such as automated assembly of the sacrificial mould, and precise control over pore architecture/dimensions by 3D printing parameters. In this study, polydimethylsiloxane and biodegradable poly(ϵ-caprolactone) were used for fabrication. However, we show that this technique is also suitable for other biocompatible/biodegradable polymers. Various physical and mechanical properties of the dual-pore scaffolds were compared with control scaffolds with either only structured or only random pores, fabricated using previously reported methods. The fabricated dual-pore scaffolds supported high cell density, due to the random pores, in combination with uniform cell distribution throughout the scaffold, and higher cell proliferation and viability due to efficient nutrient/oxygen transport through the structured pores. In conclusion, the described fabrication technique is rapid, inexpensive, scalable, and compatible

  20. Fabrication of scalable tissue engineering scaffolds with dual-pore microarchitecture by combining 3D printing and particle leaching

    International Nuclear Information System (INIS)

    Mohanty, Soumyaranjan; Sanger, Kuldeep; Heiskanen, Arto; Trifol, Jon; Szabo, Peter; Dufva, Marin; Emnéus, Jenny; Wolff, Anders

    2016-01-01

    Limitations in controlling scaffold architecture using traditional fabrication techniques are a problem when constructing engineered tissues/organs. Recently, integration of two pore architectures to generate dual-pore scaffolds with tailored physical properties has attracted wide attention in the tissue engineering community. Such scaffolds feature primary structured pores which can efficiently enhance nutrient/oxygen supply to the surroundings, in combination with secondary random pores, which give high surface area for cell adhesion and proliferation. Here, we present a new technique to fabricate dual-pore scaffolds for various tissue engineering applications where 3D printing of a poly(vinyl alcohol) (PVA) mould is combined with a salt leaching process. In this technique the sacrificial PVA mould, determining the structured pore architecture, was filled with salt crystals to define the random pore regions of the scaffold. After crosslinking the cast polymer the combined PVA-salt mould was dissolved in water. The technique has advantages over previously reported ones, such as automated assembly of the sacrificial mould, and precise control over pore architecture/dimensions by 3D printing parameters. In this study, polydimethylsiloxane and biodegradable poly(ϵ-caprolactone) were used for fabrication. However, we show that this technique is also suitable for other biocompatible/biodegradable polymers. Various physical and mechanical properties of the dual-pore scaffolds were compared with control scaffolds with either only structured or only random pores, fabricated using previously reported methods. The fabricated dual-pore scaffolds supported high cell density, due to the random pores, in combination with uniform cell distribution throughout the scaffold, and higher cell proliferation and viability due to efficient nutrient/oxygen transport through the structured pores. In conclusion, the described fabrication technique is rapid, inexpensive, scalable, and compatible

  1. Optimizing a High Energy Physics (HEP) Toolkit on Heterogeneous Architectures

    CERN Document Server

    Lindal, Yngve Sneen; Jarp, Sverre

    2011-01-01

    A desired trend within high energy physics is to increase particle accelerator luminosities, leading to production of more collision data and higher probabilities of finding interesting physics results. A central data analysis technique used to determine whether results are interesting or not is the maximum likelihood method, and the corresponding evaluation of the negative log-likelihood, which can be computationally expensive. As the amount of data grows, it is important to benefit from the parallelism in modern computers. This, in essence, means exploiting vector registers and all available cores on CPUs, as well as utilizing co-processors such as GPUs. This thesis describes the work done to optimize and parallelize a prototype of a central data analysis tool within the high energy physics community. The work consists of optimizations for multicore processors, GPUs, as well as a mechanism to balance the load between both CPUs and GPUs with the aim to fully exploit the power of modern commodity computers. W...

  2. A Block-Asynchronous Relaxation Method for Graphics Processing Units

    OpenAIRE

    Anzt, H.; Dongarra, J.; Heuveline, Vincent; Tomov, S.

    2011-01-01

    In this paper, we analyze the potential of asynchronous relaxation methods on Graphics Processing Units (GPUs). For this purpose, we developed a set of asynchronous iteration algorithms in CUDA and compared them with a parallel implementation of synchronous relaxation methods on CPU-based systems. For a set of test matrices taken from the University of Florida Matrix Collection we monitor the convergence behavior, the average iteration time and the total time-to-solution time. Analyzing the r...

  3. Fabrication of scalable tissue engineering scaffolds with dual-pore microarchitecture by combining 3D printing and particle leaching

    DEFF Research Database (Denmark)

    Mohanty, Soumyaranjan; Kuldeep, Kuldeep; Heiskanen, Arto

    2016-01-01

    Limitations in controlling scaffold architecture using traditional fabrication techniques are a problem when constructing engineered tissues/organs. Recently, integration of two pore architectures to generate dual-pore scaffolds with tailored physical properties has attracted wide attention...... in the tissue engineering community. Such scaffolds feature primary structured pores which can efficiently enhance nutrient/oxygen supply to the surroundings, in combination with secondary random pores, which give high surface area for cell adhesion and proliferation. Here, we present a new technique...... to fabricate dual-pore scaffolds for various tissue engineering applications where 3D printing of a poly(vinyl alcohol) (PVA) mould is combined with a salt leaching process. In this technique the sacrificial PVA mould, determining the structured pore architecture, was filled with salt crystals to define the random...

  4. OS Friendly Microprocessor Architecture

    Science.gov (United States)

    2017-04-01

    We present an introduction to the patented Operating System Friendly Microprocessor Architecture (OSFA). The software framework to support the hardware-level security features is currently patent pending. [Reference: Jungwirth P, inventor; US Army, assignee. OS Friendly Microprocessor Architecture. United States Patent 9122610. 2015 Sep.] (Note: Patrick La Fratta is now affiliated with Micron Technology, Inc., Boise, Idaho.)

  5. MT-ADRES: multi-threading on coarse-grained reconfigurable architecture

    DEFF Research Database (Denmark)

    Wu, Kehuai; Kanstein, Andreas; Madsen, Jan

    2008-01-01

    The coarse-grained reconfigurable architecture ADRES (architecture for dynamically reconfigurable embedded systems) and its compiler offer high instruction-level parallelism (ILP) to applications by means of a sparsely interconnected array of functional units and register files. As high-ILP architectures achieve only low parallelism when executing partially sequential code segments, which is also known as Amdahl's law, this article proposes to extend ADRES to MT-ADRES (multi-threaded ADRES) to also exploit thread-level parallelism. On MT-ADRES architectures, the array can be partitioned...

  6. Scaling deep learning workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing

    Energy Technology Data Exchange (ETDEWEB)

    Gawande, Nitin A.; Landwehr, Joshua B.; Daily, Jeffrey A.; Tallent, Nathan R.; Vishnu, Abhinav; Kerbyson, Darren J.

    2017-08-24

    Deep Learning (DL) algorithms have become ubiquitous in data analytics. As a result, major computing vendors --- including NVIDIA, Intel, AMD, and IBM --- have architectural road-maps influenced by DL workloads. Furthermore, several vendors have recently advertised new computing products as accelerating large DL workloads. Unfortunately, it is difficult for data scientists to quantify the potential of these different products. This paper provides a performance and power analysis of important DL workloads on two major parallel architectures: NVIDIA DGX-1 (eight Pascal P100 GPUs interconnected with NVLink) and Intel Knights Landing (KNL) CPUs interconnected with Intel Omni-Path or Cray Aries. Our evaluation consists of a cross section of convolutional neural net workloads: CifarNet, AlexNet, GoogLeNet, and ResNet50 topologies using the Cifar10 and ImageNet datasets. The workloads are vendor-optimized for each architecture. Our analysis indicates that although GPUs provide the highest overall performance, the gap can close for some convolutional networks; and the KNL can be competitive in performance/watt. We find that NVLink facilitates scaling efficiency on GPUs. However, its importance is heavily dependent on neural network architecture. Furthermore, for weak-scaling --- sometimes encouraged by restricted GPU memory --- NVLink is less important.

  7. A multiple-pass ring oscillator based dual-loop phase-locked loop

    International Nuclear Information System (INIS)

    Chen Danfeng; Ren Junyan; Deng Jingjing; Li Wei; Li Ning

    2009-01-01

    A dual-loop phase-locked loop (PLL) for wideband operation is proposed. The dual-loop architecture combines a coarse-tuning loop with a fine-tuning one, enabling a wide tuning range and low voltage-controlled oscillator (VCO) gain without poisoning phase noise and reference spur suppression performance. An analysis of the phase noise and reference spur of the dual-loop PLL is emphasized. A novel multiple-pass ring VCO is designed for the dual-loop application. It utilizes both voltage-control and current-control simultaneously in the delay cell. The PLL is fabricated in Jazz 0.18-μm RF CMOS technology. The measured tuning range is from 4.2 to 5.9 GHz. It achieves a low phase noise of -99 dBc/Hz at 1 MHz offset from a 5.5 GHz carrier.

  8. A multiple-pass ring oscillator based dual-loop phase-locked loop

    Energy Technology Data Exchange (ETDEWEB)

    Chen Danfeng; Ren Junyan; Deng Jingjing; Li Wei; Li Ning, E-mail: dfchen@fudan.edu.c [State Key Laboratory of ASIC and System, Fudan University, Shanghai 201203 (China)

    2009-10-15

    A dual-loop phase-locked loop (PLL) for wideband operation is proposed. The dual-loop architecture combines a coarse-tuning loop with a fine-tuning one, enabling a wide tuning range and low voltage-controlled oscillator (VCO) gain without poisoning phase noise and reference spur suppression performance. An analysis of the phase noise and reference spur of the dual-loop PLL is emphasized. A novel multiple-pass ring VCO is designed for the dual-loop application. It utilizes both voltage-control and current-control simultaneously in the delay cell. The PLL is fabricated in Jazz 0.18-μm RF CMOS technology. The measured tuning range is from 4.2 to 5.9 GHz. It achieves a low phase noise of -99 dBc/Hz at 1 MHz offset from a 5.5 GHz carrier.

  9. Towards Small-Sized Long Tail Business with the Dual-Directed Recommendation System

    Science.gov (United States)

    Takahashi, Masakazu; Yamada, Takashi; Tsuda, Kazuhiko; Terano, Takao

    This paper describes a novel architecture to promote retail businesses using information recommendation systems. The main features of the architecture are: 1) a dual-directed recommendation system; 2) a portal site for three kinds of users (producers, retailers, and consumers, who together are considered prosumers); and 3) an agent-based implementation. We have developed a web-based system, DAIKOC (Dynamic Advisor for Information and Knowledge Oriented Communities), with the above architecture. In this paper, we focus on the recommendation functions, which extract the items that will achieve large sales in the future from ID (IDentification)-POS (Point-Of-Sales) data.

  10. Monte Carlo MP2 on Many Graphical Processing Units.

    Science.gov (United States)

    Doran, Alexander E; Hirata, So

    2016-10-11

    In the Monte Carlo second-order many-body perturbation (MC-MP2) method, the long sum-of-product matrix expression of the MP2 energy, whose literal evaluation may be poorly scalable, is recast into a single high-dimensional integral of functions of electron pair coordinates, which is evaluated by the scalable method of Monte Carlo integration. The sampling efficiency is further accelerated by the redundant-walker algorithm, which allows a maximal reuse of electron pairs. Here, a multitude of graphical processing units (GPUs) offers a uniquely ideal platform to expose multilevel parallelism: fine-grain data-parallelism for the redundant-walker algorithm in which millions of threads compute and share orbital amplitudes on each GPU; coarse-grain instruction-parallelism for near-independent Monte Carlo integrations on many GPUs with few and infrequent interprocessor communications. While the efficiency boost by the redundant-walker algorithm on central processing units (CPUs) grows linearly with the number of electron pairs and tends to saturate when the latter exceeds the number of orbitals, on a GPU it grows quadratically before it increases linearly and then eventually saturates at a much larger number of pairs. This is because the orbital constructions are nearly perfectly parallelized on a GPU and thus completed in a near-constant time regardless of the number of pairs. In consequence, an MC-MP2/cc-pVDZ calculation of a benzene dimer is 2700 times faster on 256 GPUs (using 2048 electron pairs) than on two CPUs, each with 8 cores (which can use only up to 256 pairs effectively). We also numerically determine that the cost to achieve a given relative statistical uncertainty in an MC-MP2 energy increases as O(n³) or better with system size n, which may be compared with the O(n⁵) scaling of the conventional implementation of deterministic MP2. We thus establish the scalability of MC-MP2 with both system and computer sizes.

  11. [The architectural design of psychiatric care buildings].

    Science.gov (United States)

    Dunet, Lionel

    2012-01-01

    The architectural design of psychiatric care buildings. In addition to certain "classic" creations, the Dunet architectural office has designed several units for difficult patients as well as a specially adapted hospitalisation unit. These creations which are demanding in terms of the organisation of care require close consultation with the nursing teams. Testimony of an architect who is particularly engaged in the universe of psychiatry.

  12. Redesigning Triangular Dense Matrix Computations on GPUs

    KAUST Repository

    Charara, Ali

    2016-08-09

    A new implementation of the triangular matrix-matrix multiplication (TRMM) and the triangular solve (TRSM) kernels are described on GPU hardware accelerators. Although part of the Level 3 BLAS family, these highly computationally intensive kernels fail to achieve the percentage of the theoretical peak performance on GPUs that one would expect when running kernels with similar surface-to-volume ratio on hardware accelerators, i.e., the standard matrix-matrix multiplication (GEMM). The authors propose adopting a recursive formulation, which enriches the TRMM and TRSM inner structures with GEMM calls and, therefore, reduces memory traffic while increasing the level of concurrency. The new implementation enables efficient use of the GPU memory hierarchy and mitigates the latency overhead, to run at the speed of the higher cache levels. Performance comparisons show up to eightfold and twofold speedups for large dense matrix sizes, against the existing state-of-the-art TRMM and TRSM implementations from NVIDIA cuBLAS, respectively, across various GPU generations. Once integrated into high-level Cholesky-based dense linear algebra algorithms, the performance impact on the overall applications demonstrates up to fourfold and twofold speedups, against the equivalent native implementations, linked with cuBLAS TRMM and TRSM kernels, respectively. The new TRMM/TRSM kernel implementations are part of the open-source KBLAS software library (http://ecrc.kaust.edu.sa/Pages/Res-kblas.aspx) and are lined up for integration into the NVIDIA cuBLAS library in the upcoming v8.0 release.
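
    The recursive formulation mentioned above is easy to state. The sketch below shows it for B := A·B with lower-triangular A (a CPU illustration of the idea, not the KBLAS kernel itself): the two half-sized TRMMs recurse while the off-diagonal update becomes a large GEMM, which is where hardware accelerators run closest to peak.

        import numpy as np

        def trmm_lower(A, B, blk=64):
            """In-place B := A @ B for lower-triangular A, recast recursively so
            that most of the work lands in large GEMM calls."""
            n = A.shape[0]
            if n <= blk:
                B[:] = np.tril(A) @ B              # small direct base case
                return
            m = n // 2
            # A = [[A11, 0], [A21, A22]] and B = [[B1], [B2]], so
            # A @ B = [[A11 @ B1], [A21 @ B1 + A22 @ B2]].
            trmm_lower(A[m:, m:], B[m:], blk)      # B2 := A22 @ B2   (TRMM)
            B[m:] += A[m:, :m] @ B[:m]             # B2 += A21 @ B1   (GEMM)
            trmm_lower(A[:m, :m], B[:m], blk)      # B1 := A11 @ B1   (TRMM)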

  13. High performance deformable image registration algorithms for manycore processors

    CERN Document Server

    Shackleford, James; Sharp, Gregory

    2013-01-01

    High Performance Deformable Image Registration Algorithms for Manycore Processors develops highly data-parallel image registration algorithms suitable for use on modern multi-core architectures, including graphics processing units (GPUs). Focusing on deformable registration, we show how to develop data-parallel versions of the registration algorithm suitable for execution on the GPU. Image registration is the process of aligning two or more images into a common coordinate frame and is a fundamental step to be able to compare or fuse data obtained from different sensor measurements. E

  14. Heterogeneous computing with OpenCL 2.0

    CERN Document Server

    Kaeli, David R; Schaa, Dana; Zhang, Dong Ping

    2015-01-01

    Heterogeneous Computing with OpenCL 2.0 teaches OpenCL and parallel programming for complex systems that may include a variety of device architectures: multi-core CPUs, GPUs, and fully-integrated Accelerated Processing Units (APUs). This fully-revised edition includes the latest enhancements in OpenCL 2.0 including: Shared virtual memory to increase programming flexibility and reduce data transfers that consume resources Dynamic parallelism which reduces processor load and avoids bottlenecks Improved imaging support and integration with OpenGL  Designed to work on multiple platfor

  15. Dual leadership in a hospital practice

    DEFF Research Database (Denmark)

    Thude, Bettina Ravnborg; Thomsen, Svend Erik; Stenager, Egon

    2017-01-01

    Purpose Despite the practice of dual leadership in many organizations, there is relatively little research on the topic. Dual leadership means two leaders share the leadership task and are held jointly accountable for the results of the unit. To better understand how dual leadership works, this study aims to analyse three different dual leadership pairs at a Danish hospital. Furthermore, this study develops a tool to characterize dual leadership teams from each other. Design/methodology/approach This is a qualitative study using semi-structured interviews. Six leaders were interviewed to clarify how dual leadership works in a hospital context. All interviews were transcribed and coded. During coding, focus was on the nine principles found in the literature and another principle was found by looking at the themes that were generic for all six interviews. Findings Results indicate...

  16. Architectural transformations in network services and distributed systems

    CERN Document Server

    Luntovskyy, Andriy

    2017-01-01

    With this work we aim to help not only readers but also ourselves, as professionals actively involved in the networking branch, to understand the trends that have developed over the last two decades in distributed systems and networks. Important architectural transformations of distributed systems are examined, and examples of new architectural solutions are discussed. Contents: periodization of service development; energy efficiency; architectural transformations in distributed systems; clustering and parallel computing, performance models; cloud computing, RAICs, virtualization, SDN; Smart Grid, Internet of Things, fog computing; mobile communication from LTE to 5G, DIDO, SAT-based systems; data security guaranteeing in distributed systems. Target groups: students in EE and IT at universities and (dual) technical high schools; graduated engineers as well as teaching staff. About the authors: Andriy Luntovskyy provides classes on networks, mobile communication, software technology, distributed systems, ...

  17. Unit 1A: General Approach to the Teaching of Architecture

    DEFF Research Database (Denmark)

    Gammelgaard Nielsen, Anders

    2011-01-01

    An ideal course: Ever since the founding of the Aarhus School of Architecture in 1965 there has been a tradition of lively discussion surrounding the content of the architecture program. The discussion has often been conducted from ideological or normative positions, with the tendency to st...

  18. Planet-disc interactions with Discontinuous Galerkin Methods using GPUs

    Science.gov (United States)

    Velasco Romero, David A.; Veiga, Maria Han; Teyssier, Romain; Masset, Frédéric S.

    2018-05-01

    We present a two-dimensional Cartesian code based on high order discontinuous Galerkin methods, implemented to run in parallel over multiple GPUs. A simple planet-disc setup is used to compare the behaviour of our code against the behaviour found using the FARGO3D code with a polar mesh. We make use of the time dependence of the torque exerted by the disc on the planet as a means to quantify the numerical viscosity of the code. We find that the numerical viscosity of the Keplerian flow can be as low as a few 10⁻⁸ r²Ω, r and Ω being respectively the local orbital radius and frequency, for fifth order schemes and a resolution of ∼10⁻² r. Although for a single disc problem a solution of low numerical viscosity can be obtained at lower computational cost with FARGO3D (which is nearly an order of magnitude faster than a fifth order method), discontinuous Galerkin methods appear promising for obtaining solutions of low numerical viscosity in more complex situations where the flow cannot be captured on a polar or spherical mesh concentric with the disc.

  19. Adolescents and Dual Diagnosis in a Psychiatric Emergency Service.

    Science.gov (United States)

    Matali, José Luis; Andión, Oscar; Pardo, Marta; Iniesta, Raquel; Serrano, Eduard; San, Luis

    2016-03-02

    In recent years, both the prevalence of drug use and related child and adolescent psychiatric emergencies have risen sharply. There are few studies about the impact on child and adolescent emergency services. This study has a twofold aim. The first is to describe the prevalence of substance use disorders, mental disorders and dual diagnosis (substance use problems plus mental disorder) in adolescents in a psychiatric emergency service. The second is to analyze clinical and healthcare differences between patients with dual diagnosis and patients with a mental disorder without substance use disorder. We retrospectively reviewed 4012 discharge forms for emergencies treated at the psychiatric emergency department during the period 2007-2009. We obtained a sample of 1795 visits. This sample was divided into two groups: the dual diagnosis group (n = 477) and the psychiatric disorder group (n = 1318). The dual diagnosis group accounted for 26.5% of psychiatric emergencies analyzed. Compared to the psychiatric disorder group, the dual diagnosis group had significantly more conduct disorders, social problems, involuntariness in the visit, fewer hospital admissions and less connection with the healthcare network. Adolescents with a dual diagnosis account for a high percentage of visits at child and adolescent psychiatric emergency services. This patient group requires specialized care both at emergency services and in specific units. Accordingly, these units should play a triple role when handling dual diagnosis: detection, brief treatment and referral to a specialised unit.

  20. High performance 3D neutron transport on peta scale and hybrid architectures within APOLLO3 code

    International Nuclear Information System (INIS)

    Jamelot, E.; Dubois, J.; Lautard, J-J.; Calvin, C.; Baudron, A-M.

    2011-01-01

    APOLLO3 code is a common project of CEA, AREVA and EDF for the development of a new generation system for core physics analysis. We present here the parallelization of two deterministic transport solvers of APOLLO3: MINOS, a simplified 3D transport solver on structured Cartesian and hexagonal grids, and MINARET, a transport solver based on triangular meshes in 2D and prismatic ones in 3D. We used two different techniques to accelerate MINOS: a domain decomposition method, combined with a GPU-accelerated algorithm. The domain decomposition is based on the Schwarz iterative algorithm, with Robin boundary conditions to exchange information. The Robin parameters influence the convergence, and we detail how we optimized the choice of these parameters. MINARET parallelization is based on angular direction calculation using explicit message passing. Fine grain parallelization is also available for each angular direction using shared memory multithreaded acceleration. Many performance results are presented on massively parallel architectures using more than 10³ cores and on hybrid architectures using some tens of GPUs. This work contributes to the HPC development in reactor physics at the CEA Nuclear Energy Division. (author)
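
    To make the Schwarz iteration concrete, here is a minimal, hypothetical Python sketch (not from APOLLO3) of classical alternating Schwarz on a 1D Poisson problem with two overlapping subdomains; the paper's solver exchanges Robin data at the interfaces instead of the plain Dirichlet values used below, and tunes the Robin parameter for convergence.

        import numpy as np

        def solve_poisson(f, h, u_left, u_right):
            # Solve -u'' = f on the interior points of a uniform grid,
            # with Dirichlet values u_left, u_right at the two ends.
            n = len(f)
            A = (np.diag(2.0 * np.ones(n))
                 + np.diag(-np.ones(n - 1), 1)
                 + np.diag(-np.ones(n - 1), -1))
            rhs = f * h**2
            rhs[0] += u_left
            rhs[-1] += u_right
            return np.linalg.solve(A, rhs)

        N, h = 101, 1.0 / 100
        x = np.linspace(0.0, 1.0, N)
        f = np.ones(N)                     # -u'' = 1, u(0) = u(1) = 0
        ia, ib = 40, 60                    # overlapping subdomains [0, x_ib] and [x_ia, 1]
        u = np.zeros(N)
        for _ in range(30):                # alternating Schwarz sweeps
            u[1:ib] = solve_poisson(f[1:ib], h, 0.0, u[ib])
            u[ia + 1:N - 1] = solve_poisson(f[ia + 1:N - 1], h, u[ia], 0.0)
        print(np.max(np.abs(u - x * (1 - x) / 2)))   # error vs the exact solution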

  1. Degree of conversion of resin-based materials cured with dual-peak or single-peak LED light-curing units.

    Science.gov (United States)

    Lucey, Siobhan M; Santini, Ario; Roebuck, Elizabeth M

    2015-03-01

    There is a lack of data on polymerization of resin-based materials (RBMs) used in paediatric dentistry, using dual-peak light-emitting diode (LED) light-curing units (LCUs). To evaluate the degree of conversion (DC) of RBMs cured with dual-peak or single-peak LED LCUs. Samples of Vit-l-escence (Ultradent) and Herculite XRV Ultra (Kerr) and fissure sealants Delton Clear and Delton Opaque (Dentsply) were prepared (n = 3 per group) and cured with either one of two dual-peak LCUs (bluephase® G2; Ivoclar Vivadent or Valo; Ultradent) or a single-peak LCU (bluephase®; Ivoclar Vivadent). High-performance liquid chromatography and nuclear magnetic resonance spectroscopy were used to confirm the presence or absence of initiators other than camphorquinone. The DC was determined using micro-Raman spectroscopy. Data were analysed using general linear model ANOVA; α = 0.05. With Herculite XRV Ultra, the single-peak LCU gave higher DC values than either of the two dual-peak LCUs (P < 0.05). Both fissure sealants showed higher DC compared with the two RBMs (P < 0.05); the DC at the bottom of the clear sealant was greater than the opaque sealant (P < 0.05). 2,4,6-Trimethylbenzoyldiphenylphosphine oxide (Lucirin® TPO) was found only in Vit-l-escence. Dual-peak LED LCUs may not be best suited for curing non-Lucirin® TPO-containing materials. A clear sealant showed a better cure throughout the material and may be more appropriate than opaque versions in deep fissures. © 2014 BSPD, IAPD and John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.
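
    For context, the DC extracted from micro-Raman spectra is conventionally computed from the ratio of the aliphatic to aromatic C=C band intensities before and after curing; the bands below are the usual ones for methacrylate resins and are an assumption, since the abstract does not state them:

        \[
        \mathrm{DC}\,(\%) = \left( 1 - \frac{R_{\mathrm{cured}}}{R_{\mathrm{uncured}}} \right) \times 100,
        \qquad
        R = \frac{I_{1640\,\mathrm{cm}^{-1}}\ \text{(aliphatic C=C)}}{I_{1610\,\mathrm{cm}^{-1}}\ \text{(aromatic C=C)}}
        \]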

  2. Dual Decomposition for Large-Scale Power Balancing

    DEFF Research Database (Denmark)

    Halvgaard, Rasmus; Jørgensen, John Bagterp; Vandenberghe, Lieven

    2013-01-01

    Dual decomposition is applied to power balancing of flexible thermal storage units. The centralized large-scale problem is decomposed into smaller subproblems and solved locally by each unit in the Smart Grid. Convergence is achieved by coordinating the units' consumption through a negotiation...
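
    A minimal sketch of the price negotiation behind dual decomposition (Python; the quadratic cost model, bounds, and step size are illustrative assumptions, not taken from the paper): each unit minimizes its local cost offset by the price, and a coordinator moves the price by the remaining power imbalance.

        import numpy as np

        a = np.array([1.0, 2.0, 4.0])        # quadratic cost coefficients: c_i(p) = a_i * p^2
        p_max, target, lam, step = 5.0, 6.0, 0.0, 0.5

        for k in range(200):                 # dual (sub)gradient ascent on the price lam
            # each unit solves min_p a_i p^2 - lam * p locally => p = lam / (2 a_i), clipped
            p = np.clip(lam / (2.0 * a), 0.0, p_max)
            imbalance = target - p.sum()     # power-balance residual
            lam += step * imbalance          # raise the price if total consumption is too low
            if abs(imbalance) < 1e-6:
                break
        print(lam, p)                        # market-clearing price and per-unit consumptions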

  3. Accelerating Monte Carlo simulations of photon transport in a voxelized geometry using a massively parallel graphics processing unit

    International Nuclear Information System (INIS)

    Badal, Andreu; Badano, Aldo

    2009-01-01

    Purpose: It is a known fact that Monte Carlo simulations of radiation transport are computationally intensive and may require long computing times. The authors introduce a new paradigm for the acceleration of Monte Carlo simulations: The use of a graphics processing unit (GPU) as the main computing device instead of a central processing unit (CPU). Methods: A GPU-based Monte Carlo code that simulates photon transport in a voxelized geometry with the accurate physics models from PENELOPE has been developed using the CUDA programming model (NVIDIA Corporation, Santa Clara, CA). Results: An outline of the new code and a sample x-ray imaging simulation with an anthropomorphic phantom are presented. A remarkable 27-fold speed up factor was obtained using a GPU compared to a single core CPU. Conclusions: The reported results show that GPUs are currently a good alternative to CPUs for the simulation of radiation transport. Since the performance of GPUs is currently increasing at a faster pace than that of CPUs, the advantages of GPU-based software are likely to be more pronounced in the future.
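
    The parallelization pattern — one photon history per GPU thread — can be sketched with toy physics (Python/NumPy used here as a vectorized CPU stand-in; none of PENELOPE's physics models are reproduced):

        import numpy as np

        def toy_photon_histories(n, mu=0.2, p_absorb=0.3, slab=5.0, seed=0):
            # One "thread" per photon history, vectorized across all n photons.
            rng = np.random.default_rng(seed)
            depth = np.zeros(n)
            energy = np.full(n, 1.0)
            alive = np.ones(n, dtype=bool)
            deposited = 0.0
            while alive.any():
                depth += np.where(alive, rng.exponential(1.0 / mu, n), 0.0)
                alive &= depth < slab                      # photons past the slab escape
                absorb = alive & (rng.random(n) < p_absorb)
                deposited += energy[absorb].sum()          # photoelectric-like: full deposit
                alive &= ~absorb
                keep = rng.random(n)                       # Compton-like: keep a random fraction
                deposited += np.where(alive, energy * (1 - keep), 0.0).sum()
                energy = np.where(alive, energy * keep, energy)
                alive &= energy > 1e-3                     # kill photons below a toy cutoff
            return deposited

        print(toy_photon_histories(100000))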

  4. Accelerating Monte Carlo simulations of photon transport in a voxelized geometry using a massively parallel graphics processing unit

    Energy Technology Data Exchange (ETDEWEB)

    Badal, Andreu; Badano, Aldo [Division of Imaging and Applied Mathematics, OSEL, CDRH, U.S. Food and Drug Administration, Silver Spring, Maryland 20993-0002 (United States)

    2009-11-15

    Purpose: It is a known fact that Monte Carlo simulations of radiation transport are computationally intensive and may require long computing times. The authors introduce a new paradigm for the acceleration of Monte Carlo simulations: The use of a graphics processing unit (GPU) as the main computing device instead of a central processing unit (CPU). Methods: A GPU-based Monte Carlo code that simulates photon transport in a voxelized geometry with the accurate physics models from PENELOPE has been developed using the CUDA programming model (NVIDIA Corporation, Santa Clara, CA). Results: An outline of the new code and a sample x-ray imaging simulation with an anthropomorphic phantom are presented. A remarkable 27-fold speed up factor was obtained using a GPU compared to a single core CPU. Conclusions: The reported results show that GPUs are currently a good alternative to CPUs for the simulation of radiation transport. Since the performance of GPUs is currently increasing at a faster pace than that of CPUs, the advantages of GPU-based software are likely to be more pronounced in the future.

  5. Accelerating Monte Carlo simulations of photon transport in a voxelized geometry using a massively parallel graphics processing unit.

    Science.gov (United States)

    Badal, Andreu; Badano, Aldo

    2009-11-01

    It is a known fact that Monte Carlo simulations of radiation transport are computationally intensive and may require long computing times. The authors introduce a new paradigm for the acceleration of Monte Carlo simulations: The use of a graphics processing unit (GPU) as the main computing device instead of a central processing unit (CPU). A GPU-based Monte Carlo code that simulates photon transport in a voxelized geometry with the accurate physics models from PENELOPE has been developed using the CUDA™ programming model (NVIDIA Corporation, Santa Clara, CA). An outline of the new code and a sample x-ray imaging simulation with an anthropomorphic phantom are presented. A remarkable 27-fold speed up factor was obtained using a GPU compared to a single core CPU. The reported results show that GPUs are currently a good alternative to CPUs for the simulation of radiation transport. Since the performance of GPUs is currently increasing at a faster pace than that of CPUs, the advantages of GPU-based software are likely to be more pronounced in the future.

  6. Architecture and functional ecology of the human gastrocnemius muscle-tendon unit.

    Science.gov (United States)

    Butler, Erin E; Dominy, Nathaniel J

    2016-04-01

    The gastrocnemius muscle-tendon unit (MTU) is central to human locomotion. Structural variation in the human gastrocnemius MTU is predicted to affect the efficiency of locomotion, a concept most often explored in the context of performance activities. For example, stiffness of the Achilles tendon varies among individuals with different histories of competitive running. Such a finding highlights the functional variation of individuals and raises the possibility of similar variation between populations, perhaps in response to specific ecological or environmental demands. Researchers often assume minimal variation in human populations, or that industrialized populations represent the human species as well as any other. Yet rainforest hunter-gatherers, which often express the human pygmy phenotype, contradict such assumptions. Indeed, the human pygmy phenotype is a potential model system for exploring the range of ecomorphological variation in the architecture of human hindlimb muscles, a concept we review here. © 2015 Anatomical Society.

  7. A 10 Gb/s passive-components-based WDM-TDM reconfigurable optical access network architecture

    NARCIS (Netherlands)

    Tran, N.C.; Jung, H.D.; Okonkwo, C.M.; Tangdiongga, E.; Koonen, A.M.J.

    2011-01-01

    We propose a cost-effective, reconfigurable optical access network by employing passive components in the remote node and dual conventional optical transceivers in ONUs. The architecture is demonstrated with bidirectional transmission at 10 Gb/s.

  8. Automatic Functionality Assignment to AUTOSAR Multicore Distributed Architectures

    DEFF Research Database (Denmark)

    Maticu, Florin; Pop, Paul; Axbrink, Christian

    2016-01-01

    The automotive electronic architectures have moved from federated architectures, where one function is implemented in one ECU (Electronic Control Unit), to distributed architectures, where several functions may share resources on an ECU. In addition, multicore ECUs are being adopted because of better performance, cost, size, fault-tolerance and power consumption. In this paper we present an approach for the automatic software functionality assignment to multicore distributed architectures. We consider that the systems use the AUTomotive Open System ARchitecture (AUTOSAR). The functionality...

  9. BROCCOLI: Software for Fast fMRI Analysis on Many-Core CPUs and GPUs

    Directory of Open Access Journals (Sweden)

    Anders eEklund

    2014-03-01

    Analysis of functional magnetic resonance imaging (fMRI) data is becoming ever more computationally demanding as temporal and spatial resolutions improve, and large, publicly available data sets proliferate. Moreover, methodological improvements in the neuroimaging pipeline, such as non-linear spatial normalization, non-parametric permutation tests and Bayesian Markov Chain Monte Carlo approaches, can dramatically increase the computational burden. Despite these challenges, there do not yet exist any fMRI software packages which leverage inexpensive and powerful graphics processing units (GPUs) to perform these analyses. Here, we therefore present BROCCOLI, a free software package written in OpenCL (Open Computing Language) that can be used for parallel analysis of fMRI data on a large variety of hardware configurations. BROCCOLI has, for example, been tested with an Intel CPU, an Nvidia GPU and an AMD GPU. These tests show that parallel processing of fMRI data can lead to significantly faster analysis pipelines. This speedup can be achieved on relatively standard hardware, but further, dramatic speed improvements require only a modest investment in GPU hardware. BROCCOLI (running on a GPU) can perform non-linear spatial normalization to a 1 mm³ brain template in 4-6 seconds, and run a second level permutation test with 10,000 permutations in about a minute. These non-parametric tests are generally more robust than their parametric counterparts, and can also enable more sophisticated analyses by estimating complicated null distributions. Additionally, BROCCOLI includes support for Bayesian first-level fMRI analysis using a Gibbs sampler. The new software is freely available under GNU GPL3 and can be downloaded from GitHub (https://github.com/wanderine/BROCCOLI/).

  10. Dual-mode ultraflow access networks: a hybrid solution for the access bottleneck

    Science.gov (United States)

    Kazovsky, Leonid G.; Shen, Thomas Shunrong; Dhaini, Ahmad R.; Yin, Shuang; De Leenheer, Marc; Detwiler, Benjamin A.

    2013-12-01

    Optical Flow Switching (OFS) is a promising solution for large Internet data transfers. In this paper, we introduce UltraFlow Access, a novel optical access network architecture that offers dual-mode service to its end-users: IP and OFS. With UltraFlow Access, we design and implement a new dual-mode control plane and a new dual-mode network stack to ensure efficient connection setup and reliable and optimal data transmission. We study the impact of the UltraFlow system's design on the network throughput. Our experimental results show that with an optimized system design, near optimal (around 10 Gb/s) OFS data throughput can be attained when the line rate is 10 Gb/s.

  11. FPGAs and parallel architectures for aerospace applications soft errors and fault-tolerant design

    CERN Document Server

    Rech, Paolo

    2016-01-01

    This book introduces the concepts of soft errors in FPGAs, as well as the motivation for using commercial, off-the-shelf (COTS) FPGAs in mission-critical and remote applications, such as aerospace. The authors describe the effects of radiation in FPGAs, present a large set of soft-error mitigation techniques that can be applied in these circuits, as well as methods for qualifying these circuits under radiation. Coverage includes radiation effects in FPGAs, fault-tolerant techniques for FPGAs, use of COTS FPGAs in aerospace applications, experimental data of FPGAs under radiation, FPGA embedded processors under radiation, and fault injection in FPGAs. Since dedicated parallel processing architectures such as GPUs have become more desirable in aerospace applications due to high computational power, GPU analysis under radiation is also discussed. · Discusses features and drawbacks of reconfigurability methods for FPGAs, focused on aerospace applications; · Explains how radia...

  12. A versatile model for soft patchy particles with various patch arrangements.

    Science.gov (United States)

    Li, Zhan-Wei; Zhu, You-Liang; Lu, Zhong-Yuan; Sun, Zhao-Yan

    2016-01-21

    We propose a simple and general mesoscale soft patchy particle model, which can felicitously describe the deformable and surface-anisotropic characteristics of soft patchy particles. This model can be used in dynamics simulations to investigate the aggregation behavior and mechanism of various types of soft patchy particles with tunable number, size, direction, and geometrical arrangement of the patches. To improve the computational efficiency of this mesoscale model in dynamics simulations, we give the simulation algorithm that fits the compute unified device architecture (CUDA) framework of NVIDIA graphics processing units (GPUs). The validation of the model and the performance of the simulations using GPUs are demonstrated by simulating several benchmark systems of soft patchy particles with 1 to 4 patches in a regular geometrical arrangement. Because of its simplicity and computational efficiency, the soft patchy particle model will provide a powerful tool to investigate the aggregation behavior of soft patchy particles, such as patchy micelles, patchy microgels, and patchy dendrimers, over larger spatial and temporal scales.

  13. cudaBayesreg: Parallel Implementation of a Bayesian Multilevel Model for fMRI Data Analysis

    Directory of Open Access Journals (Sweden)

    Adelino R. Ferreira da Silva

    2011-10-01

    Graphic processing units (GPUs) are rapidly gaining maturity as powerful general parallel computing devices. A key feature in the development of modern GPUs has been the advancement of the programming model and programming tools. Compute Unified Device Architecture (CUDA) is a software platform for massively parallel high-performance computing on Nvidia many-core GPUs. In functional magnetic resonance imaging (fMRI), the volume of the data to be processed, and the type of statistical analysis to perform call for high-performance computing strategies. In this work, we present the main features of the R-CUDA package cudaBayesreg which implements in CUDA the core of a Bayesian multilevel model for the analysis of brain fMRI data. The statistical model implements a Gibbs sampler for multilevel/hierarchical linear models with a normal prior. The main contribution for the increased performance comes from the use of separate threads for fitting the linear regression model at each voxel in parallel. The R-CUDA implementation of the Bayesian model proposed here has been able to reduce significantly the run-time processing of Markov chain Monte Carlo (MCMC) simulations used in Bayesian fMRI data analyses. Presently, cudaBayesreg is only configured for Linux systems with Nvidia CUDA support.
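
    The voxel-level parallelism the package exploits can be mimicked on the CPU with one vectorized least-squares pass over all voxels (a sketch only; cudaBayesreg itself runs a Gibbs sampler for the multilevel model, which is not reproduced here):

        import numpy as np

        T, V, K = 200, 50000, 3                 # time points, voxels, regressors
        rng = np.random.default_rng(1)
        X = rng.standard_normal((T, K))         # design matrix, shared by all voxels
        Y = rng.standard_normal((T, V))         # fMRI time series, one column per voxel
        beta = np.linalg.pinv(X) @ Y            # all V regressions at once (K x V);
                                                # the CPU analogue of one GPU thread per voxel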

  14. 76 FR 55944 - In the Matter of Certain Electronic Devices With Image Processing Systems, Components Thereof...

    Science.gov (United States)

    2011-09-09

    ... having graphics processing units (``GPUs'') supplied by NVIDIA Corporation (``NVIDIA'') infringe any... show the ALJ addressed infringement relating to the NVIDIA GPUs; and (b) the evidence in the record, if any, that accused articles incorporating the NVIDIA GPUs infringe an asserted patent claim. Please...

  15. Dual diagnosis resource needs in Spain: a national survey of professionals.

    Science.gov (United States)

    Szerman, Nestor; Vega, Pablo; Grau-López, Lara; Barral, Carmen; Basurte-Villamor, Ignacio; Mesías, Beatriz; Rodríguez-Cintas, Laia; Martínez-Raga, José; Casas, Miguel; Roncero, Carlos

    2014-01-01

    Since provision of integrated services for patients with dual pathology or dual disorders (coexistence of an addictive disorder and another mental health disorder) is an important challenge in mental health, this study assessed health care professionals' perceptions and knowledge of the current state of specific resources for patients with dual pathology in Spain. We conducted a national survey of health care professionals seeing patients with dual pathology in treatment facilities throughout Spain. Participants completed a specific online questionnaire about the needs of and available resources for patients with dual pathology. A total of 659 professionals, mostly psychologists (n = 286, 43.4%) or psychiatrists (n = 217, 32.9%), participated in the study. Nearly all participants who responded to these items reported that specific resources for dual pathology were needed (n = 592/635, 93.2%); 76.7% (n = 487) identified intermediate resources, 68.8% (n = 437) acute detoxification units, and 64.6% (n = 410) medium-stay rehabilitation units as particularly necessary. In the opinion of 54.0% of respondents (n = 343), integrated mental health and addiction treatment services were available. Of the participants who answered these items, only a small proportion (n = 162/605, 26.8%) reported that there were appropriate outpatient programs for dual pathology, 30.4% (n = 184/605) specific hospitalization units, 16.9% (n = 99/587) subacute inpatient units, 34.2% (n = 201/587) outpatient intermediate resources, 15.5% (n = 91/587) day hospitals, and 21.5% (n = 126/587) day centers. Conversely, 62.5% (n = 378/587) of participants reported a greater presence of specific detoxification/withdrawal units, 47.3% (n = 286/587) psychiatric acute admission units, and 41.9% (n = 246/587) therapeutic communities. In the professionals' opinion, the presence of specialty programs was low; 11.6% of respondents (n = 68/587) reported that vocational programs and 16.7% (n = 98/587) reported

  16. A high-throughput readout architecture based on PCI-Express Gen3 and DirectGMA technology

    International Nuclear Information System (INIS)

    Rota, L.; Vogelgesang, M.; Perez, L.E. Ardila; Caselle, M.; Chilingaryan, S.; Dritschler, T.; Zilio, N.; Kopmann, A.; Balzer, M.; Weber, M.

    2016-01-01

    Modern physics experiments produce multi-GB/s data rates. Fast data links and high performance computing stages are required for continuous data acquisition and processing. Because of their intrinsic parallelism and computational power, GPUs emerged as an ideal solution to process this data in high performance computing applications. In this paper we present a high-throughput platform based on direct FPGA-GPU communication. The architecture consists of a Direct Memory Access (DMA) engine compatible with the Xilinx PCI-Express core, a Linux driver for register access, and high-level software to manage direct memory transfers using AMD's DirectGMA technology. Measurements with a Gen3 x8 link show a throughput of 6.4 GB/s for transfers to GPU memory and 6.6 GB/s to system memory. We also assess the possibility of using the architecture in low latency systems: preliminary measurements show a round-trip latency as low as 1 μs for data transfers to system memory, while the additional latency introduced by OpenCL scheduling is the current limitation for GPU based systems. Our implementation is suitable for real-time DAQ system applications ranging from photon science and medical imaging to High Energy Physics (HEP) systems

  17. Dual-processing accounts of reasoning, judgment, and social cognition.

    Science.gov (United States)

    Evans, Jonathan St B T

    2008-01-01

    This article reviews a diverse set of proposals for dual processing in higher cognition within largely disconnected literatures in cognitive and social psychology. All these theories have in common the distinction between cognitive processes that are fast, automatic, and unconscious and those that are slow, deliberative, and conscious. A number of authors have recently suggested that there may be two architecturally (and evolutionarily) distinct cognitive systems underlying these dual-process accounts. However, it emerges that (a) there are multiple kinds of implicit processes described by different theorists and (b) not all of the proposed attributes of the two kinds of processing can be sensibly mapped on to two systems as currently conceived. It is suggested that while some dual-process theories are concerned with parallel competing processes involving explicit and implicit knowledge systems, others are concerned with the influence of preconscious processes that contextualize and shape deliberative reasoning and decision-making.

  18. Fabrication of scalable tissue engineering scaffolds with dual-pore microarchitecture by combining 3D printing and particle leaching.

    Science.gov (United States)

    Mohanty, Soumyaranjan; Sanger, Kuldeep; Heiskanen, Arto; Trifol, Jon; Szabo, Peter; Dufva, Marin; Emnéus, Jenny; Wolff, Anders

    2016-04-01

    Limitations in controlling scaffold architecture using traditional fabrication techniques are a problem when constructing engineered tissues/organs. Recently, integration of two pore architectures to generate dual-pore scaffolds with tailored physical properties has attracted wide attention in tissue engineering community. Such scaffolds features primary structured pores which can efficiently enhance nutrient/oxygen supply to the surrounding, in combination with secondary random pores, which give high surface area for cell adhesion and proliferation. Here, we present a new technique to fabricate dual-pore scaffolds for various tissue engineering applications where 3D printing of poly(vinyl alcohol) (PVA) mould is combined with salt leaching process. In this technique the sacrificial PVA mould, determining the structured pore architecture, was filled with salt crystals to define the random pore regions of the scaffold. After crosslinking the casted polymer the combined PVA-salt mould was dissolved in water. The technique has advantages over previously reported ones, such as automated assembly of the sacrificial mould, and precise control over pore architecture/dimensions by 3D printing parameters. In this study, polydimethylsiloxane and biodegradable poly(ϵ-caprolactone) were used for fabrication. However, we show that this technique is also suitable for other biocompatible/biodegradable polymers. Various physical and mechanical properties of the dual-pore scaffolds were compared with control scaffolds with either only structured or only random pores, fabricated using previously reported methods. The fabricated dual-pore scaffolds supported high cell density, due to the random pores, in combination with uniform cell distribution throughout the scaffold, and higher cell proliferation and viability due to efficient nutrient/oxygen transport through the structured pores. In conclusion, the described fabrication technique is rapid, inexpensive, scalable, and compatible

  19. A quantitative theory of the Hounsfield unit and its application to dual energy scanning.

    Science.gov (United States)

    Brooks, R A

    1977-10-01

    A standard definition is proposed for the Hounsfield number. Any number in computed tomography can be converted to the Hounsfield scale after performing a simple calibration using air and water. The energy dependence of the Hounsfield number, H, is given by the expression H = (Hc + Hp Q)/(1 + Q), where Hc and Hp are the Compton and photoelectric coefficients of the material being measured, expressed in Hounsfield units, and Q is the "quality factor" of the scanner. Q can be measured by performing a scan of a single calibrating material, such as a potassium iodide solution. By applying this analysis to dual energy scans, the Compton and photoelectric coefficients of an unknown substance may easily be obtained. This can lead to a limited degree of chemical identification.
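
    Writing the quoted relation for two scans with quality factors Q1 and Q2 and measured numbers H1 and H2 gives a 2x2 linear system; solving it (straightforward algebra on the abstract's formula) recovers the Compton and photoelectric coefficients:

        \[
        H_i (1 + Q_i) = H_c + H_p Q_i \quad (i = 1, 2)
        \;\Longrightarrow\;
        H_p = \frac{H_2(1+Q_2) - H_1(1+Q_1)}{Q_2 - Q_1},
        \qquad
        H_c = \frac{H_1(1+Q_1)\,Q_2 - H_2(1+Q_2)\,Q_1}{Q_2 - Q_1}.
        \]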

  20. Dual mode linguistic hedge fuzzy logic controller for an isolated wind-diesel hybrid power system with superconducting magnetic energy storage unit

    International Nuclear Information System (INIS)

    Thameem Ansari, M.Md.; Velusami, S.

    2010-01-01

    A design of a dual mode linguistic hedge fuzzy logic controller for an isolated wind-diesel hybrid power system with a superconducting magnetic energy storage unit is proposed in this paper. The design methodology of the dual mode linguistic hedge fuzzy logic controller is a hybrid model based on the concepts of linguistic hedges and hybrid genetic algorithm-simulated annealing algorithms. The linguistic hedge operators are used to adjust the shape of the system membership functions dynamically and can speed up the control response to fit the system demand. The hybrid genetic algorithm-simulated annealing algorithm is adopted to search for the optimal linguistic hedge combination in the linguistic hedge module. The dual mode concept is also incorporated in the proposed controller because it can improve the system performance. The system with the proposed controller was simulated, and the frequency deviation resulting from a step load disturbance is presented. The comparison of the proportional plus integral controller, the fuzzy logic controller and the proposed dual mode linguistic hedge fuzzy logic controller shows that, with the application of the proposed controller, the system performance is improved significantly. The proposed controller is also found to be less sensitive to changes in the parameters of the system and robust under different operating modes of the hybrid power system.
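
    The hedge operators themselves are simple powers applied to membership values — Zadeh's classical concentration ('very') and dilation ('more or less') hedges. A minimal Python illustration follows; the paper's controller additionally searches for the best hedge combination with a hybrid GA-SA algorithm, which is not sketched here.

        import numpy as np

        def tri(x, a, b, c):
            # Triangular membership function rising on [a, b] and falling on [b, c].
            return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

        very = lambda mu: mu ** 2              # concentration hedge: sharpens the set
        more_or_less = lambda mu: mu ** 0.5    # dilation hedge: widens the set

        x = np.linspace(-1.0, 1.0, 5)
        mu = tri(x, -1.0, 0.0, 1.0)
        print(mu, very(mu), more_or_less(mu))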

  1. Report of the CIRRPC Policy Subpanel on SI metric radiation units

    International Nuclear Information System (INIS)

    1986-12-01

    Recognizing that use of the International System of Units (SI) for radiological quantities is increasing internationally but is not currently widely accepted in the United States, and recognizing that the existing US policy is to plan for the increasing voluntary use of SI units domestically, it is recommended that it be US policy to use dual radiation units in Federal activities. However, it is recognized that in certain operational situations, by reason of economy or safety, the utilization of dual units is undesirable. Therefore, in justified situations, agencies may adopt that system of units which best meets their needs. The objective of the recommended use of dual units is primarily to familiarize people with the SI units. This should serve to ease transition when and where transition is appropriate and mitigate economic and safety concerns. The Subpanel has made suggestions on how this policy should be implemented by Federal agencies: using dual units in issuing regulations containing radiation units, except where determined to be impractical, incorporating dual units in agency internal operating procedures as they are written or revised; and using dual units in contracts and procurement. The Subpanel recommends that this policy be reexamined in about five years following an assessment of the use of SI radiation units both internationally and within the United States. 11 refs

  2. MILC Code Performance on High End CPU and GPU Supercomputer Clusters

    Science.gov (United States)

    DeTar, Carleton; Gottlieb, Steven; Li, Ruizi; Toussaint, Doug

    2018-03-01

    With recent developments in parallel supercomputing architecture, many core, multi-core, and GPU processors are now commonplace, resulting in more levels of parallelism, memory hierarchy, and programming complexity. It has been necessary to adapt the MILC code to these new processors starting with NVIDIA GPUs, and more recently, the Intel Xeon Phi processors. We report on our efforts to port and optimize our code for the Intel Knights Landing architecture. We consider performance of the MILC code with MPI and OpenMP, and optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on the staggered conjugate gradient and gauge force. We also consider performance on recent NVIDIA GPUs using the QUDA library.

  3. MILC Code Performance on High End CPU and GPU Supercomputer Clusters

    Directory of Open Access Journals (Sweden)

    DeTar Carleton

    2018-01-01

    With recent developments in parallel supercomputing architecture, many core, multi-core, and GPU processors are now commonplace, resulting in more levels of parallelism, memory hierarchy, and programming complexity. It has been necessary to adapt the MILC code to these new processors starting with NVIDIA GPUs, and more recently, the Intel Xeon Phi processors. We report on our efforts to port and optimize our code for the Intel Knights Landing architecture. We consider performance of the MILC code with MPI and OpenMP, and optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on the staggered conjugate gradient and gauge force. We also consider performance on recent NVIDIA GPUs using the QUDA library.

  4. Multidimensional upwind hydrodynamics on unstructured meshes using graphics processing units - I. Two-dimensional uniform meshes

    Science.gov (United States)

    Paardekooper, S.-J.

    2017-08-01

    We present a new method for numerical hydrodynamics which uses a multidimensional generalization of the Roe solver and operates on an unstructured triangular mesh. The main advantage over traditional methods based on Riemann solvers, which commonly use one-dimensional flux estimates as building blocks for a multidimensional integration, is its inherently multidimensional nature, and as a consequence its ability to recognize multidimensional stationary states that are not hydrostatic. A second novelty is the focus on graphics processing units (GPUs). By tailoring the algorithms specifically to GPUs, we are able to get speedups of 100-250 compared to a desktop machine. We compare the multidimensional upwind scheme to a traditional, dimensionally split implementation of the Roe solver on several test problems, and we find that the new method significantly outperforms the Roe solver in almost all cases. This comes with increased computational costs per time-step, which makes the new method approximately a factor of 2 slower than a dimensionally split scheme acting on a structured grid.

  5. Performance Evaluation of a Dual Coverage System for Internet of Things Environments

    Directory of Open Access Journals (Sweden)

    Omar Said

    2016-01-01

    A dual coverage system for Internet of Things (IoT) environments is introduced. This system is used to connect IoT nodes regardless of their locations. The proposed system has three different architectures, which are based on satellites and High Altitude Platforms (HAPs). In case of Internet coverage problems, the Internet coverage will be replaced with the Satellite/HAP network coverage under specific restrictions such as loss and delay. According to IoT requirements, the proposed architectures should include multiple levels of satellites or HAPs, or a combination of both, to cover the global Internet things. It was shown that the Satellite/HAP/HAP/Things architecture provides the largest coverage area. A network simulation package, NS2, was used to test the performance of the proposed multilevel architectures. The results indicated that the HAP/HAP/Things architecture has the best end-to-end delay, packet loss, throughput, energy consumption, and handover.

  6. Border information flow architecture

    Science.gov (United States)

    2006-04-01

    This brochure describes the Border Information Flow Architecture (BIFA). The Transportation Border Working Group, a bi-national group that works to enhance coordination and planning between the United States and Canada, identified collaboration on th...

  7. Dual Connectivity in LTE HetNets with Split Control- and User-Plane

    DEFF Research Database (Denmark)

    Zakrzewska, Anna; López-Pérez, David; Kucera, Stepan

    2013-01-01

    Recently, a new network architecture with split control-plane and user-plane has been proposed and gained a lot of momentum in the standardisation of Long Term Evolution (LTE) Release 12. In this new network architecture, the control-plane, which transmits system information and handles user... We provide a detailed description of our dual connectivity framework based on the latest LTE-Advanced enhancements, in which macrocell-assisted (MA) small cells use different channel state information reference signals (CSI-RS) to differentiate among each other and allow User Equipment (UE) to take adequate measurements...

  8. The architecture of information architecture, interaction design and the patterning of digital information

    CERN Document Server

    Dade-Robertson, Martyn

    2011-01-01

    This book looks at relationships between the organization of physical objects in space and the organization of ideas. Historical, philosophical, psychological and architectural knowledge are united to develop an understanding of the relationship between information and its representation.Despite its potential to break the mould, digital information has relied on metaphors from a pre-digital era. In particular, architectural ideas have pervaded discussions of digital information, from the urbanization of cyberspace in science fiction, through to the adoption of spatial visualiz

  9. Dual fuel injection piggyback controller system

    Science.gov (United States)

    Muji, Siti Zarina Mohd.; Hassanal, Muhammad Amirul Hafeez; Lee, Chua King; Fawzi, Mas; Zulkifli, Fathul Hakim

    2017-09-01

    Dual-fuel injection is an effort to reduce dependency on diesel and gasoline fuel. Generally, there are two approaches to implementing dual-fuel injection in a car system. The first approach is changing the whole injector of the car engine, the consequence of which is excessively high cost. Alternatively, it can also be achieved by manipulating the system's control signals, especially the Electronic Control Unit (ECU) signal. Hence, this study focuses on developing a dual injection timing controller system that can be adopted to control the injection time and quantity of compressed natural gas (CNG) and diesel fuel. In this system, a Raspberry Pi 3 acts as the main controller unit to receive the ECU signal, analyze it, and then manipulate its duty cycle to be fed into the Electronic Driver Unit (EDU). The manipulation changes the duty cycle into two pulses instead of a single pulse. One pulse is used mainly to control the injection of diesel fuel, and the other pulse controls the injection of Compressed Natural Gas (CNG). Tests indicated promising results, showing that the system can be implemented in a car as a piggyback system. This article, which was originally published online on 14 September 2017, contained an error in the acknowledgment section. The corrected acknowledgment appears in the Corrigendum attached to the pdf.
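
    The duty-cycle manipulation can be pictured as splitting one ECU injection pulse into two sub-pulses, one per fuel. The helper below is purely illustrative — the timings, split ratio, and gap are hypothetical values, not the authors' firmware logic.

        def split_pulse(start_ms, width_ms, cng_fraction=0.7, gap_ms=0.5):
            # Split a single ECU pulse into a diesel pilot pulse and a CNG main pulse.
            diesel = (start_ms, width_ms * (1.0 - cng_fraction))
            cng = (start_ms + diesel[1] + gap_ms, width_ms * cng_fraction)
            return diesel, cng    # each is (start time, duration) in milliseconds

        print(split_pulse(10.0, 4.0))   # -> ((10.0, 1.2), (11.7, 2.8))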

  10. Design of Carborane Molecular Architectures via Electronic Structure Computations

    International Nuclear Information System (INIS)

    Oliva, J.M.; Serrano-Andres, L.; Klein, D.J.; Schleyer, P.V.R.; Mich, J.

    2009-01-01

    Quantum-mechanical electronic structure computations were employed to explore initial steps towards a comprehensive design of polycarborane architectures through assembly of molecular units. Aspects considered were (i) the striking modification of geometrical parameters through substitution, (ii) endohedral carboranes and proposed ejection mechanisms for ion/atom/energy storage/transport, (iii) the excited state character in single and dimeric molecular units, and (iv) higher architectural constructs. A goal of this work is to find optimal architectures where atom/ion/energy/spin transport within carborane superclusters is feasible in order to modernize and improve future photo energy processes.

  11. 32 CFR 1630.42 - Class 4-C: Alien or dual national.

    Science.gov (United States)

    2010-07-01

    ... 32 National Defense 6 2010-07-01 2010-07-01 false Class 4-C: Alien or dual national. 1630.42... CLASSIFICATION RULES § 1630.42 Class 4-C: Alien or dual national. In Class 4-C shall be placed any registrant who... service in the United States. (b) Is an alien and who has departed from the United States prior to being...

  12. Monte Carlo electron-photon transport using GPUs as an accelerator: Results for a water-aluminum-water phantom

    Energy Technology Data Exchange (ETDEWEB)

    Su, L.; Du, X.; Liu, T.; Xu, X. G. [Nuclear Engineering Program, Rensselaer Polytechnic Institute, Troy, NY 12180 (United States)

    2013-07-01

    An electron-photon coupled Monte Carlo code ARCHER - Accelerated Radiation-transport Computations in Heterogeneous Environments - is being developed at Rensselaer Polytechnic Institute as a software test bed for emerging heterogeneous high performance computers that utilize accelerators such as GPUs. In this paper, the preliminary results of code development and testing are presented. The electron transport in media was modeled using the class-II condensed history method. The electron energy considered ranges from a few hundred keV to 30 MeV. Moller scattering and bremsstrahlung processes above a preset energy were explicitly modeled. Energy loss below that threshold was accounted for using the Continuously Slowing Down Approximation (CSDA). Photon transport was dealt with using the delta tracking method. Photoelectric effect, Compton scattering and pair production were modeled. Voxelised geometry was supported. A serial ARCHER-CPU was first written in C++. The code was then ported to the GPU platform using CUDA C. The hardware involved a desktop PC with an Intel Xeon X5660 CPU and six NVIDIA Tesla M2090 GPUs. ARCHER was tested for a case of a 20 MeV electron beam incident perpendicularly on a water-aluminum-water phantom. The depth and lateral dose profiles were found to agree with results obtained from well tested MC codes. Using six GPU cards, 6×10⁶ histories of electrons were simulated within 2 seconds. In comparison, the same case running the EGSnrc and MCNPX codes required 1645 seconds and 9213 seconds, respectively, on a single CPU core. (authors)
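
    Delta (Woodcock) tracking samples free paths against a majorant cross-section and accepts a collision with probability mu(x)/mu_max, avoiding explicit voxel-boundary crossings. A minimal 1D Python sketch under assumed attenuation values (not ARCHER's implementation):

        import numpy as np

        def delta_track(rng, x, mu_of, mu_max, x_end):
            # Sample the next real collision site in [x, x_end) by Woodcock tracking.
            while True:
                x += rng.exponential(1.0 / mu_max)    # flight against the majorant
                if x >= x_end:
                    return None                        # photon escaped the geometry
                if rng.random() < mu_of(x) / mu_max:   # accept: real collision
                    return x                           # reject: virtual collision, continue

        rng = np.random.default_rng(2)
        mu = lambda x: 0.5 if 2.0 < x < 3.0 else 0.2   # denser middle layer (toy values)
        print(delta_track(rng, 0.0, mu, 0.5, 5.0))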

  13. Getting To Exascale: Applying Novel Parallel Programming Models To Lab Applications For The Next Generation Of Supercomputers

    Energy Technology Data Exchange (ETDEWEB)

    Dube, Evi [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Shereda, Charles [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Nau, Lee [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Harris, Lance [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)

    2010-09-27

    As supercomputing moves toward exascale, node architectures will change significantly. CPU core counts on nodes will increase by an order of magnitude or more. Heterogeneous architectures will become more commonplace, with GPUs or FPGAs providing additional computational power. Novel programming models may make better use of on-node parallelism in these new architectures than do current models. In this paper we examine several of these novel models – UPC, CUDA, and OpenCL – to determine their suitability to LLNL scientific application codes. Our study consisted of several phases: we conducted interviews with code teams and selected two codes to port; we learned how to program in the new models and ported the codes; we debugged and tuned the ported applications; we measured results, and documented our findings. We conclude that UPC is a challenge for porting code, Berkeley UPC is not very robust, and UPC is not suitable as a general alternative to OpenMP for a number of reasons. CUDA is well supported and robust but is a proprietary NVIDIA standard, while OpenCL is an open standard. Both are well suited to a specific set of application problems that can be run on GPUs, but some problems are not suited to GPUs. Further study of the landscape of novel models is recommended.

  14. Dual-systems and the development of reasoning: competence-procedural systems.

    Science.gov (United States)

    Overton, Willis F; Ricco, Robert B

    2011-03-01

    Dual-system, dual-process, accounts of adult cognitive processing are examined in the context of a self-organizing relational developmental systems approaches to cognitive growth. Contemporary adult dual-process accounts describe a linear architecture of mind entailing two split-off, but interacting systems; a domain general, content-free 'analytic' system (system 2) and a domain specific highly contextualized 'heuristic' system (system 1). In the developmental literature on deductive reasoning, a similar distinction has been made between a domain general competence (reflective, algorithmic) system and a domain specific procedural system. In contrast to the linear accounts offered by empiricist, nativist, and/or evolutionary explanations, the dual competence-procedural developmental perspective argues that the mature systems emerge through developmental transformations as differentiations and intercoordinations of an early relatively undifferentiated action matrix. This development, whose microscopic mechanism is action-in-the-world, is characterized as being embodied, nonlinear, and epigenetic. WIREs Cogni Sci 2011 2 231-237 DOI: 10.1002/wcs.120 For further resources related to this article, please visit the WIREs website. © 2010 John Wiley & Sons, Ltd.

  15. DUAL TIMELIKE NORMAL AND DUAL TIMELIKE SPHERICAL CURVES IN DUAL MINKOWSKI SPACE

    OpenAIRE

    ÖNDER, Mehmet

    2009-01-01

    Abstract: In this paper, we give characterizations of dual timelike normal and dual timelike spherical curves in the dual Minkowski 3-space and we show that every dual timelike normal curve is also a dual timelike spherical curve. Keywords: Normal curves, Dual Minkowski 3-Space, Dual Timelike curves. Mathematics Subject Classifications (2000): 53C50, 53C40. DUAL TIMELIKE NORMAL AND DUAL TIMELIKE SPHERICAL CURVES IN DUAL MINKOWSKI SPACE. Abstract (translated from Turkish): In this study, in the dual Minkowski 3-...

  16. Fault tolerant architecture for artificial olfactory system

    International Nuclear Information System (INIS)

    Lotfivand, Nasser; Hamidon, Mohd Nizar; Abdolzadeh, Vida

    2015-01-01

    In this paper, to cover and mask the faults that occur in the sensing unit of an artificial olfactory system, a novel architecture is offered. The proposed architecture is able to tolerate failures in the sensors of the array, and the faults that occur are masked. By extracting correct results from the output of the sensors, the proposed architecture can maintain the quality of service of the data generated from the sensor array. The results of various evaluations and analyses show that the proposed architecture has acceptable performance in comparison with the classic form of the sensor array in gas identification. According to the results, achieving high odor discrimination with the suggested architecture is possible. (paper)

  17. A Parallel Saturation Algorithm on Shared Memory Architectures

    Science.gov (United States)

    Ezekiel, Jonathan; Siminiceanu

    2007-01-01

    Symbolic state-space generators are notoriously hard to parallelize. However, the Saturation algorithm implemented in the SMART verification tool differs from other sequential symbolic state-space generators in that it exploits the locality of firing events in asynchronous system models. This paper explores whether event locality can be utilized to efficiently parallelize Saturation on shared-memory architectures. Conceptually, we propose to parallelize the firing of events within a decision diagram node, which is technically realized via a thread pool. We discuss the challenges involved in our parallel design and conduct experimental studies on its prototypical implementation. On a dual-processor dual core PC, our studies show speed-ups for several example models, e.g., of up to 50% for a Kanban model, when compared to running our algorithm only on a single core.

  18. [Architecture and movement].

    Science.gov (United States)

    Rivallan, Armel

    2012-01-01

    Leading an architectural project means accompanying the movement which it induces within the teams. Between questioning, uncertainty and fear, the organisational changes inherent to the new facility must be subject to constructive and ongoing exchanges. Ethics, safety and training are revised and the unit projects are sometimes modified.

  19. Graphics Processing Units for HEP trigger systems

    International Nuclear Information System (INIS)

    Ammendola, R.; Bauce, M.; Biagioni, A.; Chiozzi, S.; Cotta Ramusino, A.; Fantechi, R.; Fiorini, M.; Giagu, S.; Gianoli, A.; Lamanna, G.; Lonardo, A.; Messina, A.

    2016-01-01

    General-purpose computing on GPUs (Graphics Processing Units) is emerging as a new paradigm in several fields of science, although so far applications have been tailored to the specific strengths of such devices as accelerator in offline computation. With the steady reduction of GPU latencies, and the increase in link and memory throughput, the use of such devices for real-time applications in high-energy physics data acquisition and trigger systems is becoming ripe. We will discuss the use of online parallel computing on GPU for synchronous low level trigger, focusing on CERN NA62 experiment trigger system. The use of GPU in higher level trigger system is also briefly considered.

  20. Graphics Processing Units for HEP trigger systems

    Energy Technology Data Exchange (ETDEWEB)

    Ammendola, R. [INFN Sezione di Roma “Tor Vergata”, Via della Ricerca Scientifica 1, 00133 Roma (Italy); Bauce, M. [INFN Sezione di Roma “La Sapienza”, P.le A. Moro 2, 00185 Roma (Italy); University of Rome “La Sapienza”, P.lee A.Moro 2, 00185 Roma (Italy); Biagioni, A. [INFN Sezione di Roma “La Sapienza”, P.le A. Moro 2, 00185 Roma (Italy); Chiozzi, S.; Cotta Ramusino, A. [INFN Sezione di Ferrara, Via Saragat 1, 44122 Ferrara (Italy); University of Ferrara, Via Saragat 1, 44122 Ferrara (Italy); Fantechi, R. [INFN Sezione di Pisa, Largo B. Pontecorvo 3, 56127 Pisa (Italy); CERN, Geneve (Switzerland); Fiorini, M. [INFN Sezione di Ferrara, Via Saragat 1, 44122 Ferrara (Italy); University of Ferrara, Via Saragat 1, 44122 Ferrara (Italy); Giagu, S. [INFN Sezione di Roma “La Sapienza”, P.le A. Moro 2, 00185 Roma (Italy); University of Rome “La Sapienza”, P.lee A.Moro 2, 00185 Roma (Italy); Gianoli, A. [INFN Sezione di Ferrara, Via Saragat 1, 44122 Ferrara (Italy); University of Ferrara, Via Saragat 1, 44122 Ferrara (Italy); Lamanna, G., E-mail: gianluca.lamanna@cern.ch [INFN Sezione di Pisa, Largo B. Pontecorvo 3, 56127 Pisa (Italy); INFN Laboratori Nazionali di Frascati, Via Enrico Fermi 40, 00044 Frascati (Roma) (Italy); Lonardo, A. [INFN Sezione di Roma “La Sapienza”, P.le A. Moro 2, 00185 Roma (Italy); Messina, A. [INFN Sezione di Roma “La Sapienza”, P.le A. Moro 2, 00185 Roma (Italy); University of Rome “La Sapienza”, P.lee A.Moro 2, 00185 Roma (Italy); and others

    2016-07-11

    General-purpose computing on GPUs (Graphics Processing Units) is emerging as a new paradigm in several fields of science, although so far applications have been tailored to the specific strengths of such devices as accelerator in offline computation. With the steady reduction of GPU latencies, and the increase in link and memory throughput, the use of such devices for real-time applications in high-energy physics data acquisition and trigger systems is becoming ripe. We will discuss the use of online parallel computing on GPU for synchronous low level trigger, focusing on CERN NA62 experiment trigger system. The use of GPU in higher level trigger system is also briefly considered.

  1. Architecture of Schools: The New Learning Environments.

    Science.gov (United States)

    Dudek, Mark

    This guide focuses on the architecture of the primary and pre-school sector in the United Kingdom and broadly considers the subtle spatial and psychological requirements of growing children up to, and beyond, the age of sixteen. Chapter 1 examines the history, origins, and significant historical developments of school architecture, along with an…

  2. Invasive tightly coupled processor arrays

    CERN Document Server

    LARI, VAHID

    2016-01-01

    This book introduces new massively parallel computer (MPSoC) architectures called invasive tightly coupled processor arrays. It proposes strategies, architecture designs, and programming interfaces for invasive TCPAs that allow invading and subsequently executing loop programs with strict requirements or guarantees of non-functional execution qualities such as performance, power consumption, and reliability. For the first time, such a configurable processor array architecture consisting of locally interconnected VLIW processing elements can be claimed by programs, either in full or in part, using the principle of invasive computing. Invasive TCPAs provide unprecedented energy efficiency for the parallel execution of nested loop programs by avoiding any global memory access such as GPUs and may even support loops with complex dependencies such as loop-carried dependencies that are not amenable to parallel execution on GPUs. For this purpose, the book proposes different invasion strategies for claiming a desire...

  3. A heterogeneous hierarchical architecture for real-time computing

    Energy Technology Data Exchange (ETDEWEB)

    Skroch, D.A.; Fornaro, R.J.

    1988-12-01

    The need for high-speed data acquisition and control algorithms has prompted continued research in the area of multiprocessor systems and related programming techniques. The result presented here is a unique hardware and software architecture for high-speed real-time computer systems. The implementation of a prototype of this architecture has required the integration of architecture, operating systems and programming languages into a cohesive unit. This report describes a Heterogeneous Hierarchical Architecture for Real-Time (H²ART) and system software for program loading and interprocessor communication.

  4. IAEA's dual function

    International Nuclear Information System (INIS)

    1967-01-01

    'A factor of paramount importance is the dual nature of atomic energy, which is reflected in the dual function of the Agency; not only to promote, but also to safeguard the peaceful uses of atomic energy'. In taking the above statement as a theme in his address to the 1474th Plenary Meeting of the United Nations General Assembly (22nd November), the Director General, Dr. Sigvard Eklund, went on to speak of a few of the many areas in which society was feeling the impact of atomic energy. During the discussion which followed his report on the Agency's work nearly all speakers referred to the importance of the safeguards system as well as to positive achievements in developing nuclear potential for peaceful purposes

  5. Real time 3D structural and Doppler OCT imaging on graphics processing units

    Science.gov (United States)

    Sylwestrzak, Marcin; Szlag, Daniel; Szkulmowski, Maciej; Gorczyńska, Iwona; Bukowska, Danuta; Wojtkowski, Maciej; Targowski, Piotr

    2013-03-01

    In this report the application of graphics processing unit (GPU) programming for real-time 3D Fourier domain Optical Coherence Tomography (FdOCT) imaging with implementation of Doppler algorithms for visualization of the flows in capillary vessels is presented. Generally, the time needed to process FdOCT data on the main processor of the computer (CPU) constitutes the main limitation for real-time imaging. Employing additional algorithms, such as Doppler OCT analysis, makes this processing even more time consuming. Recently developed GPUs, which offer very high computational power, provide a solution to this problem. Taking advantage of them for massively parallel data processing allows real-time imaging in FdOCT. The presented software for structural and Doppler OCT allows for complete processing and visualization of 2D data consisting of 2000 A-scans generated from 2048-pixel spectra at a frame rate of about 120 fps. The 3D imaging in the same mode of volume data built of 220 × 100 A-scans is performed at a rate of about 8 frames per second. In this paper the software architecture, thread organization and the applied optimizations are described. For illustration, screen shots recorded during real-time imaging of a phantom (a homogeneous water solution of Intralipid in a glass capillary) and the human eye in vivo are presented.
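
    The per-A-scan processing that the GPU parallelizes is, at its core, background subtraction followed by a Fourier transform, with Doppler flow estimated from the phase difference between adjacent A-scans. A NumPy stand-in is sketched below (spectral resampling, dispersion compensation, and the actual CUDA kernels are omitted):

        import numpy as np

        def fdoct_frame(spectra):
            # spectra: (n_ascans, n_pixels) raw camera spectra for one B-scan.
            dc = spectra.mean(axis=0)                    # fixed-pattern background
            ascan = np.fft.fft(spectra - dc, axis=1)
            ascan = ascan[:, : spectra.shape[1] // 2]    # keep positive depths only
            intensity = 20 * np.log10(np.abs(ascan) + 1e-12)
            doppler = np.angle(ascan[1:] * np.conj(ascan[:-1]))  # phase shift ~ axial flow
            return intensity, doppler

        intensity, doppler = fdoct_frame(np.random.rand(2000, 2048))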

  6. APL on GPUs

    DEFF Research Database (Denmark)

    Henriksen, Troels; Dybdal, Martin; Urms, Henrik

    2016-01-01

    This paper demonstrates translation schemes by which programs written in a functional subset of APL can be compiled to code that is run efficiently on general purpose graphical processing units (GPGPUs). Furthermore, the generated programs can be straight-forwardly interoperated with mainstream programming environments, such as Python, for example for purposes of visualization and user interaction. Finally, empirical evaluation shows that the GPGPU translation achieves speedups up to hundreds of times faster than sequential C compiled code.

  7. Fat versus Thin Threading Approach on GPUs: Application to Stochastic Simulation of Chemical Reactions

    KAUST Repository

    Klingbeil, Guido; Erban, Radek; Giles, Mike; Maini, Philip K.

    2012-01-01

    We explore two different threading approaches on a graphics processing unit (GPU) exploiting two different characteristics of the current GPU architecture. The fat thread approach tries to minimize data access time by relying on shared memory and registers potentially sacrificing parallelism. The thin thread approach maximizes parallelism and tries to hide access latencies. We apply these two approaches to the parallel stochastic simulation of chemical reaction systems using the stochastic simulation algorithm (SSA) by Gillespie [14]. In these cases, the proposed thin thread approach shows comparable performance while eliminating the limitation of the reaction system's size. © 2006 IEEE.
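
    For context, one realization of Gillespie's direct-method SSA proceeds as below (Python sketch with a toy decay reaction; under either threading approach in the paper, each GPU thread advances one such realization):

        import numpy as np

        def ssa_direct(rng, x0, stoich, propensities, t_end):
            # Gillespie direct method: exponential waiting time, categorical reaction choice.
            t, x = 0.0, np.array(x0, dtype=float)
            while t < t_end:
                a = propensities(x)
                a0 = a.sum()
                if a0 <= 0.0:
                    break                          # no reaction can fire any more
                t += rng.exponential(1.0 / a0)     # time to the next reaction
                j = rng.choice(len(a), p=a / a0)   # which reaction fires
                x += stoich[j]
            return x

        rng = np.random.default_rng(3)
        stoich = np.array([[-1.0]])                # toy decay reaction: A -> 0
        print(ssa_direct(rng, [100], stoich, lambda x: np.array([0.5 * x[0]]), 5.0))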

  8. Fat versus Thin Threading Approach on GPUs: Application to Stochastic Simulation of Chemical Reactions

    KAUST Repository

    Klingbeil, Guido

    2012-02-01

    We explore two different threading approaches on a graphics processing unit (GPU), exploiting two different characteristics of the current GPU architecture. The fat thread approach tries to minimize data access time by relying on shared memory and registers, potentially sacrificing parallelism. The thin thread approach maximizes parallelism and tries to hide access latencies. We apply these two approaches to the parallel stochastic simulation of chemical reaction systems using the stochastic simulation algorithm (SSA) by Gillespie [14]. In these cases, the proposed thin thread approach shows comparable performance while eliminating the limitation on the size of the reaction system. © 2006 IEEE.

  9. Chained learning architectures in a simple closed-loop behavioural context

    DEFF Research Database (Denmark)

    Kulvicius, Tomas; Porr, Bernd; Wörgötter, Florentin

    2007-01-01

    are very simple and consist of a single learning unit. The current study tries to solve this problem by focusing on chained learning architectures in a simple closed-loop behavioural context. METHODS: We applied temporal sequence learning (Porr B and Wörgötter F 2006) in a closed-loop behavioural system...... where a driving robot learns to follow a line. Here, for the first time, we introduced two types of chained learning architectures, named linear chain and honeycomb chain. We analyzed such architectures in an open- and closed-loop context and compared them to the simple learning unit. CONCLUSIONS...

  10. The Dual Nature of Life Interplay of the Individual and the Genome

    CERN Document Server

    Zhegunov, Gennadiy

    2012-01-01

    Life is a diverse and ubiquitous phenomenon on Earth, characterized by fundamental features distinguishing living bodies from nonliving material. Yet it is also so complex that it has long defied precise definition. This book from a seasoned biologist offers new insights into the nature of life by illuminating a fascinating architecture of dualities inherent in its existence and propagation. Life is connected with individual living beings, yet it is also a collective and inherently global phenomenon of the material world. It embodies a dual existence of cycles of phenotypic life, and their unseen driver — an uninterrupted march of genetic information whose collective immortality is guaranteed by individual mortality. Although evolution propagates and tunes species of organisms, the beings produced can be regarded merely as tools for the survival and cloning of genomes written in an unchanging code. What are the physical versus informational bases and driving forces of life, and how do they unite as an integ...

  11. Degree of conversion of resin-based orthodontic bonding materials cured with single-wave or dual-wave LED light-curing units.

    Science.gov (United States)

    Santini, Ario; McGuinness, Niall; Nor, Noor Azreen Md

    2014-12-01

    To evaluate the degree of conversion (DC) of orthodontic adhesives (RBOAs) cured with dual peak or single peak light-emitting diode (LED) light-curing units (LCUs). Standardized samples of the RBOAs APCPlus, Opal® Bond® and LightBond(TM) were prepared (n = 3) and cured with one of two dual peak LCUs (bluephase® G2-Ivoclar-Vivadent or Valo-Ultradent) or a single peak control (bluephase® Ivoclar-Vivadent). The DC was determined using micro-Raman spectroscopy. The presence or absence of initiators other than camphorquinone was confirmed by high-performance liquid chromatography and nuclear magnetic resonance spectroscopy. Data were analysed using a general linear model in Minitab 15 (Minitab Inc., State College, PA, USA). There was no significant difference in DC between APCPlus and Opal® Bond (confidence interval: -3.89 to 2.48); a significant difference between APCPlus and LightBond(TM) (-18.55 to -12.18) and between Opal® Bond and Lightbond(TM) (-17.85 to -11.48); and no significant difference between bluephase (single peak) and the dual peak LCUs, bluephase G2 (-4.896 to 1.476) and Valo (-3.935 to 2.437), or between bluephase G2 and Valo (-2.225 to 4.147). APCPlus and Opal® Bond showed higher DC values than LightBond(TM) (P<0.05). Lucirin® TPO was found only in Vit-l-escence; it was not identified in the three orthodontic adhesives. All three LCUs performed similarly with the orthodontic adhesives: the make of orthodontic adhesive had a greater effect on DC than the LCU. It is strongly suggested that manufacturers of resin-based orthodontic materials test and report whether or not dual peak LCUs should be used with their materials. Dual peak LED LCUs, though suitable in the majority of cases, may not be recommended for certain non-Lucirin® TPO-containing materials. © 2014 British Orthodontic Society.

  12. Portable LQCD Monte Carlo code using OpenACC

    Science.gov (United States)

    Bonati, Claudio; Calore, Enrico; Coscetti, Simone; D'Elia, Massimo; Mesiti, Michele; Negro, Francesco; Fabio Schifano, Sebastiano; Silvi, Giorgio; Tripiccione, Raffaele

    2018-03-01

    Varying from multi-core CPU processors to many-core GPUs, the present scenario of HPC architectures is extremely heterogeneous. In this context, code portability is increasingly important for easy maintainability of applications; this is relevant in scientific computing where code changes are numerous and frequent. In this talk we present the design and optimization of a state-of-the-art production level LQCD Monte Carlo application, using the OpenACC directives model. OpenACC aims to abstract parallel programming to a descriptive level, where programmers do not need to specify the mapping of the code on the target machine. We describe the OpenACC implementation and show that the same code is able to target different architectures, including state-of-the-art CPUs and GPUs.
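
    To make the difference in abstraction concrete, consider a simple saxpy loop: under OpenACC a single directive asks the compiler to generate the device code, whereas plain CUDA makes the kernel and its launch explicit. The sketch below illustrates the two models in general, not code from the application described above; the OpenACC form is shown in the leading comment.

        // OpenACC: the directive is essentially the whole porting effort; the
        // compiler maps the loop onto the accelerator.
        //
        //   #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        //   for (int i = 0; i < n; ++i) y[i] = a * x[i] + y[i];
        //
        // CUDA: the same computation with explicit mapping to threads.
        __global__ void saxpy(int n, float a, const float* x, float* y) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) y[i] = a * x[i] + y[i];
        }

        void saxpyHost(int n, float a, const float* d_x, float* d_y) {
            saxpy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);
        }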

  13. Parallel hyperbolic PDE simulation on clusters: Cell versus GPU

    Science.gov (United States)

    Rostrup, Scott; De Sterck, Hans

    2010-12-01

    Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications. Program summary: Program title: SWsolver; Catalogue identifier: AEGY_v1_0; Program summary URL
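
    At the finest level of parallelism, this class of applications reduces to stencil updates with explicit time integration. As a stand-in for such a kernel (a deliberately minimal example, not the SWsolver code), a first-order upwind step for the 1D linear advection equation u_t + c u_x = 0 might look as follows in CUDA:

        // One explicit upwind time step of u_t + c*u_x = 0 (c > 0 assumed).
        __global__ void upwindStep(const float* u, float* uNew, int n,
                                   float c, float dt, float dx) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i > 0 && i < n) {
                // Each cell reads only itself and its left neighbour: a stencil
                // pattern that maps naturally onto data-parallel hardware.
                uNew[i] = u[i] - c * dt / dx * (u[i] - u[i - 1]);
            }
            // In a cluster code, halo cells at subdomain edges would be
            // exchanged with neighbouring MPI ranks before each step.
        }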

  14. Single-unit-cell layer established Bi 2 WO 6 3D hierarchical architectures: Efficient adsorption, photocatalysis and dye-sensitized photoelectrochemical performance

    Energy Technology Data Exchange (ETDEWEB)

    Huang, Hongwei; Cao, Ranran; Yu, Shixin; Xu, Kang; Hao, Weichang; Wang, Yonggang; Dong, Fan; Zhang, Tierui; Zhang, Yihe

    2017-12-01

    Single-layer catalysis has sparked huge interest and gained widespread attention owing to its high activity. Simultaneously, three-dimensional (3D) hierarchical structures can afford a large surface area and abundant reactive sites, contributing to high efficiency. Herein, we report an absorbing single-unit-cell layer established Bi2WO6 3D hierarchical architecture fabricated by a sodium dodecyl benzene sulfonate (SDBS)-assisted assembly strategy. The DBS- long chains can adsorb on the (Bi2O2)2+ layers and hence impede stacking of the layers, resulting in the single-unit-cell layer. We also found that SDS, with its shorter chain, is less effective than SDBS. Due to the sufficient exposure of surface O atoms, single-unit-cell layer 3D Bi2WO6 shows strong selectivity for adsorption of multiform organic dyes with different charges. Remarkably, the single-unit-cell layer 3D Bi2WO6 exhibits profoundly enhanced photodegradation activity and especially a superior photocatalytic H2 evolution rate, 14-fold higher than that of bulk Bi2WO6. Systematic photoelectrochemical characterizations disclose that the substantially elevated carrier density and charge separation efficiency are responsible for the strengthened photocatalytic performance. Additionally, the possibility of using single-unit-cell layer 3D Bi2WO6 in dye-sensitized solar cells (DSSC) has also been explored, and it was shown to be a promising dye-sensitized photoanode for the oxygen evolution reaction (OER). Our work not only furnishes an insight into designing single-layer assembled 3D hierarchical architectures, but also offers a multi-functional material for environmental and energy applications.

  15. High-throughput sequence alignment using Graphics Processing Units

    Directory of Open Access Journals (Sweden)

    Trapnell Cole

    2007-12-01

    Full Text Available Abstract Background The recent availability of new, less expensive high-throughput DNA sequencing technologies has yielded a dramatic increase in the volume of sequence data that must be analyzed. These data are being generated for several purposes, including genotyping, genome resequencing, metagenomics, and de novo genome assembly projects. Sequence alignment programs such as MUMmer have proven essential for analysis of these data, but researchers will need ever faster, high-throughput alignment tools running on inexpensive hardware to keep up with new sequence technologies. Results This paper describes MUMmerGPU, an open-source high-throughput parallel pairwise local sequence alignment program that runs on commodity Graphics Processing Units (GPUs) in common workstations. MUMmerGPU uses the new Compute Unified Device Architecture (CUDA) from nVidia to align multiple query sequences against a single reference sequence stored as a suffix tree. By processing the queries in parallel on the highly parallel graphics card, MUMmerGPU achieves more than a 10-fold speedup over a serial CPU version of the sequence alignment kernel, and outperforms the exact alignment component of MUMmer on a high-end CPU by 3.5-fold in total application time when aligning reads from recent sequencing projects using Solexa/Illumina, 454, and Sanger sequencing technologies. Conclusion MUMmerGPU is a low cost, ultra-fast sequence alignment program designed to handle the increasing volume of data produced by new, high-throughput sequencing technologies. MUMmerGPU demonstrates that even memory-intensive applications can run significantly faster on the relatively low-cost GPU than on the CPU.
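
    The essential idea, one query per thread, can be illustrated with a deliberately naive exact-match kernel. Note that MUMmerGPU itself walks a suffix tree of the reference rather than scanning it linearly, so the sketch below (with assumed fixed-length queries) shows only the query-parallel pattern, not the actual algorithm.

        // Naive illustration of query parallelism: one thread scans the
        // reference for an exact occurrence of its query.
        __global__ void exactMatch(const char* ref, int refLen,
                                   const char* queries, int qLen, int nQueries,
                                   int* hitPos) {
            int q = blockIdx.x * blockDim.x + threadIdx.x;
            if (q >= nQueries) return;
            const char* query = queries + q * qLen;  // fixed-length queries
            int hit = -1;
            for (int i = 0; i + qLen <= refLen && hit < 0; ++i) {
                int j = 0;
                while (j < qLen && ref[i + j] == query[j]) ++j;
                if (j == qLen) hit = i;              // first match position
            }
            hitPos[q] = hit;                         // -1 if no match found
        }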

  16. A novel dual-functional MEMS sensor integrating both pressure and temperature units

    Energy Technology Data Exchange (ETDEWEB)

    Chen Tao; Zhang Zhaohua; Ren Tianling; Miao Gujin; Zhou Changjian; Lin Huiwang; Liu Litian, E-mail: RenTL@tsinghua.edu.c [National Laboratory for Information Science and Technology, Institute of Microelectronics, Tsinghua University, Beijing 100084 (China)

    2010-07-15

    This paper proposes a novel miniature dual-functional sensor integrating both pressure and temperature sensitive units on a single chip. The device wafer of the SOI is used as a piezoresistive diaphragm, which features excellent consistency in thickness. The conventional anisotropic wet etching has been abandoned, and ICP etching has been employed to etch out the reference cave so as to minimize the area of the individual device, in that the 57.4° slope has been eliminated. As a result, the average cost of a single chip is reduced. Two PN junctions with a constant ratio of the areas of their depletion regions have also been integrated on the same chip to serve as a temperature sensor, and each PN junction shows high linearity over -40 to 100 °C and low power consumption. The ion implanting process for the PN junctions is fully compatible with the piezoresistors, with no additional expenditure. The pressure sensitivity is 86 mV/MPa, while the temperature sensitivity is 1.43 mV/°C, both complying with the design objective.

  17. A novel dual-functional MEMS sensor integrating both pressure and temperature units

    International Nuclear Information System (INIS)

    Chen Tao; Zhang Zhaohua; Ren Tianling; Miao Gujin; Zhou Changjian; Lin Huiwang; Liu Litian

    2010-01-01

    This paper proposes a novel miniature dual-functional sensor integrating both pressure and temperature sensitive units on a single chip. The device wafer of the SOI is used as a piezoresistive diaphragm, which features excellent consistency in thickness. The conventional anisotropic wet etching has been abandoned, and ICP etching has been employed to etch out the reference cave so as to minimize the area of the individual device, in that the 57.4° slope has been eliminated. As a result, the average cost of a single chip is reduced. Two PN junctions with a constant ratio of the areas of their depletion regions have also been integrated on the same chip to serve as a temperature sensor, and each PN junction shows high linearity over -40 to 100 °C and low power consumption. The ion implanting process for the PN junctions is fully compatible with the piezoresistors, with no additional expenditure. The pressure sensitivity is 86 mV/MPa, while the temperature sensitivity is 1.43 mV/°C, both complying with the design objective.

  18. Mobility-limiting mechanisms in single and dual channel strained Si/SiGe MOSFETs

    International Nuclear Information System (INIS)

    Olsen, S.H.; Dobrosz, P.; Escobedo-Cousin, E.; Bull, S.J.; O'Neill, A.G.

    2005-01-01

    Dual channel strained Si/SiGe CMOS architectures currently receive great attention due to maximum performance benefits being predicted for both n- and p-channel MOSFETs. Epitaxial growth of a compressively strained SiGe layer followed by tensile strained Si can create a high mobility buried hole channel and a high mobility surface electron channel on a single relaxed SiGe virtual substrate. However, dual channel n-MOSFETs fabricated using a high thermal budget exhibit compromised mobility enhancements compared with single channel devices, in which both electron and hole channels form in strained Si. This paper investigates the mobility-limiting mechanisms of dual channel structures. The first evidence of increased interface roughness due to the introduction of compressively strained SiGe below the tensile strained Si channel is presented. Interface corrugations degrade electron mobility in the strained Si. Roughness measurements have been carried out using AFM and TEM. Filtering AFM images allowed roughness at wavelengths pertinent to carrier transport to be studied and the results are in agreement with electrical data. Furthermore, the first comparison of strain measurements in the surface channels of single and dual channel architectures is presented. Raman spectroscopy has been used to study channel strain both before and after processing and indicates that there is no impact of the buried SiGe layer on surface macrostrain. The results provide further evidence that the improved performance of the single channel devices fabricated using a high thermal budget arises from improved surface roughness and reduced Ge diffusion into the Si channel

  19. The Explicit Determinations Of Dual Plane Curves And Dual Helices In Terms Of Its Dual Curvature And Dual Torsion

    OpenAIRE

    Lee Jae Won; Choi Jin Ho; Jin Dae Ho

    2014-01-01

    In this paper, we give the explicit determinations of dual plane curves, general dual helices and dual slant helices in terms of their dual curvature and dual torsion, as a fundamental theory of dual curves in a dual 3-space.
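
    As standard background for such results (this is textbook material, not the paper's own theorems), calculations in the dual 3-space are carried out over the dual numbers, and a dual curve obeys Frenet-type equations whose coefficients are the dual curvature and the dual torsion:

        \[
          \mathbb{D} = \{\hat{a} = a + \varepsilon a^{*} : a, a^{*} \in \mathbb{R},\ \varepsilon^{2} = 0\},
          \qquad \hat{\gamma}(s) = \gamma(s) + \varepsilon\,\gamma^{*}(s),
        \]
        \[
          \hat{T}' = \hat{\kappa}\,\hat{N}, \qquad
          \hat{N}' = -\hat{\kappa}\,\hat{T} + \hat{\tau}\,\hat{B}, \qquad
          \hat{B}' = -\hat{\tau}\,\hat{N},
        \]

    where the dual curvature and dual torsion split as \(\hat{\kappa} = \kappa + \varepsilon\,\kappa^{*}\) and \(\hat{\tau} = \tau + \varepsilon\,\tau^{*}\); a general dual helix is then characterized by \(\hat{\tau}/\hat{\kappa}\) being a dual constant.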

  20. Experiences with High-Level Programming Directives for Porting Applications to GPUs

    International Nuclear Information System (INIS)

    Ding, Wei; Chapman, Barbara; Sankaran, Ramanan; Graham, Richard L.

    2012-01-01

    HPC systems now exploit GPUs within their compute nodes to accelerate program performance. As a result, high-end application development has become extremely complex at the node level. In addition to restructuring the node code to exploit the cores and specialized devices, the programmer may need to choose a programming model such as OpenMP or CPU threads in conjunction with an accelerator programming model to share and manage the different node resources. This comes at a time when programmer productivity and the ability to produce portable code have been recognized as major concerns. In order to offset the high development cost of creating CUDA or OpenCL kernels, directives have been proposed for programming accelerator devices, but their implications are not well known. In this paper, we evaluate state-of-the-art accelerator directives by programming several application kernels, explore transformations to achieve good performance, and examine the expressiveness and performance penalty of using high-level directives versus CUDA. We also compare our results to OpenMP implementations to understand the benefits of running the kernels on the accelerator versus on CPU cores.

  1. Novel fault tolerant modular system architecture for I and C applications

    International Nuclear Information System (INIS)

    Kumar, Ankit; Venkatesan, A.; Madhusoodanan, K.

    2013-01-01

    A novel fault tolerant 3U modular system architecture has been developed for safety related and safety critical I and C systems of the reactor. The design innovatively utilizes the simplest multi-drop serial bus, the Inter-Integrated Circuit (I²C) bus, for system operation with simplicity, fault tolerance and online maintainability (hot swap). An I²C bus failure mode analysis was done and the system design was hardened against the possible failure modes. The system backplane uses only passive components, dual redundant I²C buses, data consistency checks and a geographical addressing scheme to tackle bus lock-ups/stuck buses and bit flips in data transactions. A dual-CPU active/standby redundancy architecture with hot swap implements tolerance of CPU software hang conditions and hardware faults. The system cards implement hot swap for online maintainability, power supply fault containment, communication bus fault containment, and I/O channel-to-channel isolation and independence. Typical applications as a pure hardwired (without real time software) Core Temperature Monitoring System for FBRs, as a Universal Signal Conditioning System for safety related I and C systems, and as a complete control system for non-nuclear safety systems are also discussed. (author)

  2. Establishment of animal model of dual liver transplantation in rat.

    Directory of Open Access Journals (Sweden)

    Ying Zhang

    Full Text Available The animal model of the whole-size and reduced-size liver transplantation in both rat and mouse has been successfully established. Because of the difficulties and complexities in microsurgical technology, the animal model of dual liver transplantation was still not established for twelve years since the first human dual liver transplantation has been made a success. There is an essential need to establish this animal model to lay a basic foundation for clinical practice. To study the physiological and histopathological changes of dual liver transplantation, "Y" type vein from the cross part between vena cava and two iliac of donor and "Y' type prosthesis were employed to recanalize portal vein and the bile duct between dual liver grafts and recipient. The dual right upper lobes about 45-50% of the recipient liver volume were taken as donor, one was orthotopically implanted at its original position, the other was rotated 180° sagitally and heterotopically positioned in the left upper quadrant. Microcirculation parameters, liver function, immunohistochemistry and survival were analyzed to evaluate the function of dual liver grafts. No significant difference in the hepatic microcirculatory flow was found between two grafts in the first 90 minutes after reperfusion. Light and electronic microscope showed the liver architecture was maintained without obvious features of cellular destruction and the continuity of the endothelium was preserved. Only 3 heterotopically positioned graft appeared patchy desquamation of endothelial cell, mitochondrial swelling and hepatocytes cytoplasmic vacuolization. Immunohistochemistry revealed there is no difference in hepatocyte activity and the ability of endothelia to contract and relax after reperfusion between dual grafts. Dual grafts made a rapid amelioration of liver function after reperfusion. 7 rats survived more than 7 days with survival rate of 58.3.%. Using "Y" type vein and bile duct prosthesis, we

  3. Performance studies of GooFit on GPUs vs RooFit on CPUs while estimating the statistical significance of a new physical signal

    Science.gov (United States)

    Di Florio, Adriano

    2017-10-01

    In order to test the computing capabilities of GPUs with respect to traditional CPU cores, a high-statistics toy Monte Carlo technique has been implemented in both the ROOT/RooFit and GooFit frameworks with the purpose of estimating the statistical significance of the structure observed by CMS close to the kinematical boundary of the J/ψϕ invariant mass in the three-body decay B + → J/ψϕK +. GooFit is an open data analysis tool under development that interfaces ROOT/RooFit to the CUDA platform on nVidia GPUs. The optimized GooFit application running on GPUs hosted by servers in the Bari Tier2 provides striking speed-up performance with respect to the RooFit application parallelised on multiple CPUs by means of the PROOF-Lite tool. The considerable resulting speed-up, evident when comparing concurrent GooFit processes allowed by the CUDA Multi Process Service with a RooFit/PROOF-Lite process with multiple CPU workers, is presented and discussed in detail. By means of GooFit it has also been possible to explore the behaviour of a likelihood ratio test statistic in different situations, in which the Wilks Theorem may or may not apply because its regularity conditions are not satisfied.
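
    For reference, the likelihood ratio test statistic in question is the standard one: when Wilks' regularity conditions hold (nested hypotheses, true parameters away from the boundary of the allowed region), its asymptotic null distribution is chi-squared with as many degrees of freedom as there are extra free parameters under the alternative hypothesis,

        \[
          q = -2\,\ln\frac{\mathcal{L}(H_{0})}{\mathcal{L}(H_{1})}
          \;\longrightarrow\; \chi^{2}_{k} \quad (n \to \infty).
        \]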

  4. Implementations of a four-level mechanical architecture for fault-tolerant robots

    International Nuclear Information System (INIS)

    Hooper, Richard; Sreevijayan, Dev; Tesar, Delbert; Geisinger, Joseph; Kapoor, Chelan

    1996-01-01

    This paper describes a fault tolerant mechanical architecture with four levels devised and implemented in concert with NASA (Tesar, D. and Sreevijayan, D., Four-level fault tolerance in manipulator design for space operations. In First Int. Symp. Measurement and Control in Robotics (ISMCR '90), Houston, Texas, 20-22 June 1990.) Subsequent work has clarified and revised the architecture. The four levels proceed from fault tolerance at the actuator level, to fault tolerance via in-parallel chains, to fault tolerance using serial kinematic redundancy, and finally to the fault tolerance multiple arm systems provide. This is a subsumptive architecture because each successive layer can incorporate the fault tolerance provided by all layers beneath. For instance a serially-redundant robot can incorporate dual fault-tolerant actuators. Redundant systems provide the fault tolerance, but the guiding principle of this architecture is that functional redundancies actively increase the performance of the system. Redundancies do not simply remain dormant until needed. This paper includes specific examples of hardware and/or software implementation at all four levels

  5. Dual (oxygen and nitrogen) isotopic characterization of the museum archived nitrates from the United States of America, South Africa and Australia.

    Science.gov (United States)

    Mizota, Chitoshi; Hosono, Takahiro; Matsunaga, Midori; Okumura, Azusa

    2018-06-01

    The dual (oxygen and nitrogen) isotopic composition of museum-archived nitrates from the United States of America, South Africa and Australia was studied. The analyzed specimens were collected from the mid-19th to the early 20th centuries, and represent the world-wide acquisitions of the Smithsonian Institution National Museum of Natural History (Washington, D. C., USA) and the Natural History Museum (London, UK). The samples consist of transparent to semi-transparent aggregates of minute, euhedral nitrate crystallites, which imply precipitation from percolating fluids under ample space and dry regimes. The major nitrate chemistry is saltpetre (KNO3) with minor nitratine (NaNO3). A binary plot of δ15N vs. δ18O of almost all the nitrates indicates a trend reflecting a microbial origin through nitrification of ammonium. The diagram excludes a contribution of meteoric origin formed by mass-independent, photochemical reaction of NO with ozone in the stratosphere. Calculated paleo-ambient fluid compositions responsible for microbial nitrification imply extreme evaporative concentration of the relevant fluids under dry climatic regimes in the Northern Cape Province (South Africa) and in the Northern Territory (central Australia), and even throughout the United States of America. The dual isotopic characterization provides direct evidence of the origin of the museum-archived nitrates. Copyright © 2017 Elsevier B.V. All rights reserved.

  6. Proposed hardware architectures of particle filter for object tracking

    Science.gov (United States)

    Abd El-Halym, Howida A.; Mahmoud, Imbaby Ismail; Habib, SED

    2012-12-01

    In this article, efficient hardware architectures for the particle filter (PF) are presented. We propose three different architectures for Sequential Importance Resampling Filter (SIRF) implementation. The first architecture is a two-step sequential PF machine, where particle sampling, weighting, and output calculations are carried out in parallel during the first step, followed by sequential resampling in the second step. For the weight computation step, a piecewise linear function is used instead of the classical exponential function. This decreases the complexity of the architecture without degrading the results. The second architecture speeds up the resampling step via a parallel, rather than a serial, architecture. This second architecture targets a balance between hardware resources and the speed of operation. The third architecture implements the SIRF as a distributed PF composed of several processing elements and a central unit. All the proposed architectures are captured in VHDL, synthesized using the Xilinx environment, and verified using the ModelSim simulator. Synthesis results confirmed the resource reduction and speed-up advantages of our architectures.
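
    The algorithm being mapped to hardware is the standard SIR cycle. A compact host-side reference follows (a sketch of the algorithm only, not the proposed VHDL designs; the piecewise-linear weight function is an assumed stand-in for the one used in the paper):

        #include <cstdlib>
        #include <vector>
        #include <cmath>

        // One SIR cycle; the paper's first architecture runs the sample/weight
        // stage in parallel and the resampling stage sequentially.
        struct Particle { float x; float w; };

        // Piecewise-linear stand-in for a Gaussian likelihood exp(-0.5*e*e):
        // the kind of low-cost approximation used instead of the exponential.
        float plWeight(float e) {
            float a = std::fabs(e);
            return a < 2.0f ? 1.0f - 0.5f * a : 0.0f;
        }

        void sirStep(std::vector<Particle>& p, float z /* measurement */) {
            float wSum = 0.0f;
            for (auto& pi : p) {                   // stage 1: sample + weight
                pi.x += 0.01f * (std::rand() / (float)RAND_MAX - 0.5f);
                pi.w = plWeight(z - pi.x);         // weight against measurement
                wSum += pi.w;
            }
            std::vector<Particle> out(p.size());   // stage 2: systematic resampling
            float step = wSum / p.size();
            float u = 0.5f * step;                 // deterministic offset for brevity
            float c = p[0].w;
            for (size_t i = 0, j = 0; i < p.size(); ++i, u += step) {
                while (u > c && j + 1 < p.size()) c += p[++j].w;
                out[i] = { p[j].x, 1.0f / p.size() };  // equal weights afterwards
            }
            p.swap(out);
        }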

  7. Accelerating image reconstruction in dual-head PET system by GPU and symmetry properties.

    Directory of Open Access Journals (Sweden)

    Cheng-Ying Chou

    Full Text Available Positron emission tomography (PET) is an important imaging modality in both clinical usage and research studies. We have developed a compact high-sensitivity PET system consisting of two large-area panel PET detector heads, which produce more than 224 million lines of response and thus impose dramatic computational demands. In this work, we employed a state-of-the-art graphics processing unit (GPU), the NVIDIA Tesla C2070, to yield an efficient reconstruction process. Our approach integrates the distinctive features of the symmetry properties of the imaging system and of the GPU architecture, including block/warp/thread assignments and effective memory usage, to accelerate the computations for ordered subset expectation maximization (OSEM) image reconstruction. The OSEM reconstruction algorithms were implemented in both CPU-based and GPU-based code, and their computational performance was quantitatively analyzed and compared. The results showed that the GPU-accelerated scheme can drastically reduce the reconstruction time and thus can largely expand the applicability of the dual-head PET system.
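
    The OSEM iteration that such implementations parallelize is the standard update, applied subset by subset (here \(x_j\) is a voxel value, \(y_i\) the measured counts on line of response \(i\), \(a_{ij}\) the system matrix, and \(S_m\) the \(m\)-th subset):

        \[
          x_{j}^{(m+1)} = \frac{x_{j}^{(m)}}{\sum_{i \in S_{m}} a_{ij}}
          \sum_{i \in S_{m}} a_{ij}\,
          \frac{y_{i}}{\sum_{j'} a_{ij'}\, x_{j'}^{(m)}}.
        \]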

  8. 75 FR 5637 - Culturally Significant Objects Imported for Exhibition Determinations: “Architecture as Icon...

    Science.gov (United States)

    2010-02-03

    ... Determinations: ``Architecture as Icon: Perception and Representation of Architecture in Byzantine Art'' SUMMARY... objects to be included in the exhibition ``Architecture as Icon: Perception and Representation of Architecture in Byzantine Art,'' imported from abroad for temporary exhibition within the United States, are of...

  9. Robust equivalent consumption-based controllers for a dual-mode diesel parallel HEV

    International Nuclear Information System (INIS)

    Finesso, Roberto; Spessa, Ezio; Venditti, Mattia

    2016-01-01

    Highlights: • Non-plug-in dual-mode parallel hybrid architecture. • Cross-validated machine learning for robust equivalent consumption-based controllers. • Optimal control strategy based on fuel consumption, NOx and battery aging. • Impact of different equivalent consumption definitions on HEV performance. • Correlation between vehicle braking energy and SOC variation in the traction stages. - Abstract: New equivalent consumption minimization strategy (ECMS) tools have been developed and applied to identify the optimal control strategy of a dual-mode parallel hybrid electric vehicle equipped with a compression-ignition engine. In this architecture, the electric machine is coupled to the engine through either a single-speed gearbox (torque coupling) or a planetary gear set (speed coupling). One of the main novelties of the present study concerns the definition of the instantaneous equivalent consumption (EC) function, which takes into account not only fuel consumption (FC) and the energy flow through the electric components, but also NOx emissions, battery aging, and the battery SOC. The EC function has been trained using a cross-validation machine-learning technique, based on a genetic algorithm, where the training data set has been selected in order to maximize performance over a testing data set. The adoption of this technique, in conjunction with the new definition of the EC, has led to the identification of very robust controllers, which provide accurate control for different driving scenarios, even when the EC function is not specifically trained on the same missions over which it is tested. To this aim, a data set of fifty driving cycles and six user-defined missions, which cover a total distance of 70–100 km, has been considered as the training driving set. The ECMS controllers can be implemented in a vehicle control unit, and their performance has proved to be close to that of a dynamic programming tool, which has been used here as a benchmark.
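
    The abstract does not spell out the EC function. The classical ECMS kernel it extends trades electrical power against fuel power through an equivalence factor \(s\), and the NOx and battery-aging terms can be pictured as weighted penalties added to it; a schematic form, with \(\alpha\) and \(\beta\) as assumed weighting factors, is

        \[
          \dot m_{eq}(t) = \dot m_{f}(t) + \frac{s(t)}{Q_{lhv}}\,P_{batt}(t)
          \;+\; \alpha\,\dot m_{NO_x}(t) \;+\; \beta\,\dot A_{batt}(t),
        \]

    where \(Q_{lhv}\) is the lower heating value of the fuel and \(\dot A_{batt}\) an instantaneous battery-aging rate.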

  10. Two Dual Ion Spectrometer Flight Units of the Fast Plasma Instrument Suite (FPI) for the Magnetospheric Multiscale Mission (MMS)

    Science.gov (United States)

    Adams, Mitzi

    2014-01-01

    Two Dual Ion Spectrometer flight units of the Fast Plasma Instrument Suite (FPI) for the Magnetospheric Multiscale Mission (MMS) have returned to MSFC for flight testing. Anticipated to begin on June 30, tests will ensue in the Low Energy Electron and Ion Facility of the Heliophysics and Planetary Science Office (ZP13), managed by Dr. Victoria Coffey of the Natural Environments Branch of the Engineering Directorate (EV44). The MMS mission consists of four identical spacecraft, whose purpose is to study magnetic reconnection in the boundary regions of Earth's magnetosphere.

  11. Multi-GPU configuration of 4D intensity modulated radiation therapy inverse planning using global optimization

    Science.gov (United States)

    Hagan, Aaron; Sawant, Amit; Folkerts, Michael; Modiri, Arezoo

    2018-01-01

    We report on the design, implementation and characterization of a multi-graphics processing unit (GPU) computational platform for higher-order optimization in radiotherapy treatment planning. In collaboration with a commercial vendor (Varian Medical Systems, Palo Alto, CA), a research prototype GPU-enabled Eclipse (V13.6) workstation was configured. The hardware consisted of dual 8-core Xeon processors, 256 GB RAM and four NVIDIA Tesla K80 general purpose GPUs. We demonstrate the utility of this platform for large radiotherapy optimization problems through the development and characterization of a parallelized particle swarm optimization (PSO) four dimensional (4D) intensity modulated radiation therapy (IMRT) technique. The PSO engine was coupled to the Eclipse treatment planning system via a vendor-provided scripting interface. Specific challenges addressed in this implementation were (i) data management and (ii) non-uniform memory access (NUMA). For the former, we alternated between parameters over which the computation process was parallelized. For the latter, we reduced the amount of data required to be transferred over the NUMA bridge. The datasets examined in this study were approximately 300 GB in size, including 4D computed tomography images, anatomical structure contours and dose deposition matrices. For evaluation, we created a 4D-IMRT treatment plan for one lung cancer patient and analyzed computation speed while varying several parameters (number of respiratory phases, GPUs, PSO particles, and data matrix sizes). The optimized 4D-IMRT plan enhanced sparing of organs at risk by an average reduction of 26% in maximum dose, compared to the clinically optimized IMRT plan, where the internal target volume was used. We validated our computation time analyses in two additional cases. The computation speed in our implementation did not monotonically increase with the number of GPUs. The optimal number of GPUs (five, in our study) is directly related to the
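
    The particle swarm engine iterates the standard PSO velocity and position updates; in a multi-GPU setting the fitness (dose objective) evaluations can be spread across devices. With inertia weight \(w\), acceleration constants \(c_1, c_2\), uniform random numbers \(r_1, r_2\), personal best \(p_i\) and global best \(g\), the updates read

        \[
          v_{i} \leftarrow w\,v_{i} + c_{1} r_{1} (p_{i} - x_{i}) + c_{2} r_{2} (g - x_{i}),
          \qquad x_{i} \leftarrow x_{i} + v_{i}.
        \]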

  12. BEAGLE: an application programming interface and high-performance computing library for statistical phylogenetics.

    Science.gov (United States)

    Ayres, Daniel L; Darling, Aaron; Zwickl, Derrick J; Beerli, Peter; Holder, Mark T; Lewis, Paul O; Huelsenbeck, John P; Ronquist, Fredrik; Swofford, David L; Cummings, Michael P; Rambaut, Andrew; Suchard, Marc A

    2012-01-01

    Phylogenetic inference is fundamental to our understanding of most aspects of the origin and evolution of life, and in recent years, there has been a concentration of interest in statistical approaches such as Bayesian inference and maximum likelihood estimation. Yet, for large data sets and realistic or interesting models of evolution, these approaches remain computationally demanding. High-throughput sequencing can yield data for thousands of taxa, but scaling to such problems using serial computing often necessitates the use of nonstatistical or approximate approaches. The recent emergence of graphics processing units (GPUs) provides an opportunity to leverage their excellent floating-point computational performance to accelerate statistical phylogenetic inference. A specialized library for phylogenetic calculation would allow existing software packages to make more effective use of available computer hardware, including GPUs. Adoption of a common library would also make it easier for other emerging computing architectures, such as field programmable gate arrays, to be used in the future. We present BEAGLE, an application programming interface (API) and library for high-performance statistical phylogenetic inference. The API provides a uniform interface for performing phylogenetic likelihood calculations on a variety of compute hardware platforms. The library includes a set of efficient implementations and can currently exploit hardware including GPUs using NVIDIA CUDA, central processing units (CPUs) with Streaming SIMD Extensions and related processor supplementary instruction sets, and multicore CPUs via OpenMP. To demonstrate the advantages of a common API, we have incorporated the library into several popular phylogenetic software packages. The BEAGLE library is free open source software licensed under the Lesser GPL and available from http://beagle-lib.googlecode.com. An example client program is available as public domain software.

  13. Low Power Implementation of Non Power-of-Two FFTs on Coarse-Grain Reconfigurable Architectures

    NARCIS (Netherlands)

    Zhang, Q.; Wolkotte, P.T.; Smit, Gerardus Johannes Maria; Rivaton, Arnaud; Quevremont, Jérôme

    2005-01-01

    The DRM standard for digital radio broadcast in the AM band requires integrated devices for radio receivers with very low power. A System on Chip (SoC) called DiMITRI was developed based on a dual ARM9 RISC core architecture. Analyses showed that most of the computation power is used in the Coded Orthogonal

  14. Modular, Cost-Effective, Extensible Avionics Architecture for Secure, Mobile Communications

    Science.gov (United States)

    Ivancic, William D.

    2007-01-01

    Current onboard communication architectures are based upon an all-in-one communications management unit. This unit and its associated radio systems have regularly been designed as one-off, proprietary systems. As such, the architecture lacks flexibility and cannot adapt easily to new technology, new communication protocols, and new communication links. This paper describes the current avionics communication architecture and provides a historical perspective on the evolution of this system. A new onboard architecture is proposed that allows full use of commercial-off-the-shelf technologies to be integrated in a modular approach, thereby enabling a flexible, cost-effective and fully deployable design that can take advantage of ongoing advances in the computer, cryptography, and telecommunications industries.

  15. From green architecture to architectural green

    DEFF Research Database (Denmark)

    Earon, Ofri

    2011-01-01

    that describes the architectural exclusivity of this particular architecture genre. The adjective green expresses architectural qualities differentiating green architecture from none-green architecture. Currently, adding trees and vegetation to the building’s facade is the main architectural characteristics...... they have overshadowed the architectural potential of green architecture. The paper questions how a green space should perform, look like and function. Two examples are chosen to demonstrate thorough integrations between green and space. The examples are public buildings categorized as pavilions. One......The paper investigates the topic of green architecture from an architectural point of view and not an energy point of view. The purpose of the paper is to establish a debate about the architectural language and spatial characteristics of green architecture. In this light, green becomes an adjective...

  16. Connectivity: Performance Portable Algorithms for graph connectivity v. 0.1

    Energy Technology Data Exchange (ETDEWEB)

    2017-09-21

    Graphs occur in many real-world settings, from road networks and social networks to scientific simulations. Connectivity is a graph analysis software package for computing graph connectivity on modern architectures such as multicore CPUs, Xeon Phi and GPUs.
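
    A common GPU formulation of connected components, shown here as a generic illustration rather than this package's implementation, is iterative min-label propagation over a CSR adjacency: every vertex repeatedly adopts the smallest label among itself and its neighbours until a sweep changes nothing, after which vertices sharing a label form one component.

        // One sweep of min-label propagation over a CSR graph.
        __global__ void propagate(const int* rowPtr, const int* colIdx,
                                  int* label, int nVerts, int* changed) {
            int v = blockIdx.x * blockDim.x + threadIdx.x;
            if (v >= nVerts) return;
            int best = label[v];
            for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e)
                best = min(best, label[colIdx[e]]);
            if (best < label[v]) {
                label[v] = best;   // benign race: labels only decrease
                *changed = 1;
            }
        }
        // Host loop: initialize label[v] = v, then relaunch the kernel
        // (resetting *changed to 0 each time) until *changed stays 0.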

  17. Dual-energy subtraction radiography of the breast

    International Nuclear Information System (INIS)

    Asaga, Taro; Masuzawa, Chihiro; Kawahara, Satoru; Motohashi, Hisahiko; Okamoto, Takashi; Tamura, Nobuo

    1988-01-01

    Dual-energy projection radiography was applied to breast examination. To perform the dual-energy subtraction radiography using a digital radiography unit, high and low-energy exposures were made at an appropriate time interval under differing X-ray exposure conditions. Dual-energy subtraction radiography was performed in 41 cancer patients in whom the tumor shadow was equivocal or the border of cancer infiltration was not clearly demonstrated by compression mammography, and 15 patients with benign diseases such as fibrocystic disease, cyst and fibroadenoma. In 21 cases out of the 41 cancer patients, the dual-energy subtraction radiography clearly visualized the malignant tumor shadows and the border of cancer infiltration and the daughter nodules by removing the shadows of normal mammary gland. On the other hand, benign diseases such as fibrocystic disease and cyst could be diagnosed as such, because the tumor shadow and the irregularly concentrated image of mammary gland disappeared by the dual-energy subtraction. These results suggest that this new technique will be useful in examination of breast masses. (author)

  18. Dual-energy subtraction radiography of the breast

    Energy Technology Data Exchange (ETDEWEB)

    Asaga, Taro; Masuzawa, Chihiro; Kawahara, Satoru; Motohashi, Hisahiko; Okamoto, Takashi; Tamura, Nobuo

    1988-06-01

    Dual-energy projection radiography was applied to breast examination. To perform the dual-energy subtraction radiography using a digital radiography unit, high and low-energy exposures were made at an appropriate time interval under differing X-ray exposure conditions. Dual-energy subtraction radiography was performed in 41 cancer patients in whom the tumor shadow was equivocal or the border of cancer infiltration was not clearly demonstrated by compression mammography, and 15 patients with benign diseases such as fibrocystic disease, cyst and fibroadenoma. In 21 cases out of the 41 cancer patients, the dual-energy subtraction radiography clearly visualized the malignant tumor shadows and the border of cancer infiltration and the daughter nodules by removing the shadows of normal mammary gland. On the other hand, benign diseases such as fibrocystic disease and cyst could be diagnosed as such, because the tumor shadow and the irregularly concentrated image of mammary gland disappeared by the dual-energy subtraction. These results suggest that this new technique will be useful in examination of breast masses.
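
    The processing step behind both records is the standard weighted logarithmic subtraction of the two exposures; the weight \(w\) is tuned so that the normal mammary gland signal cancels, leaving structures whose attenuation has a different energy dependence:

        \[
          S(x, y) = \ln I_{H}(x, y) - w\,\ln I_{L}(x, y),
        \]

    where \(I_{H}\) and \(I_{L}\) are the high- and low-energy images.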

  19. Towards industry strength mapping of AUTOSAR automotive functionality on multicore architectures

    DEFF Research Database (Denmark)

    Avasalcai, Cosmin Florin; Budhrani, Dhanesh; Pop, Paul

    2017-01-01

    The automotive electronic architectures have moved from federated architectures, where one function is implemented in one ECU (Electronic Control Unit), to distributed architectures, consisting of several multicore ECUs. In addition, multicore ECUs are being adopted because of better performance,...... engineer in the mapping task. We have successfully evaluated AUTOMAP on several realistic use cases from Volvo Trucks....

  20. IMPLICIT DUAL CONTROL BASED ON PARTICLE FILTERING AND FORWARD DYNAMIC PROGRAMMING.

    Science.gov (United States)

    Bayard, David S; Schumitzky, Alan

    2010-03-01

    This paper develops a sampling-based approach to implicit dual control. Implicit dual control methods synthesize stochastic control policies by systematically approximating the stochastic dynamic programming equations of Bellman, in contrast to explicit dual control methods that artificially induce probing into the control law by modifying the cost function to include a term that rewards learning. The proposed implicit dual control approach is novel in that it combines a particle filter with a policy-iteration method for forward dynamic programming. The integration of the two methods provides a complete sampling-based approach to the problem. Implementation of the approach is simplified by making use of a specific architecture denoted as an H-block. Practical suggestions are given for reducing computational loads within the H-block for real-time applications. As an example, the method is applied to the control of a stochastic pendulum model having unknown mass, length, initial position and velocity, and unknown sign of its dc gain. Simulation results indicate that active controllers based on the described method can systematically improve closed-loop performance with respect to other more common stochastic control approaches.

  1. Architecture on Architecture

    DEFF Research Database (Denmark)

    Olesen, Karen

    2016-01-01

    that is not scientific or academic but is more like a latent body of data that we find embedded in existing works of architecture. This information, it is argued, is not limited by the historical context of the work. It can be thought of as a virtual capacity – a reservoir of spatial configurations that can...... correlation between the study of existing architectures and the training of competences to design for present-day realities.......This paper will discuss the challenges faced by architectural education today. It takes as its starting point the double commitment of any school of architecture: on the one hand the task of preserving the particular knowledge that belongs to the discipline of architecture, and on the other hand...

  2. An Intelligent Agent based Architecture for Visual Data Mining

    OpenAIRE

    Hamdi Ellouzi; Hela Ltifi; Mounir Ben Ayed

    2016-01-01

    The aim of this paper is to present an intelligent architecture for Decision Support Systems (DSS) based on visual data mining. This architecture applies multi-agent technology to facilitate the design and development of DSS in complex and dynamic environments. Multi-Agent Systems add a high level of abstraction. To validate the proposed architecture, it is implemented to develop a distributed visual data mining based DSS to predict nosocomial infections occurrence in intensive care units. Th...

  3. Design and analysis of a dual mode CMOS field programmable analog array

    International Nuclear Information System (INIS)

    Cheng Xiaoyan; Yang Haigang; Yin Tao; Wu Qisong; Zhang Hongfeng; Liu Fei

    2014-01-01

    This paper presents a novel field-programmable analog array (FPAA) architecture featuring a dual mode comprising discrete-time (DT) and continuous-time (CT) operation modes, along with a highly routable connection box (CB) based interconnection lattice. The dual-mode circuit of the FPAA is capable of achieving the targeted optimal performance in different applications. The architecture utilizes the routing switches in a CB not only for signal interconnection but also to control the electrical charge transfer required in switched-capacitor circuits. This way, the performance of the circuit in either mode is not hampered by the added programmability. The proposed FPAA is designed and implemented in a 0.18 μm standard CMOS process with a 3.3 V supply voltage. Post-layout simulation shows that a maximum bandwidth of 265 MHz through the interconnection network is achieved. The measured results from the demonstrated examples show that a maximum signal bandwidth of up to 2 MHz is obtained in CT mode, with a spurious-free dynamic range of 54 dB, while the signal processing precision in DT mode reaches 96.4%. (semiconductor integrated circuits)

  4. High Efficiency EBCOT with Parallel Coding Architecture for JPEG2000

    Directory of Open Access Journals (Sweden)

    Chiang Jen-Shiun

    2006-01-01

    Full Text Available This work presents a parallel context-modeling coding architecture and a matching arithmetic coder (MQ-coder for the embedded block coding (EBCOT unit of the JPEG2000 encoder. Tier-1 of the EBCOT consumes most of the computation time in a JPEG2000 encoding system. The proposed parallel architecture can increase the throughput rate of the context modeling. To match the high throughput rate of the parallel context-modeling architecture, an efficient pipelined architecture for context-based adaptive arithmetic encoder is proposed. This encoder of JPEG2000 can work at 180 MHz to encode one symbol each cycle. Compared with the previous context-modeling architectures, our parallel architectures can improve the throughput rate up to 25%.

  5. Proactive Modeling of Market, Product and Production Architectures

    DEFF Research Database (Denmark)

    Mortensen, Niels Henrik; Hansen, Christian Lindschou; Hvam, Lars

    2011-01-01

    This paper presents an operational model that allows description of market, products and production architectures. The main feature of this model is the ability to describe both structural and functional aspect of architectures. The structural aspect is an answer to the question: What constitutes...... the architecture, e.g. standard designs, design units and interfaces? The functional aspect is an answer to the question: What is the behaviour or the architecture, what is it able to do, i.e. which products at which performance levels can be derived from the architecture? Among the most important benefits...... of this model is the explicit ability to describe what the architecture is prepared for, and what it is not prepared for - concerning development of future derivative products. The model has been applied in a large scale global product development project. Among the most important benefits is contribution to...

  6. Running Parallel Discrete Event Simulators on Sierra

    Energy Technology Data Exchange (ETDEWEB)

    Barnes, P. D. [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Jefferson, D. R. [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)

    2015-12-03

    In this proposal we consider porting the ROSS/Charm++ simulator and the discrete event models that run under its control so that they run on the Sierra architecture and make efficient use of the Volta GPUs.

  7. How General-Purpose can a GPU be?

    Directory of Open Access Journals (Sweden)

    Philip Machanick

    2015-12-01

    Full Text Available The use of graphics processing units (GPUs) in general-purpose computation (GPGPU) is a growing field. GPU instruction sets, while implementing a graphics pipeline, draw from a range of single instruction multiple datastream (SIMD) architectures characteristic of the heyday of supercomputers. Yet only one of these SIMD instruction sets has been of application on a wide enough range of problems to survive the era when the full range of supercomputer design variants was being explored: vector instructions. This paper proposes a reconceptualization of the GPU as a multicore design with minimal exotic modes of parallelism so as to make GPGPU truly general.

  8. Manycore processing of repeated range queries over massive moving objects observations

    DEFF Research Database (Denmark)

    Lettich, Francesco; Orlando, Salvatore; Silvestri, Claudio

    2014-01-01

    decomposition and allows us to effectively tackle a broad range of spatial object distributions, even very skewed ones. Also, to deal with the architectural peculiarities and limitations of GPUs, we adopt non-trivial GPU data structures that avoid the need for locked memory accesses and favour coalesced memory accesses, thus enhancing the overall memory throughput. To the best of our knowledge this is the first work that exploits GPUs to efficiently solve repeated range queries over massive sets of continuously moving objects characterized by highly skewed spatial distributions. In comparison with state

  9. Pushing Memory Bandwidth Limitations Through Efficient Implementations of Block-Krylov Space Solvers on GPUs

    Energy Technology Data Exchange (ETDEWEB)

    Clark, M. A. [NVIDIA Corp., Santa Clara; Strelchenko, Alexei [Fermilab; Vaquero, Alejandro [Utah U.; Wagner, Mathias [NVIDIA Corp., Santa Clara; Weinberg, Evan [Boston U.

    2017-10-26

    Lattice quantum chromodynamics simulations in nuclear physics have benefited from a tremendous number of algorithmic advances such as multigrid and eigenvector deflation. These improve the time to solution but do not alleviate the intrinsic memory-bandwidth constraints of the matrix-vector operation dominating iterative solvers. Batching this operation for multiple vectors and exploiting cache and register blocking can yield a super-linear speed up. Block-Krylov solvers can naturally take advantage of such batched matrix-vector operations, further reducing the iterations to solution by sharing the Krylov space between solves. However, practical implementations typically suffer from the quadratic scaling in the number of vector-vector operations. Using the QUDA library, we present an implementation of a block-CG solver on NVIDIA GPUs which reduces the memory-bandwidth complexity of vector-vector operations from quadratic to linear. We present results for the HISQ discretization, showing a 5x speedup compared to highly-optimized independent Krylov solves on NVIDIA's SaturnV cluster.

  10. Acceleration of the OpenFOAM-based MHD solver using graphics processing units

    International Nuclear Information System (INIS)

    He, Qingyun; Chen, Hongli; Feng, Jingchao

    2015-01-01

    Highlights: • A 3D PISO-MHD solver was implemented on Kepler-class graphics processing units (GPUs) using CUDA technology. • A consistent and conservative scheme is used in the code, which was validated by three basic benchmarks in rectangular and round ducts. • Parallel CPU and GPU acceleration were compared relative to a single-core CPU for MHD and non-MHD problems. • Different preconditioners for the MHD solver were compared, and the results showed that the AMG method is better for these calculations. - Abstract: The pressure-implicit with splitting of operators (PISO) magnetohydrodynamics (MHD) solver for the coupled Navier–Stokes and Maxwell equations was implemented on Kepler-class graphics processing units (GPUs) using CUDA technology. The solver is developed on the open source code OpenFOAM and is based on a consistent and conservative scheme suitable for simulating MHD flow under a strong magnetic field in a fusion liquid metal blanket with structured or unstructured meshes. We verified the validity of the implementation on several standard cases, including benchmark I of Shercliff and Hunt's cases, benchmark II of fully developed circular pipe MHD flow cases, and benchmark III of the KIT experimental case. The computational performance of the GPU implementation was examined by comparing its double precision run times with those of essentially the same algorithms and meshes on a CPU. The results showed that a GPU (GTX 770) can outperform a server-class 4-core, 8-thread CPU (Intel Core i7-4770k) by a factor of 2 at least.

  11. Acceleration of the OpenFOAM-based MHD solver using graphics processing units

    Energy Technology Data Exchange (ETDEWEB)

    He, Qingyun; Chen, Hongli, E-mail: hlchen1@ustc.edu.cn; Feng, Jingchao

    2015-12-15

    Highlights: • A 3D PISO-MHD solver was implemented on Kepler-class graphics processing units (GPUs) using CUDA technology. • A consistent and conservative scheme is used in the code, which was validated by three basic benchmarks in rectangular and round ducts. • Parallel CPU and GPU acceleration were compared relative to a single-core CPU for MHD and non-MHD problems. • Different preconditioners for the MHD solver were compared, and the results showed that the AMG method is better for these calculations. - Abstract: The pressure-implicit with splitting of operators (PISO) magnetohydrodynamics (MHD) solver for the coupled Navier–Stokes and Maxwell equations was implemented on Kepler-class graphics processing units (GPUs) using CUDA technology. The solver is developed on the open source code OpenFOAM and is based on a consistent and conservative scheme suitable for simulating MHD flow under a strong magnetic field in a fusion liquid metal blanket with structured or unstructured meshes. We verified the validity of the implementation on several standard cases, including benchmark I of Shercliff and Hunt's cases, benchmark II of fully developed circular pipe MHD flow cases, and benchmark III of the KIT experimental case. The computational performance of the GPU implementation was examined by comparing its double precision run times with those of essentially the same algorithms and meshes on a CPU. The results showed that a GPU (GTX 770) can outperform a server-class 4-core, 8-thread CPU (Intel Core i7-4770k) by a factor of 2 at least.

  12. Dual-anticipating, dual and dual-lag synchronization in modulated time-delayed systems

    International Nuclear Information System (INIS)

    Ghosh, Dibakar; Chowdhury, A. Roy

    2010-01-01

    In this Letter, dual synchronization in modulated time-delay systems using a delay feedback controller is proposed. Based on Lyapunov stability theory, we suggest a general method to achieve dual-anticipating, dual, and dual-lag synchronization of time-delayed chaotic systems, and we derive the existence and sufficient stability conditions. Numerically, it is shown that dual synchronization is also possible when the driving system contains two completely different systems. The effect of parameter mismatch on dual synchronization is also discussed. As an example, numerical simulations for the Mackey-Glass and Ikeda systems are conducted, which are in good agreement with the theoretical analysis.

  13. An agent based architecture for high-risk neonate management at neonatal intensive care unit.

    Science.gov (United States)

    Malak, Jaleh Shoshtarian; Safdari, Reza; Zeraati, Hojjat; Nayeri, Fatemeh Sadat; Mohammadzadeh, Niloofar; Farajollah, Seide Sedighe Seied

    2018-01-01

    In recent years, the use of new tools and technologies has decreased the neonatal mortality rate. Despite the positive effect of using these technologies, decisions are complex and uncertain in critical conditions when the neonate is preterm or has a low birth weight or malformations. There is a need to automate the high-risk neonate management process by creating real-time and more precise decision support tools. The aim was to create a collaborative and real-time environment to manage neonates with critical conditions at the NICU (Neonatal Intensive Care Unit) and to overcome high-risk neonate management weaknesses by applying a multi-agent-based analysis and design methodology as a new solution for NICU management. This study was basic research in medical informatics method development, carried out in 2017. The requirement analysis was done by reviewing articles on NICU decision support systems. PubMed, Science Direct, and IEEE databases were searched. Only English articles published after 1990 were included; also, a needs assessment was done by reviewing the extracted features and the current processes in the NICU environment where the research was conducted. We analyzed the requirements and identified the main system roles (agents) and interactions by a comparative study of existing NICU decision support systems. The Universal Multi Agent Platform (UMAP) was applied to implement a prototype of our multi-agent-based high-risk neonate management architecture. Local environment agents interacted inside a container, and each container interacted with external resources, including other NICU systems and consultation centers. In the NICU container, the main identified agents were reception, monitoring, NICU registry, and outcome prediction, which interacted with human agents including nurses and physicians. Managing patients in the NICU requires online data collection, real-time collaboration, and management of many components. Multi agent systems are applied as

  14. Assembly of finite element methods on graphics processors

    KAUST Repository

    Cecka, Cris; Lew, Adrian J.; Darve, E.

    2010-01-01

    in assembling and solving sparse linear systems with NVIDIA GPUs and the Compute Unified Device Architecture (CUDA) are created and analyzed. Multiple strategies for efficient use of global, shared, and local memory, methods to achieve memory coalescing

  15. Systematic approach in optimizing numerical memory-bound kernels on GPU

    KAUST Repository

    Abdelfattah, Ahmad; Keyes, David E.; Ltaief, Hatem

    2013-01-01

    memory-bound DLA kernels on GPUs, by taking advantage of the underlying device's architecture (e.g., high throughput). This methodology proved to outperform existing state-of-the-art GPU implementations for the symmetric matrix-vector multiplication (SYMV

  16. Decentralized and Modular Electrical Architecture

    Science.gov (United States)

    Elisabelar, Christian; Lebaratoux, Laurence

    2014-08-01

    This paper presents studies on the definition and design of a decentralized and modular electrical architecture that can be used for power distribution, active thermal control (ATC) and standard input-output electrical interfaces. Traditionally implemented inside a central unit such as an OBC or RTU, these interfaces can be dispatched throughout the satellite by using MicroRTU. CNES proposes a similar approach to MicroRTU. The system is based on a bus called BRIO (Bus Réparti des IO), which is composed of a power bus and an RS485 digital bus. The BRIO architecture is made of several miniature terminals called BTCUs (BRIO Terminal Control Units) distributed in the spacecraft. The challenge was to design and develop the BTCU with very small volume, low consumption and low cost. The standard BTCU models are developed and qualified in a configuration dedicated to ATC, while the first flight model will fly on MICROSCOPE for pyro actuations and analogue acquisitions. The BTCU is designed to be easily adaptable to all types of electrical interface needs. Extension of this concept is envisaged for the power conditioning and distribution unit, and a modular PCDU based on the BRIO concept is proposed.

  17. A Design of Dual Broadband Antenna in Mobile Communication System

    Directory of Open Access Journals (Sweden)

    Jianming Zhou

    2015-01-01

    Full Text Available A design of a dual broadband antenna is proposed in this paper; it consists of one low-frequency unit and two high-frequency units. The low-frequency unit consists of a pair of printed dipole elements; the high-frequency unit consists of a pair of printed radiating elements bent at their ends, and the high-frequency and low-frequency units are set on the same dielectric substrate. Adding a parasitic element to the antenna enhances the bandwidth of one band without affecting the other. The high-frequency unit adopts a gap-coupled microstrip line feeding method in order to obtain sufficient bandwidth. Measurements of the dual broadband antenna show that the low-frequency band covers 800 MHz to 980 MHz, a fractional bandwidth of 20%, while the high-frequency band covers 1540 MHz to 2860 MHz, a fractional bandwidth of 60%, so the designed antenna can satisfy the frequency requirements of 2G/3G/LTE (4G) communication systems.

  18. Reliability Lessons Learned From GPU Experience With The Titan Supercomputer at Oak Ridge Leadership Computing Facility

    Energy Technology Data Exchange (ETDEWEB)

    Gallarno, George [Christian Brothers University; Rogers, James H [ORNL; Maxwell, Don E [ORNL

    2015-01-01

    The high computational capability of graphics processing units (GPUs) is enabling and driving the scientific discovery process at large scale. The world's second fastest supercomputer for open science, Titan, has more than 18,000 GPUs that computational scientists use to perform scientific simulations and data analysis. Understanding of GPU reliability characteristics, however, is still in its nascent stage since GPUs have only recently been deployed at large scale. This paper presents a detailed study of GPU errors and their impact on system operations and applications, describing experiences with the 18,688 GPUs on the Titan supercomputer as well as lessons learned in the process of efficient operation of GPUs at scale. These experiences are helpful to HPC sites which already have large-scale GPU clusters or plan to deploy GPUs in the future.

  19. [Architecture, budget and dignity].

    Science.gov (United States)

    Morel, Etienne

    2012-01-01

    Drawing on its dynamic strengths, a psychiatric unit develops various projects and care techniques. In this framework, the institution's director must make a number of choices with regard to architecture. Why renovate the psychiatry building? What financial investments are required? What criteria should be followed? And what if the main argument were based on respect for the patient's dignity?

  20. Dual-view inverted selective plane illumination microscopy (diSPIM) with improved background rejection for accurate 3D digital pathology

    Science.gov (United States)

    Hu, Bihe; Bolus, Daniel; Brown, J. Quincy

    2018-02-01

    Current gold-standard histopathology for cancerous biopsies is destructive, time consuming, and limited to 2D slices, which do not faithfully represent true 3D tumor micro-morphology. Light sheet microscopy has emerged as a powerful tool for 3D imaging of cancer biospecimens. Here, we utilize the versatile dual-view inverted selective plane illumination microscope (diSPIM) to render digital histological images of cancer biopsies. The dual-view architecture enables more isotropic resolution in X, Y, and Z, and different imaging modes, such as electronic confocal slit detection (eCSD) or structured illumination (SI), can be used to improve the degraded image quality caused by the background signal of large, scattering samples. To obtain traditional H&E-like images, we used DRAQ5 and eosin (D&E) staining, with 488 nm and 647 nm laser illumination, and multi-band filter sets. Phantom beads and a D&E-stained buccal cell sample were used to verify our dual-view method. We also show that, via dual-view imaging and deconvolution, more isotropic resolution is achieved for an optically cleared human prostate sample, providing more accurate quantitation of 3D tumor architecture than was possible with single-view SPIM methods. We demonstrate that the optimized diSPIM delivers more precise analysis of 3D cancer microarchitecture in human prostate biopsy than simpler light sheet microscopy arrangements.

  1. Scalable fast multipole methods for vortex element methods

    KAUST Repository

    Hu, Qi; Gumerov, Nail A.; Yokota, Rio; Barba, Lorena A.; Duraiswami, Ramani

    2012-01-01

    work for a scalar heterogeneous FMM algorithm, we develop a new FMM-based vortex method capable of simulating general flows including turbulence on heterogeneous architectures, which distributes the work between multi-core CPUs and GPUs to best utilize

  2. Numerical simulation of air hypersonic flows with equilibrium chemical reactions

    Science.gov (United States)

    Emelyanov, Vladislav; Karpenko, Anton; Volkov, Konstantin

    2018-05-01

    The finite volume method is applied to solve the unsteady three-dimensional compressible Navier-Stokes equations on unstructured meshes. High-temperature gas effects altering the aerodynamics of vehicles are taken into account. The possibilities of using graphics processing units (GPUs) for the simulation of hypersonic flows are demonstrated. Solutions of some test cases on GPUs are reported, and a comparison between computational results of equilibrium chemically reacting and perfect air flowfields is performed. The speedup of the solution on GPUs with respect to the solution on central processing units (CPUs) is reported. The results obtained provide a promising perspective for designing a GPU-based software framework for practical applications.

  3. Simulating spin models on GPU

    Science.gov (United States)

    Weigel, Martin

    2011-09-01

    Over the last couple of years it has been realized that the vast computational power of graphics processing units (GPUs) could be harvested for purposes other than the video game industry. This power, which at least nominally exceeds that of current CPUs by large factors, results from the relative simplicity of the GPU architectures as compared to CPUs, combined with a large number of parallel processing units on a single chip. To benefit from this setup for general computing purposes, the problems at hand need to be prepared in a way to profit from the inherent parallelism and hierarchical structure of memory accesses. In this contribution I discuss the performance potential for simulating spin models, such as the Ising model, on GPU as compared to conventional simulations on CPU.
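    As a concrete illustration of preparing such a problem for the GPU, the sketch below gives a checkerboard Metropolis sweep for the 2D Ising model in CUDA (a minimal example of my own, not the author's code; the inline xorshift hash is a stand-in for a proper generator such as cuRAND). Sites of one parity can be updated in parallel because all of their neighbours carry the opposite parity.

        // Checkerboard Metropolis sweep for the 2D Ising model (J = 1, periodic
        // boundaries). One kernel launch updates all sites of a given parity.
        __device__ float urand(unsigned s)
        {
            s ^= s << 13; s ^= s >> 17; s ^= s << 5;   // xorshift hash (RNG stand-in)
            return (s & 0xFFFFFF) / 16777216.0f;
        }

        __global__ void isingSweep(int* spin, int L, float beta,
                                   int parity, unsigned seed)
        {
            int x = blockIdx.x * blockDim.x + threadIdx.x;
            int y = blockIdx.y * blockDim.y + threadIdx.y;
            if (x >= L || y >= L || ((x + y) & 1) != parity) return;
            int s  = spin[y * L + x];
            int nb = spin[y * L + (x + 1) % L] + spin[y * L + (x + L - 1) % L]
                   + spin[((y + 1) % L) * L + x] + spin[((y + L - 1) % L) * L + x];
            float dE = 2.0f * s * nb;                  // energy cost of flipping
            if (dE <= 0.0f || urand(seed ^ (unsigned)(y * L + x)) < expf(-beta * dE))
                spin[y * L + x] = -s;
        }

    A full sweep launches the kernel twice, once per parity, which is exactly the kind of regular, data-parallel access pattern that maps well onto the GPU memory hierarchy.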

  4. R-GPU : A reconfigurable GPU architecture

    NARCIS (Netherlands)

    van den Braak, G.J.; Corporaal, H.

    2016-01-01

    Over the last decade, Graphics Processing Unit (GPU) architectures have evolved from a fixed-function graphics pipeline to a programmable, energy-efficient compute accelerator for massively parallel applications. The compute power arises from the GPU's Single Instruction/Multiple Threads

  5. Dual Entwining Structures and Dual Entwined Modules

    OpenAIRE

    Abuhlail, Jawad Y.

    2003-01-01

    In this note we introduce and investigate the concepts of dual entwining structures and dual entwined modules. This generalizes the concepts of dual Doi-Koppinen structures and dual Doi-Koppinen modules introduced (in the infinite case over rings) by the author in his dissertation.

  6. Reconfigurable dual-band metamaterial antenna based on liquid crystals

    Science.gov (United States)

    Che, Bang-Jun; Meng, Fan-Yi; Lyu, Yue-Long; Wu, Qun

    2018-05-01

    In this paper, a novel reconfigurable dual-band metamaterial antenna with a continuous beam electrically steered from backward to forward directions is proposed, employing a liquid crystal (LC)-loaded tunable extended composite right-/left-handed (E-CRLH) transmission line (TL). The frequency-dependent property of the E-CRLH TL is analyzed, and a compact unit cell based on the nematic LC is proposed to realize the tunable dual-band characteristics. The phase constant of the proposed unit cell can be continuously tuned from negative to positive values in two operating bands by changing the bias voltage of the loaded LC material. The resulting dual-band fixed-frequency beam steering property has been predicted by numerical simulations and experimentally verified. The measured results show that the fabricated reconfigurable antenna features electrically controlled continuous beam steering from −16° (backward) to +13° (forward) at 7.2 GHz and from −9° to +17° at 9.4 GHz. This beam steering range is competitive with previously reported single-band reconfigurable antennas. Besides, the measured and simulated results of the proposed reconfigurable dual-band metamaterial antenna are in good agreement.

  7. Transportable GPU (General Processor Units) chip set technology for standard computer architectures

    Science.gov (United States)

    Fosdick, R. E.; Denison, H. C.

    1982-11-01

    The USAF-developed GPU Chip Set has been utilized by Tracor to implement both USAF and Navy Standard 16-Bit Airborne Computer Architectures. Both configurations are currently being delivered into DOD full-scale development programs. Leadless Hermetic Chip Carrier packaging has facilitated implementation of both architectures on single 4½ x 5 substrates. The CMOS and CMOS/SOS implementations of the GPU Chip Set have allowed both CPU implementations to use less than 3 watts of power each. Recent efforts by Tracor for the USAF have included the definition of a next-generation GPU Chip Set that will retain the application-proven architecture of the current chip set while offering the added cost advantages of transportability across ISO-CMOS and CMOS/SOS processes and across numerous semiconductor manufacturers using a newly-defined set of common design rules. The Enhanced GPU Chip Set will increase speed by an approximate factor of 3 while significantly reducing chip counts and costs of standard CPU implementations.

  8. Dual Kidney Allocation Score: A Novel Algorithm Utilizing Expanded Donor Criteria for the Allocation of Dual Kidneys in Adults.

    Science.gov (United States)

    Johnson, Adam P; Price, Thea P; Lieby, Benjamin; Doria, Cataldo

    2016-09-08

    BACKGROUND Dual kidney transplantation (DKT) of expanded-criteria donors is a cost-intensive procedure that aims to increase the pool of available deceased organ donors and has demonstrated equivalent outcomes to expanded-criteria single kidney transplantation (eSKT). The objective of this study was to develop an allocation score based on predicted graft survival from historical dual and single kidney donors. MATERIAL AND METHODS We analyzed United Network for Organ Sharing (UNOS) data for 1547 DKT and 26 381 eSKT performed between January 1994 and September 2013. We utilized multivariable Cox regression to identify variables independently associated with graft survival in dual and single kidney transplantations. We then derived a weighted multivariable product score from calculated hazard ratios to model the benefit of transplantation as dual kidneys. RESULTS Of 36 donor variables known at the time of listing, 13 were significantly associated with graft survival. The derived dual allocation score demonstrated good internal validity with strong correlation to improved survival in dual kidney transplants. Donors with scores less than 2.1 transplanted as dual kidneys had a worsened median survival of 594 days (24%, p-value 0.031) and donors with scores greater than 3.9 had improved median survival of 1107 days (71%, p-value 0.002). There were 17 733 eSKT (67%) and 1051 DKT (67%) with scores in between these values and no differences in survival (p-values 0.676 and 0.185). CONCLUSIONS We have derived a dual kidney allocation score (DKAS) with good internal validity. Future prospective studies will be required to demonstrate external validity, but this score may help to standardize organ allocation for dual kidney transplantation.

  9. United abominations: Density functional studies of heavy metal chemistry

    Energy Technology Data Exchange (ETDEWEB)

    Schoendorff, George [Iowa State Univ., Ames, IA (United States)

    2012-01-01

    Carbonyl and nitrile addition to uranyl (UO2^2+) are studied. The competition between nitrile and water ligands in the formation of uranyl complexes is investigated. The possibility of hypercoordinated uranyl with acetone ligands is examined. Uranyl is studied with diacetone alcohol ligands as a means to explain the apparent hypercoordination. A discussion of the formation of mesityl oxide ligands is also included. A joint theory/experimental study of reactions of zwitterionic boratoiridium(I) complexes with oxazoline-based scorpionate ligands is reported. A computational study of the catalytic hydroamination/cyclization of aminoalkenes with zirconium-based catalysts was performed. Techniques are surveyed for programming graphics processing units (GPUs) using Fortran.

  10. Improved Power Flow Computation Using Graphics Processors

    OpenAIRE

    Marin , Manuel

    2015-01-01

    This thesis addresses the utilization of Graphics Processing Units (GPUs) to improve the Power Flow (PF) analysis of modern power systems. GPUs are powerful vector co-processors that have been very useful in the acceleration of several computational intensive applications. PF analysis is the steady-state analysis of AC power networks and is widely used for several tasks involved in system operation and planning. Currently, GPUs are challenged by applications exhibiting an irregular computatio...

  11. Floating point only SIMD instruction set architecture including compare, select, Boolean, and alignment operations

    Science.gov (United States)

    Gschwind, Michael K [Chappaqua, NY

    2011-03-01

    Mechanisms for implementing a floating point only single instruction multiple data instruction set architecture are provided. A processor is provided that comprises an issue unit, an execution unit coupled to the issue unit, and a vector register file coupled to the execution unit. The execution unit has logic that implements a floating point (FP) only single instruction multiple data (SIMD) instruction set architecture (ISA). The floating point vector registers of the vector register file store both scalar and floating point values as vectors having a plurality of vector elements. The processor may be part of a data processing system.

  12. Architectural Synthesis of Flow-Based Microfluidic Large-Scale Integration Biochips

    DEFF Research Database (Denmark)

    Minhass, Wajid Hassan; Pop, Paul; Madsen, Jan

    2012-01-01

    We propose a top-down architectural synthesis methodology for flow-based biochips. Starting from a given biochemical application and a microfluidic component library, we are interested in synthesizing a biochip architecture, i.e., performing component allocation from the library based on the biochemical… By combining several microvalves, more complex units, such as micropumps, switches, mixers, and multiplexers, can be built. The manufacturing technology, soft lithography, used for flow-based biochips is advancing faster than Moore's law, resulting in increased architectural complexity. However… by synthesizing architectures for real-life applications as well as synthetic benchmarks.

  13. Enhanced Flexibility and Reusability through State Machine-Based Architectures for Multisensor Intelligent Robotics

    Directory of Open Access Journals (Sweden)

    Héctor Herrero

    2017-05-01

    Full Text Available This paper presents a state machine-based architecture which enhances the flexibility and reusability of industrial robots, more specifically dual-arm multisensor robots. The proposed architecture, in addition to allowing absolute control of the execution, eases the programming of new applications by increasing the reusability of the developed modules. Through an easy-to-use graphical user interface, operators are able to create, modify, reuse and maintain industrial processes, increasing the flexibility of the cell. Moreover, the proposed approach is applied in a real use case in order to demonstrate its capabilities and feasibility in industrial environments. A comparative analysis is presented evaluating the proposed approach against traditional robot programming techniques.

  14. A dual-route approach to orthographic processing.

    Science.gov (United States)

    Grainger, Jonathan; Ziegler, Johannes C

    2011-01-01

    In the present theoretical note we examine how different learning constraints, thought to be involved in optimizing the mapping of print to meaning during reading acquisition, might shape the nature of the orthographic code involved in skilled reading. On the one hand, optimization is hypothesized to involve selecting combinations of letters that are the most informative with respect to word identity (diagnosticity constraint), and on the other hand to involve the detection of letter combinations that correspond to pre-existing sublexical phonological and morphological representations (chunking constraint). These two constraints give rise to two different kinds of prelexical orthographic code, a coarse-grained and a fine-grained code, associated with the two routes of a dual-route architecture. Processing along the coarse-grained route optimizes fast access to semantics by using minimal subsets of letters that maximize information with respect to word identity, while coding for approximate within-word letter position independently of letter contiguity. Processing along the fine-grained route, on the other hand, is sensitive to the precise ordering of letters, as well as to position with respect to word beginnings and endings. This enables the chunking of frequently co-occurring contiguous letter combinations that form relevant units for morpho-orthographic processing (prefixes and suffixes) and for the sublexical translation of print to sound (multi-letter graphemes).

  16. Optimization Techniques for Dimensionally Truncated Sparse Grids on Heterogeneous Systems

    KAUST Repository

    Deftu, A.

    2013-02-01

    Given the existing heterogeneous processor landscape dominated by CPUs and GPUs, topics such as programming productivity and performance portability have become increasingly important. In this context, an important question is how we can develop optimization strategies that cover both CPUs and GPUs. We answer this for fastsg, a library that provides functionality for handling high-dimensional functions efficiently. As it can be employed for compressing and decompressing large-scale simulation data, it finds itself at the core of a computational steering application which serves us as a test case. We describe our experience with implementing fastsg's time-critical routines for Intel CPUs and Nvidia Fermi GPUs. We show the differences and especially the similarities between our optimization strategies for the two architectures. With regard to our test case, for which achieving high speedups is a "must" for real-time visualization, we report a speedup of up to 6.2x compared to the state-of-the-art implementation of the sparse grid technique for GPUs. © 2013 IEEE.

  17. Spatial Distribution of Iron Within the Normal Human Liver Using Dual-Source Dual-Energy CT Imaging.

    Science.gov (United States)

    Abadia, Andres F; Grant, Katharine L; Carey, Kathleen E; Bolch, Wesley E; Morin, Richard L

    2017-11-01

    Explore the potential of dual-source dual-energy (DSDE) computed tomography (CT) to retrospectively analyze the uniformity of iron distribution and establish iron concentration ranges and distribution patterns found in healthy livers. Ten mixtures consisting of an iron nitrate solution and deionized water were prepared in test tubes and scanned using a DSDE 128-slice CT system. Iron images were derived from a 3-material decomposition algorithm (optimized for the quantification of iron). A conversion factor (mg Fe/mL per Hounsfield unit) was calculated from this phantom study as the quotient of known tube concentrations and their corresponding CT values. Retrospective analysis was performed of patients who had undergone DSDE imaging for renal stones. Thirty-seven patients with normal liver function were randomly selected (mean age, 52.5 years). The examinations were processed for iron concentration. Multiple regions of interest were analyzed, and iron concentration (mg Fe/mL) and distribution was reported. The mean conversion factor obtained from the phantom study was 0.15 mg Fe/mL per Hounsfield unit. Whole-liver mean iron concentrations yielded a range of 0.0 to 2.91 mg Fe/mL, with 94.6% (35/37) of the patients exhibiting mean concentrations below 1.0 mg Fe/mL. The most important finding was that iron concentration was not uniform and patients exhibited regionally high concentrations (36/37). These regions of higher concentration were observed to be dominant in the middle-to-upper part of the liver (75%), medially (72.2%), and anteriorly (83.3%). Dual-source dual-energy CT can be used to assess the uniformity of iron distribution in healthy subjects. Applying similar techniques to unhealthy livers, future research may focus on the impact of hepatic iron content and distribution for noninvasive assessment in diseased subjects.

  18. Dual pressurized light water reactor producing 2000 MWe

    Energy Technology Data Exchange (ETDEWEB)

    NONE

    2010-10-15

    The Dual Unit Optimizer 2000 MWe (DUO2000) is proposed as a new design concept for a large nuclear power plant. DUO is being designed to meet the economic and safety challenges facing the 21st century green and sustainable energy industry. DUO2000 has two nuclear steam supply systems (NSSS) of the Unit Nuclear Optimizer (UNO) pressurized water reactor (PWR) in a single containment so as to double the capacity of the plant. UNO is anchored to the Optimized Power Reactor 1000 MWe (OPR1000) of the Korea Hydro and Nuclear Power Co., Ltd. The concept of DUO can be extended to any number of PWRs or pressurized heavy water reactors (PHWRs), or even boiling water reactors (BWRs). Once proven in water reactors, the technology may even be expanded to gas cooled, liquid metal cooled, and molten salt cooled reactors. In particular, since it is required that small and medium sized reactors (SMRs) be built as units, the concept of DUO2000 will apply to SMRs as well. With its in-vessel retention severe accident management strategy, DUO can not only put the single most querulous PWR safety issue to rest, but also pave the way to promising large power capacity, dispensing with huge redesign costs for Generation III+ nuclear systems. The strengths of DUO2000 include reducing the cost of construction by decreasing the number of containment buildings from two to one, minimizing the cost of NSSS and control systems by sharing between the dual units, and lessening the maintenance cost by uniting the NSSS. The technology can further be extended to coupling modular reactors as dual, triple, or quadruple units to increase their economics, thus accelerating the commercialization as well as the customization of SMRs. (Author)

  19. Dual PECCS: a cognitive system for conceptual representation and categorization

    Science.gov (United States)

    Lieto, Antonio; Radicioni, Daniele P.; Rho, Valentina

    2017-03-01

    In this article we present an advanced version of Dual-PECCS, a cognitively-inspired knowledge representation and reasoning system aimed at extending the capabilities of artificial systems in conceptual categorization tasks. It combines different sorts of common-sense categorization (prototypical and exemplar-based categorization) with standard monotonic categorization procedures. These different types of inferential procedures are reconciled according to the tenets of the dual process theory of reasoning. From a representational perspective, the system relies on the hypothesis of conceptual structures represented as heterogeneous proxytypes. Dual-PECCS has been experimentally assessed in a task of conceptual categorization where a target concept illustrated by a simple common-sense linguistic description had to be identified by resorting to a mix of categorization strategies, and its output has been compared to human responses. The obtained results suggest that our approach can be beneficial for improving the representational and reasoning conceptual capabilities of standard cognitive artificial systems, and, in addition, that it may plausibly be applied to different general computational models of cognition. The current version of the system extends our previous work in that Dual-PECCS is now integrated and tested in two cognitive architectures, ACT-R and CLARION, implementing different assumptions on the underlying invariant structures governing human cognition. Such integration allowed us to extend our previous evaluation.

  20. Architecture Level Safety Analyses for Safety-Critical Systems

    Directory of Open Access Journals (Sweden)

    K. S. Kushal

    2017-01-01

    Full Text Available The dependency of complex embedded safety-critical systems across the avionics and aerospace domains on their underlying software and hardware components has gradually increased over time. Such systems are developed based on a complex integrated architecture, which is modular in nature. Engineering practices assured by system safety standards to manage failure, faulty, and unsafe operational conditions are very much necessary. System safety analyses involve the analysis of the complex software architecture of the system, a major aspect leading to fatal consequences in the behaviour of safety-critical systems, and provide high reliability and dependability factors during their development. In this paper, we propose an architecture fault modeling and safety analysis approach that aids in identifying and eliminating design flaws. The formal foundations of the SAE Architecture Analysis & Design Language (AADL) augmented with the Error Model Annex (EMV) are discussed. Fault propagation, failure behaviour, and the composite behaviour of design flaws/failures are considered for architecture safety analysis. The proposed approach is validated by implementing the Speed Control Unit of a Power-Boat Autopilot (PBA) system. The Error Model Annex (EMV) is guided by the pattern of consideration and inclusion of probable failure scenarios and the propagation of fault conditions in the Speed Control Unit of the Power-Boat Autopilot (PBA). This helps in validating the system architecture through the detection of error events in the model and their impact in the operational environment. It also provides insight into the certification impact that these exceptional conditions pose at various criticality and design assurance levels, and its implications for verifying and validating designs.

  1. Multi-GPU accelerated three-dimensional FDTD method for electromagnetic simulation.

    Science.gov (United States)

    Nagaoka, Tomoaki; Watanabe, Soichi

    2011-01-01

    Numerical simulation with a numerical human model using the finite-difference time-domain (FDTD) method has recently been performed in a number of fields in biomedical engineering. To improve the method's calculation speed and realize large-scale computing with the numerical human model, we adapted three-dimensional FDTD code to a multi-GPU environment using the Compute Unified Device Architecture (CUDA). In this study, we used the NVIDIA Tesla C2070 as the GPGPU board. The performance of multiple GPUs was evaluated in comparison with that of a single GPU and a vector supercomputer. The calculation speed with four GPUs was approximately 3.5 times faster than with a single GPU, and was slightly (approx. 1.3 times) slower than with the supercomputer. The calculation speed of the three-dimensional FDTD method using GPUs improves significantly as the number of GPUs increases.
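
    For orientation, the building block of such a code is the leapfrog Yee update; a minimal one-dimensional CUDA sketch follows (coefficients and names are illustrative assumptions, not the paper's code). In a multi-GPU run the grid is typically split along one axis, with each GPU exchanging a one-layer halo with its neighbour between the H and E updates, e.g. via cudaMemcpyPeerAsync.

        // One-dimensional Yee leapfrog updates (the 3D scheme applies the same
        // pattern per field component). ch and ce fold in dt, dx and the
        // material constants; boundary cells belong to the halo exchange.
        __global__ void updateH(float* hy, const float* ez, int n, float ch)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n - 1) hy[i] += ch * (ez[i + 1] - ez[i]);
        }

        __global__ void updateE(float* ez, const float* hy, int n, float ce)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i > 0 && i < n) ez[i] += ce * (hy[i] - hy[i - 1]);
        }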

  2. Evaluation of an IP Fabric network architecture for CERN's data center

    CERN Document Server

    AUTHOR|(CDS)2156318; Barceló Ordinas, José M.

    CERN has a large-scale data center with over 11500 servers used to analyze massive amounts of data acquired from the physics experiments and to provide IT services to workers. Its current network architecture is based on the classic three-tier design and it uses both IPv4 and IPv6. Between the access and aggregation layers the traffic is switched in Layer 2, while between aggregation and core it is routed using dual-stack OSPF. A new architecture is needed to increase redundancy and to provide virtual machine mobility and traffic isolation. The state-of-the-art architecture IP Fabric with EVPN is evaluated as a possible solution. The evaluation comprises a study of different features and options, including BGP table scalability and autonomous system number distributions. The proposed solution contains eBGP as the routing protocol, a route control policy, fast convergence mechanisms and an EVPN overlay with iBGP routing and VXLAN encapsulation. The solution is tested in the lab with the network equipment curre...

  3. Development and Evaluation of Stereographic Display for Lung Cancer Screening

    Science.gov (United States)

    2008-12-01

    Application of GPUs – With the evolution of commodity graphics processing units (GPUs) for accelerating games on personal computers, over the… units, which are designed for rendering computer games, are readily available and can be programmed to perform the kinds of real-time calculations…

  4. Performance of Point and Range Queries for In-memory Databases using Radix Trees on GPUs

    Energy Technology Data Exchange (ETDEWEB)

    Alam, Maksudul [ORNL; Yoginath, Srikanth B [ORNL; Perumalla, Kalyan S [ORNL

    2016-01-01

    In in-memory database systems augmented by hardware accelerators, accelerating the index searching operations can greatly increase the runtime performance of database queries. Recently, adaptive radix trees (ART) have been shown to provide very fast index search implementation on the CPU. Here, we focus on an accelerator-based implementation of ART. We present a detailed performance study of our GPU-based adaptive radix tree (GRT) implementation over a variety of key distributions, synthetic benchmarks, and actual keys from music and book data sets. The performance is also compared with other index-searching schemes on the GPU. GRT on modern GPUs achieves some of the highest rates of index searches reported in the literature. For point queries, a throughput of up to 106 million and 130 million lookups per second is achieved for sparse and dense keys, respectively. For range queries, GRT yields 600 million and 1000 million lookups per second for sparse and dense keys, respectively, on a large dataset of 64 million 32-bit keys.
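
    The GRT kernels themselves are not reproduced here; as a stand-in, the following CUDA sketch shows the same one-thread-per-query batching pattern applied to a simpler index, a lower_bound binary search over a sorted, GPU-resident key array (all names are illustrative).

        // Batched point lookups: one thread per query performs a lower_bound
        // binary search over a sorted array of 32-bit keys held in GPU memory.
        __global__ void lookup(const unsigned* __restrict__ keys, int n,
                               const unsigned* __restrict__ queries,
                               int* __restrict__ pos, int nq)
        {
            int q = blockIdx.x * blockDim.x + threadIdx.x;
            if (q >= nq) return;
            unsigned key = queries[q];
            int lo = 0, hi = n;
            while (lo < hi) {
                int mid = (lo + hi) >> 1;
                if (keys[mid] < key) lo = mid + 1; else hi = mid;
            }
            pos[q] = (lo < n && keys[lo] == key) ? lo : -1;   // -1 = not found
        }

    Launching thousands of queries per kernel call is what amortizes memory latency; a radix tree such as ART replaces the binary search with pointer chasing over fixed-fanout nodes but keeps the same batching structure.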

  5. A dual pressurized water reactor producing 2000 MWe

    International Nuclear Information System (INIS)

    Kang, K. M.; Suh, K. Y.

    2010-01-01

    The Dual Unit Optimizer 2000 MWe (DUO2000) is proposed as a new design concept for a large nuclear power plant. DUO is being designed to meet the economic and safety challenges facing the 21st century green and sustainable energy industry. DUO2000 has two nuclear steam supply systems (NSSSs) of the Unit Nuclear Optimizer (UNO) pressurized water reactor (PWR) in a single containment so as to double the capacity of the plant. UNO is anchored to the Optimized Power Reactor 1000 MWe (OPR1000). The concept of DUO can be extended to any number of PWRs or pressurized heavy water reactors (PHWRs), or even boiling water reactors (BWRs). Once proven in water reactors, the technology may even be expanded to gas cooled, liquid metal cooled, and molten salt cooled reactors. In particular, since it is required that the Small and Medium sized Reactors (SMRs) be built as units, the concept of DUO2000 will apply to SMRs as well. With its in-vessel retention external reactor vessel cooling (IVR-ERVC) severe accident management strategy, DUO can not only put the single most querulous PWR safety issue to rest, but also pave the way to promising large power capacity, dispensing with huge redesign costs for Generation III+ nuclear systems. Also, the strengths of DUO2000 include reducing the cost of construction by decreasing the number of containment buildings from two to one, minimizing the cost of NSSS and control systems by sharing between the dual units, and lessening the maintenance cost by uniting the NSSS. Two prototypes are presented for the DUO2000, and their respective advantages and drawbacks are considered, the strengths including, but not necessarily limited to, those listed above. The Coolant Unit Branching Apparatus (CUBA) is proposed

  6. Multibeam GPU Transient Pipeline for the Medicina BEST-2 Array

    Science.gov (United States)

    Magro, A.; Hickish, J.; Adami, K. Z.

    2013-09-01

    Radio transient discovery using next generation radio telescopes will pose several digital signal processing and data transfer challenges, requiring specialized high-performance backends. Several accelerator technologies are being considered as prototyping platforms, including Graphics Processing Units (GPUs). In this paper we present a real-time pipeline prototype capable of processing multiple beams concurrently, performing Radio Frequency Interference (RFI) rejection through thresholding, correcting for the delay in signal arrival times across the frequency band using brute-force dedispersion, event detection and clustering, and finally candidate filtering, with the capability of persisting data buffers containing interesting signals to disk. This setup was deployed at the BEST-2 SKA pathfinder in Medicina, Italy, where several benchmarks and test observations of astrophysical transients were conducted. These tests show that on the deployed hardware eight 20 MHz beams can be processed simultaneously for 640 Dispersion Measure (DM) values. Furthermore, the clustering and candidate filtering algorithms employed prove to be good candidates for online event detection techniques. The number of beams which can be processed increases proportionally to the number of servers deployed and number of GPUs, making it a viable architecture for current and future radio telescopes.
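
    The heart of such a pipeline is the brute-force dedispersion step; a minimal CUDA sketch follows (buffer layout and the precomputed delay table are illustrative assumptions, not the deployed code). Each thread accumulates one (DM, time) output sample across all frequency channels.

        // Brute-force dedispersion: each thread sums one (DM, time) sample over
        // all channels, shifting each channel by its precomputed dispersive
        // delay. The host guarantees t + delay < nsamp for every table entry.
        __global__ void dedisperse(const float* __restrict__ in,    // [nchan][nsamp]
                                   float* __restrict__ out,         // [ndm][nout]
                                   const int* __restrict__ delay,   // [ndm][nchan]
                                   int nchan, int nsamp, int nout)
        {
            int t  = blockIdx.x * blockDim.x + threadIdx.x;
            int dm = blockIdx.y;                       // one DM trial per block row
            if (t >= nout) return;
            float acc = 0.0f;
            for (int c = 0; c < nchan; ++c)
                acc += in[(size_t)c * nsamp + t + delay[dm * nchan + c]];
            out[(size_t)dm * nout + t] = acc;          // thresholded downstream
        }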

  7. A Performance/Cost Evaluation for a GPU-Based Drug Discovery Application on Volunteer Computing

    Science.gov (United States)

    Guerrero, Ginés D.; Imbernón, Baldomero; García, José M.

    2014-01-01

    Bioinformatics is an interdisciplinary research field that develops tools for the analysis of large biological databases, and, thus, the use of high performance computing (HPC) platforms is mandatory for the generation of useful biological knowledge. The latest generation of graphics processing units (GPUs) has democratized the use of HPC as they push desktop computers to cluster-level performance. Many applications within this field have been developed to leverage these powerful and low-cost architectures. However, these applications still need to scale to larger GPU-based systems to enable remarkable advances in the fields of healthcare, drug discovery, genome research, etc. The inclusion of GPUs in HPC systems exacerbates power and temperature issues, increasing the total cost of ownership (TCO). This paper explores the benefits of volunteer computing to scale bioinformatics applications as an alternative to owning large GPU-based local infrastructures. We use as a benchmark a GPU-based drug discovery application called BINDSURF, whose computational requirements go beyond a single desktop machine. Volunteer computing is presented as a cheap and valid HPC system for those bioinformatics applications that need to process huge amounts of data and where the response time is not a critical factor. PMID:25025055

  8. Evolutionary dynamics of protein domain architecture in plants

    Directory of Open Access Journals (Sweden)

    Zhang Xue-Cheng

    2012-01-01

    Full Text Available Abstract Background Protein domains are the structural, functional and evolutionary units of the protein. Protein domain architectures are the linear arrangements of domain(s) in individual proteins. Although the evolutionary history of protein domain architecture has been extensively studied in microorganisms, the evolutionary dynamics of domain architecture in the plant kingdom remains largely undefined. To address this question, we analyzed the lineage-based protein domain architecture content in 14 completed green plant genomes. Results Our analyses show that all 14 plant genomes maintain similar distributions of species-specific, single-domain, and multi-domain architectures. Approximately 65% of plant domain architectures are universally present in all plant lineages, while the remaining architectures are lineage-specific. Clear examples are seen of both the loss and gain of specific protein architectures in higher plants. There has been a dynamic, lineage-wise expansion of domain architectures during plant evolution. The data suggest that this expansion can be largely explained by changes in nuclear ploidy resulting from rounds of whole genome duplications. Indeed, there has been a decrease in the number of unique domain architectures when the genomes were normalized into a presumed ancestral genome that has not undergone whole genome duplications. Conclusions Our data show the conservation of universal domain architectures in all available plant genomes, indicating the presence of an evolutionarily conserved, core set of protein components. However, the occurrence of lineage-specific domain architectures indicates that domain architecture diversity has been maintained beyond these core components in plant genomes. Although several features of genome-wide domain architecture content are conserved in plants, the data clearly demonstrate lineage-wise, progressive changes and expansions of individual protein domain architectures, reinforcing

  9. Progress in a novel architecture for high performance processing

    Science.gov (United States)

    Zhang, Zhiwei; Liu, Meng; Liu, Zijun; Du, Xueliang; Xie, Shaolin; Ma, Hong; Ding, Guangxin; Ren, Weili; Zhou, Fabiao; Sun, Wenqin; Wang, Huijuan; Wang, Donglin

    2018-04-01

    High performance processing (HPP) is an innovative architecture which targets high performance computing with excellent power efficiency and computing performance. It is suitable for data-intensive applications like supercomputing, machine learning and wireless communication. An example chip with four application-specific integrated circuit (ASIC) cores, the first generation of HPP cores, has been taped out successfully under the Taiwan Semiconductor Manufacturing Company (TSMC) 40 nm low power process. The innovative architecture shows great energy efficiency over the traditional central processing unit (CPU) and general-purpose computing on graphics processing units (GPGPU). Compared with MaPU, HPP has made great improvements in architecture. A chip with 32 HPP cores is being developed under the TSMC 16 nm FinFET Compact (FFC) technology process and is planned for commercial use. The peak performance of this chip can reach 4.3 teraFLOPS (TFLOPS) and its power efficiency is up to 89.5 gigaFLOPS per watt (GFLOPS/W).

  10. A novel control framework for nonlinear time-delayed dual-master/single-slave teleoperation.

    Science.gov (United States)

    Ghorbanian, A; Rezaei, S M; Khoogar, A R; Zareinejad, M; Baghestan, K

    2013-03-01

    A novel trilateral control architecture for dual-master/single-slave teleoperation is proposed in this paper. This framework has been used in surgical training and rehabilitation applications. In this structure, the slave motion is controlled by a weighted summation of the signals transmitted by the operators, reflecting task control authority through dominance factors. The nonlinear dynamics of the telemanipulators are considered, an issue disregarded in previous studies in this field. A bounded, variable time delay affecting the signals transmitted over the communication channels is considered. Two types of controllers are offered, and an appropriate stability analysis for each controller is demonstrated. The first controller is proportional with dissipative gains (P+d); the second is proportional and derivative with dissipative gains (PD+d). In both cases, the stability of the trilateral control framework is preserved by choosing appropriate controller gains. It is shown that these controllers attempt to coordinate the positions of the telemanipulators in the free-motion condition. The stability of the dual-master/single-slave teleoperation has been proved by an appropriate Lyapunov-like function, and the stability conditions have been studied. In addition, the proposed PD+d control architecture is modified for trilateral teleoperation with internet communication between the telemanipulators, which causes such communication complications as packet loss, data duplication and swapping. A number of experiments have been conducted with various levels of the dominance factor to validate the effectiveness of the new control architecture. Copyright © 2012 ISA. Published by Elsevier Ltd. All rights reserved.

  11. Applying graphics processor units to Monte Carlo dose calculation in radiation therapy

    Directory of Open Access Journals (Sweden)

    Bakhtiari M

    2010-01-01

    Full Text Available We investigate the potential of using a graphics processing unit (GPU) for Monte Carlo (MC)-based radiation dose calculations. The percent depth dose (PDD) of photons in a medium with known absorption and scattering coefficients is computed using an MC simulation running on both a standard CPU and a GPU. We demonstrate that the GPU's capability for massively parallel processing provides a significant acceleration of the MC calculation, and offers a significant advantage for distributed stochastic simulations on a single computer. Harnessing this potential of GPUs will help in the early adoption of MC for routine planning in a clinical environment.
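
    A toy version of such a PDD calculation is sketched below (an illustrative simplification, not the authors' code): photons march along +z in a homogeneous medium, free paths are sampled from the exponential distribution, and a fixed fraction of the remaining weight is deposited at each interaction, with scattering angles ignored.

        #include <curand_kernel.h>

        // Toy percent-depth-dose Monte Carlo: one thread per photon history.
        // mu is the attenuation coefficient; `absorb` is the fraction of the
        // current weight deposited locally at each interaction.
        __global__ void pddKernel(float* dose, int nbins, float dz, float mu,
                                  float absorb, int nphot, unsigned long long seed)
        {
            int id = blockIdx.x * blockDim.x + threadIdx.x;
            if (id >= nphot) return;
            curandState st;
            curand_init(seed, id, 0, &st);
            float z = 0.0f, w = 1.0f;
            while (w > 1e-3f) {
                z += -logf(curand_uniform(&st)) / mu;   // sample free path
                int b = (int)(z / dz);
                if (b >= nbins) break;                  // photon exits the phantom
                atomicAdd(&dose[b], w * absorb);        // local energy deposition
                w *= 1.0f - absorb;                     // surviving weight
            }
        }

    Because every photon history is independent, the kernel scales with the thread count, which is exactly the property that makes MC dose calculation a natural fit for GPUs.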

  12. The UK approach to desalination and nuclear power dual purpose operation

    International Nuclear Information System (INIS)

    Pugh, O.

    1974-01-01

    Nuclear desalination is a particular example of dual purpose operation, and the majority of desalting units installed around the world are operated in this way. A nuclear dual purpose concept has to be very large if present economic reactor designs are utilised; it is this size which has defeated the concept to date. Present fossil-fired dual purpose installations are either in an economic situation (generally low fuel cost) where the inefficiencies introduced by operating away from the optimum water/power ratio are acceptable or, if optimised, the water and power blocks are small enough to allow introduction into the existing utility networks. As part of the United Kingdom Water Resources Board (WRB) report 'Desalination 1972', the Central Electricity Generating Board (CEGB) and the WRB identified nine coastal sites in the United Kingdom where nuclear power stations might be built during the next 15 years. The difficulties of dual purpose operation were recognised in the report, including additional water storage to cover the summer shutdown (turbine overhaul) period, modification of station design to facilitate the extraction of steam, etc. More seriously, as a given power station came to have higher fuelling costs relative to newer stations, the electrical utility might require compensation for continuing to operate it because of the associated desalting plant. Taking account of these factors and the replacement of the lost electricity production from other, maybe less efficient stations on the system

  13. Planning intensive care unit design using computer simulation modeling: optimizing integration of clinical, operational, and architectural requirements.

    Science.gov (United States)

    O'Hara, Susan

    2014-01-01

    Nurses have increasingly been regarded as critical members of the planning team as architects recognize their knowledge and value. But the nurses' role as knowledge experts can be expanded to leading efforts to integrate the clinical, operational, and architectural expertise through simulation modeling. Simulation modeling allows for the optimal merge of multifactorial data to understand the current state of the intensive care unit and predict future states. Nurses can champion the simulation modeling process and reap the benefits of a cost-effective way to test new designs, processes, staffing models, and future programming trends prior to implementation. Simulation modeling is an evidence-based planning approach, a standard, for integrating the sciences with real client data, to offer solutions for improving patient care.

  14. GPU's for event reconstruction in the FairRoot framework

    International Nuclear Information System (INIS)

    Al-Turany, M; Uhlig, F; Karabowicz, R

    2010-01-01

    FairRoot is the simulation and analysis framework used by the CBM and PANDA experiments at FAIR/GSI. The use of graphics processing units (GPUs) for event reconstruction in FairRoot will be presented. The fact that CUDA (NVIDIA's Compute Unified Device Architecture) development tools work alongside the conventional C/C++ compiler makes it possible to mix GPU code with general-purpose code for the host CPU; based on this, some of the reconstruction tasks can be sent to the graphics cards. Moreover, tasks that run on the GPUs can also run in emulation mode on the host CPU, which has the advantage that the same code is used on both CPU and GPU.
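
    The mixing the abstract describes can be as simple as marking a routine so that the same source compiles for both host and device; a minimal sketch follows (not FairRoot's actual code; the names and formula are illustrative).

        // A function usable from both reconstruction paths: compiled for the
        // device when called from a kernel, and for the host when the same
        // task runs in CPU mode.
        __host__ __device__ inline float trackChi2(float residual, float sigma)
        {
            float r = residual / sigma;
            return r * r;
        }

        __global__ void chi2Kernel(const float* res, const float* sig,
                                   float* out, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) out[i] = trackChi2(res[i], sig[i]);   // device path
        }

        // On the host, the very same trackChi2() can be called in a plain loop,
        // so one implementation serves both execution targets.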

  15. Evaluation of digital fault-tolerant architectures for nuclear power plant control systems

    International Nuclear Information System (INIS)

    Battle, R.E.

    1990-01-01

    This paper reports on four fault-tolerant architectures that were evaluated for their potential reliability in service as control systems of nuclear power plants. The reliability analyses showed that human- and software-related common cause failures and single points of failure in the output modules are dominant contributors to system unreliability. The four architectures are triple-modular-redundant, both synchronous and asynchronous, and also dual synchronous and asynchronous. The evaluation includes a review of design features, an analysis of the importance of coverage, and reliability analyses of fault-tolerant systems. Reliability analyses based on data from several industries that have fault-tolerant controllers were used to estimate the mean-time-between-failures of fault-tolerant controllers and to predict those failure modes that may be important in nuclear power plants

  16. A distributed fault tolerant architecture for nuclear reactor control and safety functions

    International Nuclear Information System (INIS)

    Hecht, M.; Agron, J.; Hochhauser, S.

    1989-01-01

    This paper reports on a fault-tolerant architecture, currently under development, that provides tolerance to a broad scope of hardware, software, and communications faults. This architecture relies on widely available commercial operating systems, local area networks, and software standards. Thus, development time is significantly shortened, and modularity allows for continuous and inexpensive system enhancement throughout the expected 20-year life. The fault containment and parallel processing capabilities of computer networks are being exploited to provide a high-performance, high-availability network capable of tolerating a broad scope of hardware, software, and operating system faults. The system can tolerate all but one known (and avoidable) single fault and two known and avoidable dual faults, and will detect all higher-order fault sequences and provide diagnostics to allow for rapid manual recovery.

  17. Nanosatellite and Plug-and-Play Architecture 2 (NAPA 2)

    Science.gov (United States)

    2017-02-28

    Development of a six-unit (6U) format Space Plug-and-play Architecture (SPA) Research Cubesat (SPARC). SPARC-1 (the first and only pursued under this PA) demonstrates a move away from central computers (more capable, more centralized, with bigger wiring bundles) toward the elimination of central computers and the distribution of intelligence in rad-hard systems.

  18. 60-GHz Millimeter-wave Over Fiber with Directly Modulated Dual-mode Laser Diode

    Science.gov (United States)

    Tsai, Cheng-Ting; Lin, Chi-Hsiang; Lin, Chun-Ting; Chi, Yu-Chieh; Lin, Gong-Ru

    2016-01-01

    A directly modulated dual-mode laser diode (DMLD) with third-order intermodulation distortion (IMD3) suppression is proposed for a 60-GHz millimeter-wave over fiber (MMWoF) architecture, enabling new fiber-wireless communication access to cover 4-km single-mode-fiber (SMF) and 3-m wireless 16-QAM OFDM transmissions. By dual-mode injection-locking, the throughput degradation of the DMLD is mitigated with a saturation effect that reduces its threshold, IMD3 power and relative intensity noise to 7.7 mA, −85 dBm and −110.4 dBc/Hz, respectively, providing a large spurious-free dynamic range of 85.8 dB·Hz^(2/3). This operation suppresses the noise floor of the DMLD-carried QPSK-OFDM spectrum by 5 dB. The optical receiving power is optimized to restrict the power fading effect, improving the bit error rate to 1.9 × 10−3 and the receiving power penalty to 1.1 dB. Such a DMLD-based hybrid architecture for 60-GHz MMW fiber-wireless access can directly cover current optical and wireless networks for next-generation indoor and short-reach mobile communications. PMID:27297267
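
    The dB·Hz^(2/3) unit follows from the standard relation for a third-order-distortion-limited link (general RF background, not derived in the paper):

        \mathrm{SFDR} = \tfrac{2}{3}\,(\mathrm{OIP_3} - N_0)

    where OIP_3 is the output third-order intercept point in dBm and N_0 is the output noise floor in dBm/Hz; the factor 2/3 applied to the 1-Hz-referenced noise floor produces the Hz^{2/3} scaling of the dynamic range.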

  19. Some properties of dual and approximate dual of fusion frames

    OpenAIRE

    Arefijamaal, Ali Akbar; Neyshaburi, Fahimeh Arabyani

    2016-01-01

    In this paper we extend the notion of approximate dual to fusion frames and present some approaches to obtain dual and approximate alternate dual fusion frames. Also, we study the stability of dual and approximate alternate dual fusion frames.
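
    For orientation, the classical (non-fusion) notions being generalized are standard frame theory (background, not reproduced from the paper): a Bessel sequence {g_i} is a dual frame of a frame {f_i} for a Hilbert space H if every f in H satisfies

        f = \sum_i \langle f, g_i \rangle f_i

    and an approximate dual reproduces f only up to an operator of small norm; the paper extends both notions to fusion frames, where the building blocks are subspaces rather than single vectors.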

  20. Theoretical Interpretation of Modular Artistic Forms Based on the Example of Contemporarylithuanian Architecture

    Directory of Open Access Journals (Sweden)

    Aušra Černauskienė

    2015-05-01

    Full Text Available The article analyses modular artistic forms that emerge in all scale structures of contemporary architecture. The module, as a standard unit of measure, has been in use since antiquity. It gained even more significance amid the innovative building and computing technologies of the 20th and 21st centuries. Static and fixed perceptions of the module were supplemented with concepts of dynamic and adaptable modular units, such as fractals, parameters and algorithms. Various expressions and trends of modular design appear in the contemporary architecture of Lithuania, where modular forms consist of repetitive spatial and planar elements. Spatial modules such as blocks or flats and planar modular wall elements are a characteristic expression of contemporary architecture in Lithuania.

  1. The continuous process – social production of architecture in Hestnes Ferreira

    Directory of Open Access Journals (Sweden)

    Alexandra Saraiva

    2018-05-01

    Full Text Available This article aims at describing the continuous process in the social production of the architecture of Raúl Hestnes Ferreira. The neorealist ideology defended by his father and followed by the family, as well as the values of freedom, democracy and respect for others, built his personality and his humanistic character. His cross-cultural career in Portugal, Finland and the United States of America was instrumental in building his architectural lexicon. In order to illustrate these influences, four housing works with different conceptual dimensions are presented as laboratory experiments: the José Gomes Ferreira House in Albarraque (1960-1961) and the Twin Housing in Queijas (1967-1973), finishing with two social housing experiences, namely the Fonsecas and Calçada neighborhood (1974-1986), built under the SAAL project in Lisbon, and the João Barbeiro Housing Unit (1978-1987) in Beja. In Hestnes Ferreira's work, the social production of architecture was neither a consequence nor an anticipation, but a fact that, by its simultaneity, defined and characterized his architecture.

  2. New paradigms in internal architecture design and freeform fabrication of tissue engineering porous scaffolds.

    Science.gov (United States)

    Yoo, Dongjin

    2012-07-01

    Advanced additive manufacture (AM) techniques are now being developed to fabricate scaffolds with controlled internal pore architectures in the field of tissue engineering. In general, these techniques use a hybrid method which combines computer-aided design (CAD) with computer-aided manufacturing (CAM) tools to design and fabricate complicated three-dimensional (3D) scaffold models. The mathematical descriptions of micro-architectures along with the macro-structures of the 3D scaffold models are limited by current CAD technologies as well as by the difficulty of transferring the designed digital models to standard formats for fabrication. To overcome these difficulties, we have developed an efficient internal pore architecture design system based on triply periodic minimal surface (TPMS) unit cell libraries and associated computational methods to assemble TPMS unit cells into an entire scaffold model. In addition, we have developed a process planning technique based on TPMS internal architecture pattern of unit cells to generate tool paths for freeform fabrication of tissue engineering porous scaffolds. Copyright © 2012 IPEM. Published by Elsevier Ltd. All rights reserved.
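
    Two of the most widely used TPMS unit cells in such libraries can be written as level-set surfaces (standard approximations, quoted here as background rather than from the paper):

        \cos x + \cos y + \cos z = 0                                  (Schwarz Primitive)
        \sin x \cos y + \sin y \cos z + \sin z \cos x = 0             (gyroid)

    Sampling such an implicit function on a lattice and keeping the region where it lies below a threshold yields a porous unit cell whose volume fraction is tuned by the threshold, which is what makes TPMS cells convenient building blocks for scaffold assembly and tool-path planning.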

  3. Architectural slicing

    DEFF Research Database (Denmark)

    Christensen, Henrik Bærbak; Hansen, Klaus Marius

    2013-01-01

    Architectural prototyping is a widely used practice, concerned with taking architectural decisions through experiments with lightweight implementations. However, many architectural decisions are only taken when systems are already (partially) implemented. This is problematic in the context of architectural prototyping since experiments with full systems are complex and expensive and thus architectural learning is hindered. In this paper, we propose a novel technique for harvesting architectural prototypes from existing systems, "architectural slicing", based on dynamic program slicing. Given a system and a slicing criterion, architectural slicing produces an architectural prototype that contains the elements in the architecture that are dependent on the elements in the slicing criterion. Furthermore, we present an initial design and implementation of an architectural slicer for Java.

  4. Deciphering structural and temporal interplays during the architectural development of mango trees.

    Science.gov (United States)

    Dambreville, Anaëlle; Lauri, Pierre-Éric; Trottier, Catherine; Guédon, Yann; Normand, Frédéric

    2013-05-01

    Plant architecture is commonly defined by the adjacency of organs within the structure and their properties. Few studies consider the effect of endogenous temporal factors, namely phenological factors, on the establishment of plant architecture. This study hypothesized that, in addition to the effect of environmental factors, the observed plant architecture results from both endogenous structural and temporal components, and their interplays. Mango tree, which is characterized by strong phenological asynchronisms within and between trees and by repeated vegetative and reproductive flushes during a growing cycle, was chosen as a plant model. During two consecutive growing cycles, this study described vegetative and reproductive development of 20 trees submitted to the same environmental conditions. Four mango cultivars were considered to assess possible cultivar-specific patterns. Integrative vegetative and reproductive development models incorporating generalized linear models as components were built. These models described the occurrence, intensity, and timing of vegetative and reproductive development at the growth unit scale. This study showed significant interplays between structural and temporal components of plant architectural development at two temporal scales. Within a growing cycle, earliness of bud burst was highly and positively related to earliness of vegetative development and flowering. Between growing cycles, flowering growth units delayed vegetative development compared to growth units that did not flower. These interplays explained how vegetative and reproductive phenological asynchronisms within and between trees were generated and maintained. It is suggested that causation networks involving structural and temporal components may give rise to contrasted tree architectures.

  5. A family of fluoride-specific ion channels with dual-topology architecture.

    Science.gov (United States)

    Stockbridge, Randy B; Robertson, Janice L; Kolmakova-Partensky, Ludmila; Miller, Christopher

    2013-08-27

    Fluoride ion, ubiquitous in soil, water, and marine environments, is a chronic threat to microorganisms. Many prokaryotes, archaea, unicellular eukaryotes, and plants use a recently discovered family of F(-) exporter proteins to lower cytoplasmic F(-) levels to counteract the anion's toxicity. We show here that these 'Fluc' proteins, purified and reconstituted in liposomes and planar phospholipid bilayers, form constitutively open anion channels with extreme selectivity for F(-) over Cl(-). The active channel is a dimer of identical or homologous subunits arranged in antiparallel transmembrane orientation. This dual-topology assembly has not previously been seen in ion channels but is known in multidrug transporters of the SMR family, and is suggestive of an evolutionary antecedent of the inverted repeats found within the subunits of many membrane transport proteins. DOI:http://dx.doi.org/10.7554/eLife.01084.001.

  6. MHD code using multi graphical processing units: SMAUG+

    Science.gov (United States)

    Gyenge, N.; Griffiths, M. K.; Erdélyi, R.

    2018-01-01

    This paper introduces the Sheffield Magnetohydrodynamics Algorithm Using GPUs (SMAUG+), an advanced numerical code for solving magnetohydrodynamic (MHD) problems using multi-GPU systems. Multi-GPU systems facilitate the development of accelerated codes and enable us to investigate larger model sizes and/or more detailed computational domain resolutions. This is a significant advancement over the parent single-GPU MHD code, SMAUG (Griffiths et al., 2015). Here, we demonstrate the validity of the SMAUG+ code, describe the parallelisation techniques and investigate performance benchmarks. The initial configuration of the Orszag-Tang vortex simulations is distributed among 4, 16, 64 and 100 GPUs. Furthermore, different simulation box resolutions are applied: 1000 × 1000, 2044 × 2044, 4000 × 4000 and 8000 × 8000. We also tested the code with the Brio-Wu shock tube simulations with a model size of 800, employing up to 10 GPUs. Based on the test results, we observed speed-ups and slowdowns, depending on the granularity and the communication overhead of certain parallel tasks. The main aim of the code development is to provide a massively parallel code without the memory limitation of a single GPU. By using our code, the applied model size can be significantly increased. We demonstrate that we are able to successfully compute numerically valid and large 2D MHD problems.
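
    A minimal sketch of the row-wise multi-GPU domain decomposition such codes rely on is shown below (illustrative only; SMAUG+ is not necessarily structured this way). Each device holds rowsPerGpu interior rows plus one halo row at the top and bottom, and neighbouring devices exchange boundary rows each step; these exchanges are the communication overhead that drives the observed speed-ups and slowdowns:

        #include <cuda_runtime.h>

        // dField[g] points to device g's buffer of (rowsPerGpu + 2) rows of nx
        // floats: row 0 is the top halo, rows 1..rowsPerGpu are interior,
        // row rowsPerGpu + 1 is the bottom halo.
        void exchangeHalos(float** dField, int ngpu, int rowsPerGpu, int nx) {
            size_t row = nx * sizeof(float);
            for (int g = 0; g + 1 < ngpu; ++g) {
                // Last interior row of device g -> top halo of device g + 1.
                cudaMemcpyPeer(dField[g + 1],                     g + 1,
                               dField[g] + rowsPerGpu * nx,       g,     row);
                // First interior row of device g + 1 -> bottom halo of device g.
                cudaMemcpyPeer(dField[g] + (rowsPerGpu + 1) * nx, g,
                               dField[g + 1] + nx,                g + 1, row);
            }
            for (int g = 0; g < ngpu; ++g) { cudaSetDevice(g); cudaDeviceSynchronize(); }
        }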

  7. High performance architecture design for large scale fibre-optic sensor arrays using distributed EDFAs and hybrid TDM/DWDM

    Science.gov (United States)

    Liao, Yi; Austin, Ed; Nash, Philip J.; Kingsley, Stuart A.; Richardson, David J.

    2013-09-01

    A distributed amplified dense wavelength division multiplexing (DWDM) array architecture is presented for interferometric fibre-optic sensor array systems. This architecture employs a distributed erbium-doped fibre amplifier (EDFA) scheme to decrease the array insertion loss, and employs time division multiplexing (TDM) at each wavelength to increase the number of sensors that can be supported. The first experimental demonstration of this system is reported including results which show the potential for multiplexing and interrogating up to 4096 sensors using a single telemetry fibre pair with good system performance. The number can be increased to 8192 by using dual pump sources.
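
    The multiplexing arithmetic behind these figures is multiplicative. As an illustration (the actual wavelength/time-slot split is an assumption, not stated in the abstract), N_\lambda DWDM wavelengths each carrying N_{TDM} time-division channels support

        N_{sensors} = N_{\lambda} \times N_{TDM}, \qquad \text{e.g. } 64 \times 64 = 4096

    which is consistent with the stated doubling to 8192 when dual pump sources, presumably, support twice the wavelength capacity on the same fibre pair.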

  8. SUSTAINABLE ARCHITECTURE : WHAT ARCHITECTURE STUDENTS THINK

    OpenAIRE

    SATWIKO, PRASASTO

    2013-01-01

    Sustainable architecture has become a hot issue lately as the impacts of climate change become more intense. Architecture educations have responded by integrating knowledge of sustainable design in their curriculum. However, in the real life, new buildings keep coming with designs that completely ignore sustainable principles. This paper discusses the results of two national competitions on sustainable architecture targeted for architecture students (conducted in 2012 and 2013). The results a...

  9. The ATLAS Trigger Algorithms for General Purpose Graphics Processor Units

    CERN Document Server

    Tavares Delgado, Ademar; The ATLAS collaboration

    2016-01-01

    We present the ATLAS Trigger algorithms developed to exploit General-Purpose Graphics Processor Units. ATLAS is a particle physics experiment located at the LHC collider at CERN. The ATLAS Trigger system has two levels: hardware-based Level 1 and the High Level Trigger, implemented in software running on a farm of commodity CPUs. Performing the trigger event selection within the available farm resources presents a significant challenge that will increase with future LHC upgrades. GPUs are being evaluated as a potential solution for trigger algorithm acceleration. Key factors determining the potential benefit of this new technology are the relative execution speedup, the number of GPUs required and the relative financial cost of the selected GPU. We have developed a trigger demonstrator which includes algorithms for reconstructing tracks in the Inner Detector and Muon Spectrometer and clusters of energy deposited in the Cal...
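
    As a hedged illustration of how those key factors combine (a generic cost model, not from the ATLAS study): if a fraction f of the per-event CPU time is offloaded to a GPU with relative speedup S, the remaining CPU demand per event scales by the Amdahl factor

        T_{new} = T_{old}\,\bigl((1 - f) + f/S\bigr)

    so, purely for illustration, f = 0.75 and S = 12 give T_{new} ≈ 0.31 T_{old}; the offload then pays off financially only if the GPUs needed to sustain the offloaded throughput cost less than the CPU capacity they free.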

  10. A Low-Complexity Euclidean Orthogonal LDPC Architecture for Low Power Applications.

    Science.gov (United States)

    Revathy, M; Saravanan, R

    2015-01-01

    Low-density parity-check (LDPC) codes have been implemented in the latest digital video broadcasting, broadband wireless access (WiMax), and fourth-generation wireless standards. In this paper, we propose a highly efficient low-density parity-check (LDPC) decoder architecture for low-power applications. This study also considers the design and analysis of the check node and variable node units and the Euclidean orthogonal generator in the LDPC decoder architecture. The Euclidean orthogonal generator is used to reduce the error rate of the proposed LDPC architecture and can be incorporated between the check and variable node architecture. The proposed decoder design is synthesized on the Xilinx 9.2i platform and simulated using Modelsim, targeted to 45 nm devices. The synthesis report shows that the proposed architecture greatly reduces power consumption and hardware utilization compared with different conventional architectures.

  11. A Low-Complexity Euclidean Orthogonal LDPC Architecture for Low Power Applications

    Directory of Open Access Journals (Sweden)

    M. Revathy

    2015-01-01

    Full Text Available Low-density parity-check (LDPC) codes have been implemented in the latest digital video broadcasting, broadband wireless access (WiMax), and fourth-generation wireless standards. In this paper, we propose a highly efficient low-density parity-check (LDPC) decoder architecture for low-power applications. This study also considers the design and analysis of the check node and variable node units and the Euclidean orthogonal generator in the LDPC decoder architecture. The Euclidean orthogonal generator is used to reduce the error rate of the proposed LDPC architecture and can be incorporated between the check and variable node architecture. The proposed decoder design is synthesized on the Xilinx 9.2i platform and simulated using Modelsim, targeted to 45 nm devices. The synthesis report shows that the proposed architecture greatly reduces power consumption and hardware utilization compared with different conventional architectures.
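
    The arithmetic that check node units of this kind implement is the textbook min-sum update; a software sketch is given below (a generic illustration in CUDA, not the paper's VLSI design or its Euclidean orthogonal generator; the fixed check degree DC is an assumption):

        // Min-sum check-node update: the message to edge j is the product of
        // the signs of all other incoming messages times their minimum
        // magnitude. msgs and out hold DC messages per check node.
        #define DC 6
        __global__ void checkNodeMinSum(const float* msgs, float* out, int nChecks) {
            int c = blockIdx.x * blockDim.x + threadIdx.x;
            if (c >= nChecks) return;
            const float* m = msgs + c * DC;
            for (int j = 0; j < DC; ++j) {
                float sign = 1.0f, mag = 1e30f;
                for (int k = 0; k < DC; ++k) {
                    if (k == j) continue;                    // exclude target edge
                    sign *= (m[k] < 0.0f) ? -1.0f : 1.0f;
                    float a = fabsf(m[k]);
                    if (a < mag) mag = a;
                }
                out[c * DC + j] = sign * mag;
            }
        }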

  12. Accelerating MATLAB with GPU computing a primer with examples

    CERN Document Server

    Suh, Jung W

    2013-01-01

    Beyond simulation and algorithm development, many developers increasingly use MATLAB even for product deployment in computationally heavy fields. This often demands that MATLAB codes run faster by leveraging the distributed parallelism of Graphics Processing Units (GPUs). While MATLAB successfully provides high-level functions as a simulation tool for rapid prototyping, the underlying details and knowledge needed for utilizing GPUs make MATLAB users hesitate to step into it. Accelerating MATLAB with GPUs offers a primer on bridging this gap. Starting with the basics, setting up MATLAB for

  13. Lattice QCD simulations using the OpenACC platform

    International Nuclear Information System (INIS)

    Majumdar, Pushan

    2016-01-01

    In this article we will explore the OpenACC platform for programming Graphics Processing Units (GPUs). The OpenACC platform offers a directive-based programming model for GPUs which avoids the detailed data flow control and memory management necessary in a CUDA programming environment. In the OpenACC model, programs can be written in high-level languages with OpenMP-like directives. We present some examples of QCD simulation codes using OpenACC and discuss their performance on the Fermi and Kepler GPUs. (paper)
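
    A flavour of the directive style (an illustrative SAXPY, not taken from the article's QCD codes; it compiles with an OpenACC-capable compiler such as NVIDIA's nvc++):

        // SAXPY with OpenACC: the directive asks the compiler to offload the
        // loop and to manage the host-device data movement that CUDA would
        // require explicitly via cudaMalloc/cudaMemcpy.
        void saxpy(int n, float a, const float* x, float* y) {
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
            for (int i = 0; i < n; ++i)
                y[i] = a * x[i] + y[i];
        }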

  14. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects

    International Nuclear Information System (INIS)

    Agullo, Emmanuel; Demmel, Jim; Dongarra, Jack; Hadri, Bilel; Kurzak, Jakub; Langou, Julien; Ltaief, Hatem; Luszczek, Piotr; Tomov, Stanimire

    2009-01-01

    The emergence and continuing use of multi-core architectures and graphics processing units require changes in the existing software and sometimes even a redesign of the established algorithms in order to take advantage of the now prevailing parallelism. Parallel Linear Algebra for Scalable Multi-core Architectures (PLASMA) and Matrix Algebra on GPU and Multicore Architectures (MAGMA) are two projects that aim to achieve high performance and portability across a wide range of multi-core architectures and hybrid systems, respectively. We present in this document a comparative study of PLASMA's performance against established linear algebra packages and some preliminary results of MAGMA on hybrid multi-core and GPU systems.

  15. SULTAN HASSAN MOSQUE: AN ISLAMIC ARCHITECTURAL WONDER ANALYTICAL STUDY OF DESIGN AND ITS EFFECT ON ISLAMIC CAIRO

    Directory of Open Access Journals (Sweden)

    Kareem Adel Mohamed Kamal Ismail

    2012-03-01

    Full Text Available Cities in the 21st century are losing identity due to globalization and rapid urbanism. However, great architectural buildings like the Sultan Hassan Mosque Complex show us that architectural wonders can keep this identity and can positively affect society's life. The simple aim of this study is to investigate the relationship between architectural features and Islamic meanings in the modern world through studying the past. The study is mainly based on two main sources of data: a literature review covering the historical part, and a site visit dealing with discussion of architectural features, uses and effect on the surrounding society. Based on these sources, an analysis was made using a matrix relationship between two sets of criteria: architectural parts (design, usage, location, artistic features) and the mosque's significance in Islam (prayer house, community center, center of knowledge, meeting place for shoura). The findings demonstrate a close relationship between Islam and architecture, as Islamic principles affect the design of the mosque in religious, social, and service aspects; conversely, the architectural building satisfies Muslim needs. This dual effect shapes recommendations such as enhancing the multidimensional use of the mosque, strengthening the community-service role of the mosque, and developing the design of modern mosques to fulfill Muslim requirements with 21st-century measures while endorsing Islamic values through architecture.

  16. SuperNeurons: Dynamic GPU Memory Management for Training Deep Neural Networks

    OpenAIRE

    Wang, Linnan; Ye, Jinmian; Zhao, Yiyang; Wu, Wei; Li, Ang; Song, Shuaiwen Leon; Xu, Zenglin; Kraska, Tim

    2018-01-01

    Going deeper and wider in neural architectures improves accuracy, while the limited GPU DRAM places an undesired restriction on the network design domain. Deep Learning (DL) practitioners either need to change to less desired network architectures, or nontrivially dissect a network across multiple GPUs. These distract DL practitioners from concentrating on their original machine learning tasks. We present SuperNeurons: a dynamic GPU memory scheduling runtime to enable the network training far be...

  17. Architecture

    OpenAIRE

    Clear, Nic

    2014-01-01

    When discussing science fiction’s relationship with architecture, the usual practice is to look at the architecture “in” science fiction—in particular, the architecture in SF films (see Kuhn 75-143) since the spaces of literary SF present obvious difficulties as they have to be imagined. In this essay, that relationship will be reversed: I will instead discuss science fiction “in” architecture, mapping out a number of architectural movements and projects that can be viewed explicitly as scien...

  18. Performance evaluation of H.264/AVC decoding and visualization using the GPU

    OpenAIRE

    Pieters, Bart; Van Rijsselbergen, Dieter; De Neve, Wesley; Van de Walle, Rik

    2007-01-01

    The coding efficiency of the H.264/AVC standard makes the decoding process computationally demanding. This has limited the availability of cost-effective, high-performance solutions. Modern computers are typically equipped with powerful yet cost-effective Graphics Processing Units (GPUs) to accelerate graphics operations. These GPUs can be addressed by means of a 3-D graphics API such as Microsoft Direct3D or OpenGL, using programmable shaders as generic processing units for vector data. The ...

  19. QCD as a dual superconductor

    International Nuclear Information System (INIS)

    Zachariasen, F.

    1986-01-01

    The author describes the construction of an effective action describing long-range Yang-Mills theory. This action is motivated by a study of the system of Dyson equations and Ward identities, but cannot (yet) be derived from the underlying quantum theory. The effective action turns out to describe a medium very much like a dual relativistic superconductor; that is, with electric and magnetic fields interchanged. There is a dual Meissner effect, which serves to compress color electric fields into flux tubes, containing quantized units of color electric flux. This produces electric confinement. There is a magnetic condensate, resulting from a spontaneous symmetry breaking analogous to that in the relativistic superconductor, as in the Abelian Higgs model. He gives the motivation leading to the effective action, and describes the quantized electric flux tube solutions. Finally, he mentions briefly some other applications

  20. Enron Flaws In Organizational Architecture And Its Failure

    Directory of Open Access Journals (Sweden)

    Nguyen

    2015-08-01

    Full Text Available A series of corporate scandals at the beginning of the last decade gave rise to doubts about the efficiency of corporate governance practice in the United States. Of these scandals, the collapse of Enron exceptionally captured public concern. It was once the seventh-largest company in the United States [1]. It was rated the most innovative large company in America in Fortune's Most Admired Companies survey [2]. In August 2000 its stock reached a peak of nearly $70 billion [3]. However, within a year its stock had become almost worthless paper [2]. It was simply unbelievable for many people. What went wrong? Was it due to the failure of corporate governance in general? Actually, the central factor leading to the collapse of Enron was the failure of its organizational architecture. This paper starts by providing an overview of the corporate governance system with an emphasis on corporate organizational architecture as an important facet of it. Then it discusses flaws in the organizational architecture of Enron and argues that these eventually led to the breakdown of the whole corporate governance system at Enron. Finally, some implications and lessons for the practice of corporate governance are presented.

  1. Turbo decoder architecture for beyond-4G applications

    CERN Document Server

    Wong, Cheng-Chi

    2013-01-01

    This book describes the most recent techniques for turbo decoder implementation, especially for 4G and beyond 4G applications. The authors reveal techniques for the design of high-throughput decoders for future telecommunication systems, enabling designers to reduce hardware cost and shorten processing time. Coverage includes an explanation of VLSI implementation of the turbo decoder, from basic functional units to advanced parallel architecture. The authors discuss both hardware architecture techniques and experimental results, showing the variations in area/throughput/performance with respec

  2. VLSI Architectures for Sliding-Window-Based Space-Time Turbo Trellis Code Decoders

    Directory of Open Access Journals (Sweden)

    Georgios Passas

    2012-01-01

    Full Text Available The VLSI implementation of SISO-MAP decoders used for traditional iterative turbo coding has been investigated in the literature. In this paper, a complete architectural model of a space-time turbo code receiver that includes elementary decoders is presented. These architectures are based on newly proposed building blocks such as a recursive add-compare-select-offset (ACSO) unit, and A-, B-, Γ-, and LLR output calculation modules. Measurements of complexity and decoding delay of several sliding-window-technique-based MAP decoder architectures and a proposed parameter set lead to defining equations and a comparison between those architectures.

  3. Modeling Architectural Patterns Using Architectural Primitives

    NARCIS (Netherlands)

    Zdun, Uwe; Avgeriou, Paris

    2005-01-01

    Architectural patterns are a key point in architectural documentation. Regrettably, there is poor support for modeling architectural patterns, because the pattern elements are not directly matched by elements in modeling languages, and, at the same time, patterns support an inherent variability that

  4. Dead-time free pixel readout architecture for ATLAS front-end IC

    CERN Document Server

    Einsweiler, Kevin F; Kleinfelder, S A; Luo, L; Marchesini, R; Milgrome, O; Pengg, F X

    1999-01-01

    A low power sparse scan readout architecture has been developed for the ATLAS pixel front-end IC. The architecture supports a dual discriminator and extracts the time over threshold (TOT) information along with a 2-D spatial address of the hits, associating them with a unique 7-bit beam crossing number. The IC implements level-1 trigger filtering along with event building (grouping together all hits in a beam crossing) in the end of column (EOC) buffer. The events are transmitted over a 40 MHz serial data link with the protocol supporting buffer overflow handling by appending error flags to events. This mixed-mode full custom IC is implemented in a 0.8 μm HP process to meet the requirements for the pixel readout in the ATLAS inner detector. The circuits have been tested and the IC provides dead-time-less, ambiguity-free readout at a 40 MHz data rate.

  5. Dual Electron Spectrometer for Magnetospheric Multiscale Mission: Results of the Comprehensive Tests of the Engineering Test Unit

    Science.gov (United States)

    Avanov, Levon A.; Gliese, Ulrik; Mariano, Albert; Tucker, Corey; Barrie, Alexander; Chornay, Dennis J.; Pollock, Craig James; Kujawski, Joseph T.; Collinson, Glyn A.; Nguyen, Quang T.

    2011-01-01

    The Magnetospheric Multiscale mission (MMS) is designed to study fundamental phenomena in space plasma physics such as a magnetic reconnection. The mission consists of four spacecraft, equipped with identical scientific payloads, allowing for the first measurements of fast dynamics in the critical electron diffusion region where magnetic reconnection occurs and charged particles are demagnetized. The MMS orbit is optimized to ensure the spacecraft spend extended periods of time in locations where reconnection is known to occur: at the dayside magnetopause and in the magnetotail. In order to resolve fine structures of the three dimensional electron distributions in the diffusion region (reconnection site), the Fast Plasma Investigation's (FPI) Dual Electron Spectrometer (DES) is designed to measure three dimensional electron velocity distributions with an extremely high time resolution of 30 ms. In order to achieve this unprecedented sampling rate, four dual spectrometers, each sampling 180 x 45 degree sections of the sky, are installed on each spacecraft. We present results of the comprehensive tests performed on the DES Engineering & Test Unit (ETU). This includes main parameters of the spectrometer such as energy resolution, angular acceptance, and geometric factor along with their variations over the 16 pixels spanning the 180-degree tophat Electro Static Analyzer (ESA) field of view and over the energy of the test beam. A newly developed method for precisely defining the operational space of the instrument is presented as well. This allows optimization of the trade-off between pixel to pixel crosstalk and uniformity of the main spectrometer parameters.

  6. An evaluation of the potential of GPUs to accelerate tracking algorithms for the ATLAS trigger

    CERN Document Server

    Baines, JTM; The ATLAS collaboration; Emeliyanov, D; Howard, JR; Kama, S; Washbrook, AJ; Wynne, BM

    2014-01-01

    The potential of GPUs has been evaluated as a possible way to accelerate trigger algorithms for the ATLAS experiment located at the Large Hadron Collider (LHC). During LHC Run-1 ATLAS employed a three-level trigger system to progressively reduce the LHC collision rate of 20 MHz to a storage rate of about 600 Hz for offline processing. Reconstruction of charged particles trajectories through the Inner Detector (ID) was performed at the second (L2) and third (EF) trigger levels. The ID contains pixel, silicon strip (SCT) and straw-tube technologies. Prior to tracking, data-preparation algorithms processed the ID raw data producing measurements of the track position at each detector layer. The data-preparation and tracking consumed almost three-quarters of the total L2 CPU resources during 2012 data-taking. Detailed performance studies of a CUDA™ implementation of the L2 pixel and SCT data-preparation and tracking algorithms running on a Nvidia® Tesla C2050 GPU have shown a speed-up by a factor of 12 for the ...

  7. Tinker-OpenMM: Absolute and relative alchemical free energies using AMOEBA on GPUs.

    Science.gov (United States)

    Harger, Matthew; Li, Daniel; Wang, Zhi; Dalby, Kevin; Lagardère, Louis; Piquemal, Jean-Philip; Ponder, Jay; Ren, Pengyu

    2017-09-05

    The capabilities of polarizable force fields for alchemical free energy calculations have been limited by the high computational cost and complexity of the underlying potential energy functions. In this work, we present a GPU-based general alchemical free energy simulation platform for the polarizable AMOEBA potential. Tinker-OpenMM, the OpenMM implementation of the AMOEBA simulation engine, has been modified to enable both absolute and relative alchemical simulations on GPUs, which leads to a ∼200-fold improvement in simulation speed over a single CPU core. We show that free energy values calculated using this platform agree with the results of Tinker simulations for the hydration of organic compounds and binding of host-guest systems within the statistical errors. In addition to absolute binding, we designed a relative alchemical approach for computing relative binding affinities of ligands to the same host, where a special path was applied to avoid numerical instability due to polarization between the different ligands that bind to the same site. This scheme is general and does not require ligands to have similar scaffolds. We show that relative hydration and binding free energies calculated using this approach match those computed from the absolute free energy approach. © 2017 Wiley Periodicals, Inc.
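
    For background, the thermodynamic cycle underlying such relative binding calculations (standard alchemical free energy background, not the paper's derivation) replaces the two physical binding legs with two alchemical mutation legs:

        \Delta\Delta G_{bind} = \Delta G_{bind}(L_2) - \Delta G_{bind}(L_1)
                              = \Delta G_{L_1 \to L_2}^{bound} - \Delta G_{L_1 \to L_2}^{solvent}

    so only the L_1 -> L_2 transformations, once in the binding site and once in solution, need to be simulated.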

  8. On the Usage of GPUs for Efficient Motion Estimation in Medical Image Sequences

    Directory of Open Access Journals (Sweden)

    Jeyarajan Thiyagalingam

    2011-01-01

    Full Text Available Images are ubiquitous in biomedical applications from basic research to clinical practice. With the rapid increase in resolution, dimensionality of the images and the need for real-time performance in many applications, computational requirements demand proper exploitation of multicore architectures. Towards this, GPU-specific implementations of image analysis algorithms are particularly promising. In this paper, we investigate the mapping of an enhanced motion estimation algorithm to novel GPU-specific architectures, and the resulting challenges and benefits therein. Using a database of three-dimensional image sequences, we show that the mapping leads to substantial performance gains, up to a factor of 60, and can provide near-real-time experience. We also show how architectural peculiarities of these devices can best be exploited to the benefit of algorithms, most specifically for addressing the challenges related to their access patterns and different memory configurations. Finally, we evaluate the performance of the algorithm on three different GPU architectures and perform a comprehensive analysis of the results.

  9. A dual computed tomography linear accelerator unit for stereotactic radiation therapy: a new approach without cranially fixated stereotactic frames

    International Nuclear Information System (INIS)

    Uematsu, Minoru; Fukui, Toshiharu; Shioda, Akira; Tokumitsu, Hideyuki; Takai, Kenji; Kojima, Tadaharu; Asai, Yoshiko; Kusano, Shoichi

    1996-01-01

    Purpose: To perform stereotactic radiation therapy (SRT) without cranially fixated stereotactic frames, we developed a dual computed tomography (CT) linear accelerator (linac) treatment unit. Methods and Materials: This unit is composed of a linac, CT, and motorized table. The linac and CT are set up at opposite ends of the table, which is suitable for both machines. The gantry axis of the linac is coaxial with that of the CT scanner. Thus, the center of the target detected with the CT can be matched easily with the gantry axis of the linac by rotating the table. Positioning is confirmed with the CT for each treatment session. Positioning and treatment errors with this unit were examined by phantom studies. Between August and December 1994, 8 patients with 11 lesions of primary or metastatic brain tumors received SRT with this unit. All lesions were treated with 24 Gy in three fractions to 30 Gy in 10 fractions to the 80% isodose line, with or without conventional external beam radiation therapy. Results: Phantom studies revealed that treatment errors with this unit were within 1 mm after careful positioning. The position was easily maintained using two tiny metallic balls as vertical and horizontal marks. Motion of patients was negligible using a conventional heat-flexible head mold and dental impression. The overall time for a multiple noncoplanar arcs treatment for a single isocenter was less than 1 h on the initial treatment day and usually less than 20 min on subsequent days. Treatment was outpatient-based and well tolerated with no acute toxicities. Satisfactory responses have been documented. Conclusion: Using this treatment unit, multiple fractionated SRT is performed easily and precisely without cranially fixated stereotactic frames

  10. Dual-stroke heat pump field performance

    Science.gov (United States)

    Veyo, S. E.

    1984-11-01

    Two nearly identical prototype systems, each employing a unique dual-stroke compressor, were built and tested. One was installed in an occupied residence in Jeannette, Pa. It has provided the heating and cooling required from that time to the present. The system has functioned without failure of any prototypical advanced components, although early field experience did suffer from deficiencies in the software for the breadboard microprocessor control system. Analysis of field performance data indicates a heating seasonal performance factor (HSPF) of 8.13 Btu/Wh and a cooling seasonal energy efficiency ratio (SEER) of 8.35 Btu/Wh. Data indicate that the heat pump is oversized for the test house, since the observed lower balance point is 3 F whereas 17 F is optimum. Oversizing, coupled with the use of resistance heat to maintain delivered air temperature warmer than 90 F, results in the consumption of more resistance heat than expected, more unit cycling, and therefore lower than expected energy efficiency. Our analysis indicates that, with optimal sizing, the dual-stroke heat pump will yield an HSPF 30% better than a single-capacity heat pump representative of high-efficiency units in the marketplace today, for the observed weather profile.

  11. Dual Credit/Dual Enrollment and Data Driven Policy Implementation

    Science.gov (United States)

    Lichtenberger, Eric; Witt, M. Allison; Blankenberger, Bob; Franklin, Doug

    2014-01-01

    The use of dual credit has been expanding rapidly. Dual credit is a college course taken by a high school student for which both college and high school credit is given. Previous studies provided limited quantitative evidence that dual credit/dual enrollment is directly connected to positive student outcomes. In this study, predictive statistics…

  12. Dual-Doppler Feasibility Study

    Science.gov (United States)

    Huddleston, Lisa L.

    2012-01-01

    When two or more Doppler weather radar systems are monitoring the same region, the Doppler velocities can be combined to form a three-dimensional (3-D) wind vector field thus providing for a more intuitive analysis of the wind field. A real-time display of the 3-D winds can assist forecasters in predicting the onset of convection and severe weather. The data can also be used to initialize local numerical weather prediction models. Two operational Doppler Radar systems are in the vicinity of Kennedy Space Center (KSC) and Cape Canaveral Air Force Station (CCAFS); these systems are operated by the 45th Space Wing (45 SW) and the National Weather Service Melbourne, Fla. (NWS MLB). Dual-Doppler applications were considered by the 45 SW in choosing the site for the new radar. Accordingly, the 45th Weather Squadron (45 WS), NWS MLB and the National Aeronautics and Space Administration tasked the Applied Meteorology Unit (AMU) to investigate the feasibility of establishing dual-Doppler capability using the two existing systems. This study investigated technical, hardware, and software requirements necessary to enable the establishment of a dual-Doppler capability. Review of the available literature pertaining to the dual-Doppler technique and consultation with experts revealed that the physical locations and resulting beam crossing angles of the 45 SW and NWS MLB radars make them ideally suited for a dual-Doppler capability. The dual-Doppler equations were derived to facilitate complete understanding of dual-Doppler synthesis; to determine the technical information requirements; and to determine the components of wind velocity from the equation of continuity and radial velocity data collected by the two Doppler radars. Analysis confirmed the suitability of the existing systems to provide the desired capability. In addition, it is possible that both 45 SW radar data and Terminal Doppler Weather Radar data from Orlando International Airport could be used to alleviate any
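
    For background, the core of the dual-Doppler synthesis described above is standard (the low-elevation formulation, neglecting hydrometeor fall speed; not the AMU report's full derivation): each radar measures the projection of the horizontal wind (u, v) on its own viewing direction,

        V_{r1} = u\,\sin\alpha_1 + v\,\cos\alpha_1, \qquad
        V_{r2} = u\,\sin\alpha_2 + v\,\cos\alpha_2

    where \alpha_1 and \alpha_2 are the beam azimuths; wherever the beams cross at a sufficient angle, the pair can be solved for (u, v), and the vertical component then follows by integrating the continuity equation, as the study notes.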

  13. Reliability analysis of multicellular system architectures for low-cost satellites

    Science.gov (United States)

    Erlank, A. O.; Bridges, C. P.

    2018-06-01

    Multicellular system architectures are proposed as a solution to the problem of low reliability currently seen amongst small, low cost satellites. In a multicellular architecture, a set of independent k-out-of-n systems mimic the cells of a biological organism. In order to be beneficial, a multicellular architecture must provide more reliability per unit of overhead than traditional forms of redundancy. The overheads include power consumption, volume and mass. This paper describes the derivation of an analytical model for predicting a multicellular system's lifetime. The performance of such architectures is compared against that of several common forms of redundancy and proven to be beneficial under certain circumstances. In addition, the problem of peripheral interfaces and cross-strapping is investigated using a purpose-developed, multicellular simulation environment. Finally, two case studies are presented based on a prototype cell implementation, which demonstrate the feasibility of the proposed architecture.
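
    The reliability of one such cell follows the standard k-out-of-n formula (textbook background, consistent with but not copied from the paper's model): with unit reliability R(t),

        R_{cell}(t) = \sum_{i=k}^{n} \binom{n}{i} R(t)^{i} \bigl(1 - R(t)\bigr)^{n-i}

    and the multicellular system is a composition of such cells, so the architecture is beneficial only while the added units cost less in power, mass and volume than the reliability they buy.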

  14. The functional neuroanatomy of multitasking: combining dual tasking with a short term memory task.

    Science.gov (United States)

    Deprez, Sabine; Vandenbulcke, Mathieu; Peeters, Ron; Emsell, Louise; Amant, Frederic; Sunaert, Stefan

    2013-09-01

    Insight into the neural architecture of multitasking is crucial when investigating the pathophysiology of multitasking deficits in clinical populations. Presently, little is known about how the brain combines dual-tasking with a concurrent short-term memory task, despite the relevance of this mental operation in daily life and the frequency of complaints related to this process, in disease. In this study we aimed to examine how the brain responds when a memory task is added to dual-tasking. Thirty-three right-handed healthy volunteers (20 females, mean age 39.9 ± 5.8) were examined with functional brain imaging (fMRI). The paradigm consisted of two cross-modal single tasks (a visual and auditory temporal same-different task with short delay), a dual-task combining both single tasks simultaneously and a multi-task condition, combining the dual-task with an additional short-term memory task (temporal same-different visual task with long delay). Dual-tasking compared to both individual visual and auditory single tasks activated a predominantly right-sided fronto-parietal network and the cerebellum. When adding the additional short-term memory task, a larger and more bilateral frontoparietal network was recruited. We found enhanced activity during multitasking in components of the network that were already involved in dual-tasking, suggesting increased working memory demands, as well as recruitment of multitask-specific components including areas that are likely to be involved in online holding of visual stimuli in short-term memory such as occipito-temporal cortex. These results confirm concurrent neural processing of a visual short-term memory task during dual-tasking and provide evidence for an effective fMRI multitasking paradigm. © 2013 Elsevier Ltd. All rights reserved.

  15. Design of a dual linear polarization antenna using split ring resonators at X-band

    Science.gov (United States)

    Ahmed, Sadiq; Chandra, Madhukar

    2017-11-01

    Dual linear polarization microstrip antenna configurations are very suitable for high-performance satellites, wireless communication and radar applications. This paper presents a new method to improve the co-cross polarization discrimination (XPD) for dual linear polarized microstrip antennas at 10 GHz. For this, three various configurations of a dual linear polarization antenna utilizing metamaterial unit cells are shown. In the first layout, the microstrip patch antenna is loaded with two pairs of spiral ring resonators, in the second model, a split ring resonator is placed between two microstrip feed lines, and in the third design, a complementary split ring resonators are etched in the ground plane. This work has two primary goals: the first is related to the addition of metamaterial unit cells to the antenna structure which permits compensation for an asymmetric current distribution flow on the microstrip antenna and thus yields a symmetrical current distribution on it. This compensation leads to an important enhancement in the XPD in comparison to a conventional dual linear polarized microstrip patch antenna. The simulation reveals an improvement of 7.9, 8.8, and 4 dB in the E and H planes for the three designs, respectively, in the XPD as compared to the conventional dual linear polarized patch antenna. The second objective of this paper is to present the characteristics and performances of the designs of the spiral ring resonator (S-RR), split ring resonator (SRR), and complementary split ring resonator (CSRR) metamaterial unit cells. The simulations are evaluated using the commercial full-wave simulator, Ansoft High-Frequency Structure Simulator (HFSS).
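
    For reference, the figure of merit being improved is conventionally defined as (standard antenna background, not specific to this paper)

        \mathrm{XPD} = 20 \log_{10}\bigl(|E_{co}| / |E_{cross}|\bigr) \ \mathrm{dB}

    so the quoted 7.9, 8.8 and 4 dB gains are increases in this co- to cross-polarized field ratio.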

  16. Real-time Vision using FPGAs, GPUs and Multi-core CPUs

    DEFF Research Database (Denmark)

    Kjær-Nielsen, Anders

    the introduction and evolution of a wide variety of powerful hardware architectures have made the developed theory more applicable in performance-demanding and real-time applications. Three different architectures have dominated the field due to their parallel capabilities that are often desired when dealing ... processors in the vision community. The introduction of programming languages like CUDA from NVIDIA has made it easier to utilize the high parallel processing powers of the GPU for general-purpose computing and thereby realistic to use based on the effort involved with development. The increased clock frequencies and number of Configurable Logic Blocks (CLBs) of the FPGAs, as well as the introduction of dedicated hardware implementations like multipliers, Digital Signal Processing (DSP) slices and even embedded hard-core CPU implementations, have made them more applicable for general-purpose computing ...

  17. 77 FR 36231 - Americans With Disabilities Act (ADA) and Architectural Barriers Act (ABA) Accessibility...

    Science.gov (United States)

    2012-06-18

    ...-0004] RIN 3014-AA39 Americans With Disabilities Act (ADA) and Architectural Barriers Act (ABA... (ADA) and Architectural Barriers Act (ABA) Accessibility Guidelines to specifically address emergency... ensure that newly constructed and altered emergency transportable housing units covered by the ADA or ABA...

  18. Dual-source dual-energy CT for the differentiation of urinary stone composition: preliminary study

    International Nuclear Information System (INIS)

    Yang Qifang; Zhang Wanshi; Meng Limin; Shi Huiping; Wang Dong; Bi Yongmin; Li Xiangsheng; Fang Hong; Guo Heqing; Yan Jingmin

    2011-01-01

    Objective: To evaluate dual-source dual-energy CT (DSCT) for the differentiation of urinary stone composition in vitro. Methods: Ninety-seven urinary stones were obtained by endoscopic lithotripsy and scanned using dual-source dual-energy CT. The stones were divided into six groups according to infrared spectroscopy stone analysis: uric acid (UA) stones (n=10), cystine stones (n=5), struvite stones (n=6), calcium oxalate (CaOx) stones (n=22), mixed UA stones (n=7) and mixed calcium stones (n=47). Hounsfield units (HU) of each stone were recorded for the 80 kV and the 140 kV datasets by a hand-drawing method. HU difference, HU ratio and dual-energy index (DEI) were calculated and compared among the stone groups with one-way ANOVA. Dual-energy software was used to determine the composition of all stones, and the results were compared to infrared spectroscopy analysis. Results: There were statistical differences in HU difference [(-17±13), (229±34), (309±45), (512±97), (201±64) and (530±71) HU respectively], in HU ratio (0.96±0.03, 1.34±0.04, 1.41±0.03, 1.47±0.03, 1.30±0.07, and 1.49±0.03 respectively), and in DEI (-0.006±0.004, 0.064±0.007, 0.080±0.007, 0.108±0.011, 0.055±0.014 and 0.112±0.008 respectively) among the different stone groups (F=124.894, 407.028, 322.864 respectively, P<0.01). There were statistical differences in HU difference, HU ratio and DEI between UA stones and the other groups (P<0.01). There were statistical differences in HU difference, HU ratio and DEI between CaOx or mixed calcium stones and the other four groups (P<0.01). There was a statistical difference in HU ratio between cystine and struvite stones (P<0.01). There were statistical differences in HU difference, HU ratio and DEI between struvite and mixed UA stones (P<0.05). Dual-energy software correctly characterized 10 UA stones, 4 cystine stones, 22 CaOx stones and 6 mixed UA stones. Two struvite stones were considered to contain cystine. One cystine stone, 1 mixed UA stone, 4
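
    The dual-energy index used above is commonly defined as (the standard definition; the paper's exact variant is an assumption)

        \mathrm{DEI} = \frac{HU_{80} - HU_{140}}{HU_{80} + HU_{140} + 2000}

    where HU_{80} and HU_{140} are the attenuation values measured in the 80 kV and 140 kV datasets; the offset of 2000 rescales Hounsfield units into attenuation values referenced to water.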

  19. The Information-Seeking Habits of Architecture Faculty

    Science.gov (United States)

    Campbell, Lucy

    2017-01-01

    This study examines results from a survey of architecture faculty across the United States investigating information-seeking behavior and perceptions of library services. Faculty were asked to rank information sources they used for research, teaching, and creativity within their discipline. Sources were ranked similarly across these activities,…

  20. Research on GPU acceleration for Monte Carlo criticality calculation

    International Nuclear Information System (INIS)

    Xu, Q.; Yu, G.; Wang, K.

    2013-01-01

    The Monte Carlo (MC) neutron transport method can be naturally parallelized on multi-core architectures because particle histories are independent of one another during the simulation. The GPU+CPU heterogeneous parallel mode has become an increasingly popular form of parallelism in the field of scientific supercomputing. This work therefore focuses on a GPU acceleration method for Monte Carlo criticality simulation, as well as the computational efficiency that GPUs can bring. The 'neutron transport step' is introduced to increase GPU thread occupancy. In order to test sensitivity to MC code complexity, a 1D one-group code and a 3D multi-group general-purpose code are each ported to GPUs, and the acceleration effects are compared. The results of numerical experiments show a considerable acceleration effect from the 'neutron transport step' strategy. However, the performance comparison between the 1D code and the 3D code indicates the poor scalability of MC codes on GPUs. (authors)
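
    One plausible reading of the 'neutron transport step' strategy is sketched below (illustrative only, not the authors' code): each kernel launch advances every live particle by a single flight-and-collision step, and finished histories are compacted between launches so that warps stay fully occupied instead of idling on threads whose histories ended early.

        #include <curand_kernel.h>

        // Advance each live neutron by one step in a 1D slab of width xmax.
        // sigmaT is the total macroscopic cross section; absorbProb is the
        // probability that a collision is an absorption.
        __global__ void transportStep(float* x, int* alive, curandState* rng,
                                      int n, float sigmaT, float absorbProb,
                                      float xmax) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= n || !alive[i]) return;
            float flight = -logf(curand_uniform(&rng[i])) / sigmaT;    // sample free flight
            x[i] += flight;                                            // move along +x
            if (x[i] >= xmax)                       { alive[i] = 0; return; }  // leaked
            if (curand_uniform(&rng[i]) < absorbProb) alive[i] = 0;            // absorbed
        }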

  1. Centaure: an heterogeneous parallel architecture for computer vision

    International Nuclear Information System (INIS)

    Peythieux, Marc

    1997-01-01

    This dissertation deals with the architecture of parallel computers dedicated to computer vision. In the first chapter, the problem to be solved is presented, as well as the architecture of the Sympati and Symphonie computers, on which this work is based. The second chapter covers the state of the art of computers and integrated processors that can execute computer vision and image processing codes. The third chapter contains a description of the architecture of Centaure. It has a heterogeneous structure: it is composed of a multiprocessor system based on the Analog Devices ADSP21060 Sharc digital signal processor, and of a set of Symphonie computers working in a multi-SIMD fashion. Centaure also has a modular structure. Its basic node is composed of one Symphonie computer, tightly coupled to a Sharc thanks to a dual-ported memory. The nodes of Centaure are linked together by the Sharc communication links. The last chapter deals with a performance validation of Centaure. The execution times on Symphonie and on Centaure of a benchmark which is typical of industrial vision are presented and compared. In the first place, these results show that the basic node of Centaure allows a faster execution than Symphonie, and that increasing the size of the tested computer leads to a better speed-up with Centaure than with Symphonie. In the second place, these results validate the choice of running the low-level structure of Centaure in a multi-SIMD fashion. (author)

  2. Neural architecture design based on extreme learning machine.

    Science.gov (United States)

    Bueno-Crespo, Andrés; García-Laencina, Pedro J; Sancho-Gómez, José-Luis

    2013-12-01

    Selection of the optimal neural architecture to solve a pattern classification problem entails choosing the relevant input units, the number of hidden neurons and the corresponding interconnection weights. This problem has been widely studied in many research works, but the existing solutions usually involve excessive computational cost and do not provide a unique solution. This paper proposes a new technique to efficiently design the MultiLayer Perceptron (MLP) architecture for classification using the Extreme Learning Machine (ELM) algorithm. The proposed method provides high generalization capability and a unique solution for the architecture design. Moreover, the selected final network retains only those input connections that are relevant for the classification task. Experimental results show these advantages. Copyright © 2013 Elsevier Ltd. All rights reserved.
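
    For background, the ELM training step this design method builds on (standard formulation, not the paper's full procedure): input-to-hidden weights are drawn at random, and only the output weights are fitted by least squares,

        \beta = H^{\dagger} T

    where H is the hidden-layer output matrix over the training set, H^{\dagger} its Moore-Penrose pseudoinverse and T the target matrix; this closed-form solution is what makes repeated evaluation of candidate architectures cheap.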

  3. Wavy Architecture Thin-Film Transistor for Ultrahigh Resolution Flexible Displays

    KAUST Repository

    Hanna, Amir Nabil; Kutbee, Arwa Talal; Subedi, Ram Chandra; Ooi, Boon S.; Hussain, Muhammad Mustafa

    2017-01-01

    A novel wavy-shaped thin-film-transistor (TFT) architecture, capable of achieving 70% higher drive current per unit chip area when compared with planar conventional TFT architectures, is reported for flexible display application. The transistor, due to its atypical architecture, does not alter the turn-on voltage or the OFF current values, leading to higher performance without compromising static power consumption. The concept behind this architecture is expanding the transistor's width vertically through grooved trenches in a structural layer deposited on a flexible substrate. Operation of zinc oxide (ZnO)-based TFTs is shown down to a bending radius of 5 mm with no degradation in the electrical performance or cracks in the gate stack. Finally, flexible low-power LEDs driven by the respective currents of the novel wavy, and conventional coplanar architectures are demonstrated, where the novel architecture is able to drive the LED at 2 × the output power, 3 versus 1.5 mW, which demonstrates the potential use for ultrahigh resolution displays in an area efficient manner.

  4. Wavy Architecture Thin-Film Transistor for Ultrahigh Resolution Flexible Displays

    KAUST Repository

    Hanna, Amir Nabil

    2017-11-13

    A novel wavy-shaped thin-film-transistor (TFT) architecture, capable of achieving 70% higher drive current per unit chip area when compared with planar conventional TFT architectures, is reported for flexible display application. The transistor, due to its atypical architecture, does not alter the turn-on voltage or the OFF current values, leading to higher performance without compromising static power consumption. The concept behind this architecture is expanding the transistor's width vertically through grooved trenches in a structural layer deposited on a flexible substrate. Operation of zinc oxide (ZnO)-based TFTs is shown down to a bending radius of 5 mm with no degradation in the electrical performance or cracks in the gate stack. Finally, flexible low-power LEDs driven by the respective currents of the novel wavy, and conventional coplanar architectures are demonstrated, where the novel architecture is able to drive the LED at 2 × the output power, 3 versus 1.5 mW, which demonstrates the potential use for ultrahigh resolution displays in an area efficient manner.

  5. The use of sustainable architectural methods in designing a dwelling unit in Tripoli-Libya

    International Nuclear Information System (INIS)

    Almansouri, A. A.; El-Menghawi, F.

    2006-01-01

    The new urban built environment is considered the most energy-consuming theme, as industrial and technical devices have been applied without realising their side effects. Buildings have therefore started to depend mostly on mechanical equipment deemed to provide a comfortable atmosphere. As a consequence, this has led to, firstly, many ecological problems, such as the overuse and misuse of energy resources, pollution and diseases; and, secondly, an ignorance of the importance of local climatic conditions, which contribute to defining an identity related specifically to every climatic region. This results in a typicality that gives buildings the same features all around the world regardless of cultural, social and physical differences. In Libya, some issues related to this subject are neglected or rarely studied. This paper, therefore, aims to highlight some architectural solutions that contribute greatly to reducing a building's energy consumption as well as to creating an architecture related to the local environment. In this paper, an overview of the general architecture principles and a study of the components of the environmental treatments for different climatic zones will be given. A proposed model of a house for a Libyan family living in Tripoli will be shown in order to give an idea about the application of some architectural treatments for sustainable buildings, taking into consideration the physical, cultural and social differences. (Author)

  6. Islanding Control Architecture in future smart grid with both demand and wind turbine control

    DEFF Research Database (Denmark)

    Chen, Yu; Xu, Zhao; Østergaard, Jacob

    2013-01-01

    , which is the focus of this paper, available resources including both DG units and demand should be fully utilized as reserves. The control and coordination among different resources requires an integral architecture to serve the purpose. This paper develops the Islanding Control Architecture (ICA...

  7. Dual RING E3 Architectures Regulate Multiubiquitination and Ubiquitin Chain Elongation by APC/C.

    Science.gov (United States)

    Brown, Nicholas G; VanderLinden, Ryan; Watson, Edmond R; Weissmann, Florian; Ordureau, Alban; Wu, Kuen-Phon; Zhang, Wei; Yu, Shanshan; Mercredi, Peter Y; Harrison, Joseph S; Davidson, Iain F; Qiao, Renping; Lu, Ying; Dube, Prakash; Brunner, Michael R; Grace, Christy R R; Miller, Darcie J; Haselbach, David; Jarvis, Marc A; Yamaguchi, Masaya; Yanishevski, David; Petzold, Georg; Sidhu, Sachdev S; Kuhlman, Brian; Kirschner, Marc W; Harper, J Wade; Peters, Jan-Michael; Stark, Holger; Schulman, Brenda A

    2016-06-02

    Protein ubiquitination involves E1, E2, and E3 trienzyme cascades. E2 and RING E3 enzymes often collaborate to first prime a substrate with a single ubiquitin (UB) and then achieve different forms of polyubiquitination: multiubiquitination of several sites and elongation of linkage-specific UB chains. Here, cryo-EM and biochemistry show that the human E3 anaphase-promoting complex/cyclosome (APC/C) and its two partner E2s, UBE2C (aka UBCH10) and UBE2S, adopt specialized catalytic architectures for these two distinct forms of polyubiquitination. The APC/C RING constrains UBE2C proximal to a substrate and simultaneously binds a substrate-linked UB to drive processive multiubiquitination. Alternatively, during UB chain elongation, the RING does not bind UBE2S but rather lures an evolving substrate-linked UB to UBE2S positioned through a cullin interaction to generate a Lys11-linked chain. Our findings define mechanisms of APC/C regulation, and establish principles by which specialized E3-E2-substrate-UB architectures control different forms of polyubiquitination. Copyright © 2016 Elsevier Inc. All rights reserved.

  8. Parallelization and checkpointing of GPU applications through program transformation

    Energy Technology Data Exchange (ETDEWEB)

    Solano-Quinde, Lizandro Damian [Iowa State Univ., Ames, IA (United States)

    2012-01-01

    GPUs have emerged as a powerful tool for accelerating general-purpose applications. The availability of programming languages that make writing general-purpose applications for GPUs tractable has consolidated GPUs as an alternative for accelerating general-purpose applications. Among the areas that have benefited from GPU acceleration are signal and image processing, computational fluid dynamics, quantum chemistry, and, in general, the High Performance Computing (HPC) industry. In order to continue to exploit higher levels of parallelism with GPUs, multi-GPU systems are gaining popularity; in this context, single-GPU applications are parallelized for running on multi-GPU systems. Furthermore, multi-GPU systems help to overcome the GPU memory limitation for applications with large memory footprints. Parallelizing single-GPU applications has been approached through libraries that distribute the workload at runtime; however, these impose execution overhead and are not portable. On traditional CPU systems, by contrast, parallelization has been approached through application transformation at pre-compile time, which enhances the application to distribute the workload at the application level and does not suffer the issues of library-based approaches. Hence, a parallelization scheme for GPU systems based on application transformation is needed. As with any computing engine of today, reliability is also a concern in GPUs: GPUs are vulnerable to transient and permanent failures, and current checkpoint/restart techniques are not suitable for systems with GPUs. Checkpointing for GPU systems presents new and interesting challenges, primarily due to the natural differences imposed by the hardware design, the memory subsystem architecture, the massive number of threads, and the limited amount of synchronization among threads. Therefore, a checkpoint/restart technique suitable for GPU systems is needed. The goal of this work is to exploit higher levels of parallelism and
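
    The abstract stops short of implementation detail, but the basic mechanism of checkpointing GPU state can be sketched in a few lines of CUDA C. The function below is a minimal illustration, assuming a single device buffer and a quiescent point between kernel launches; the function and variable names are hypothetical and not taken from the dissertation.

      #include <cstdio>
      #include <cstdlib>
      #include <cuda_runtime.h>

      // Copy one device buffer to the host and serialize it to disk.
      // Safe only at a point where no kernel is mutating d_buf,
      // hence the synchronize before the copy.
      bool checkpoint_device_buffer(const float *d_buf, size_t n, const char *path)
      {
          float *h_buf = (float *)malloc(n * sizeof(float));
          if (!h_buf) return false;

          cudaDeviceSynchronize();
          cudaMemcpy(h_buf, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);

          FILE *f = fopen(path, "wb");
          if (!f) { free(h_buf); return false; }
          fwrite(h_buf, sizeof(float), n, f);
          fclose(f);
          free(h_buf);
          return true;
      }

    Restart is the mirror image: read the file into a host buffer and copy it back to the device with cudaMemcpy before resuming kernel launches.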

  9. DeepSAT: A Deep Learning Approach to Tree-cover Delineation in 1-m NAIP Imagery for the Continental United States

    Science.gov (United States)

    Ganguly, S.; Basu, S.; Nemani, R. R.; Mukhopadhyay, S.; Michaelis, A.; Votava, P.

    2016-12-01

    High resolution tree cover classification maps are needed to increase the accuracy of current land ecosystem and climate model outputs. Few studies demonstrate the state of the art in deriving very high resolution (VHR) tree cover products, and most methods rely heavily on commercial software that is difficult to scale given the region of study (e.g., continents to globe). Complexities in present approaches relate to (a) scalability of the algorithm, (b) large image data processing (compute and memory intensive), (c) computational cost, (d) massively parallel architecture, and (e) machine learning automation. In addition, VHR satellite datasets are of the order of terabytes, and features extracted from these datasets are of the order of petabytes. In our present study, we have acquired the National Agriculture Imagery Program (NAIP) dataset for the Continental United States at a spatial resolution of 1 m. This data comes as image tiles (a total of a quarter million image scenes, each with 60 million pixels) and has a total size of 65 terabytes for a single acquisition. Features extracted from the entire dataset would amount to 8-10 petabytes. In our proposed approach, we have implemented a novel semi-automated machine learning algorithm rooted in the principles of "deep learning" to delineate the percentage of tree cover. Using the NASA Earth Exchange (NEX) initiative, we have developed an end-to-end architecture integrating a segmentation module based on Statistical Region Merging, a classification algorithm using a Deep Belief Network, and a structured prediction algorithm using Conditional Random Fields that fuses the results from the segmentation and classification modules to create per-pixel class labels. The training process is scaled up using the power of GPUs, and the prediction is scaled to a quarter million NAIP tiles spanning the whole of the Continental United States using the NEX HPC supercomputing cluster. An initial pilot over the

  10. DeepSAT: A Deep Learning Approach to Tree-Cover Delineation in 1-m NAIP Imagery for the Continental United States

    Science.gov (United States)

    Ganguly, Sangram; Basu, Saikat; Nemani, Ramakrishna R.; Mukhopadhyay, Supratik; Michaelis, Andrew; Votava, Petr

    2016-01-01

    High resolution tree cover classification maps are needed to increase the accuracy of current land ecosystem and climate model outputs. Few studies demonstrate the state of the art in deriving very high resolution (VHR) tree cover products, and most methods rely heavily on commercial software that is difficult to scale given the region of study (e.g., continents to globe). Complexities in present approaches relate to (a) scalability of the algorithm, (b) large image data processing (compute and memory intensive), (c) computational cost, (d) massively parallel architecture, and (e) machine learning automation. In addition, VHR satellite datasets are of the order of terabytes, and features extracted from these datasets are of the order of petabytes. In our present study, we have acquired the National Agriculture Imagery Program (NAIP) dataset for the Continental United States at a spatial resolution of 1 m. This data comes as image tiles (a total of a quarter million image scenes, each with 60 million pixels) and has a total size of 65 terabytes for a single acquisition. Features extracted from the entire dataset would amount to 8-10 petabytes. In our proposed approach, we have implemented a novel semi-automated machine learning algorithm rooted in the principles of "deep learning" to delineate the percentage of tree cover. Using the NASA Earth Exchange (NEX) initiative, we have developed an end-to-end architecture integrating a segmentation module based on Statistical Region Merging, a classification algorithm using a Deep Belief Network, and a structured prediction algorithm using Conditional Random Fields that fuses the results from the segmentation and classification modules to create per-pixel class labels. The training process is scaled up using the power of GPUs, and the prediction is scaled to a quarter million NAIP tiles spanning the whole of the Continental United States using the NEX HPC supercomputing cluster. An initial pilot over the

  11. Actuator digital interface unit (AIU). [control units for space shuttle data system

    Science.gov (United States)

    1973-01-01

    Alternate versions of the actuator interface unit are presented. One alternate is a dual-failure-immune configuration which feeds a look-and-switch dual-failure-immune hydraulic system. The other alternate is a single-failure-immune configuration which feeds a majority-voting hydraulic system. Both systems communicate with the data bus through data terminals dedicated to each user subsystem. Both operational control data and configuration control information are processed in and out of the subsystem via the data terminal, yielding an actuator interface subsystem that is self-managing within its failure-immunity capability.

  12. Intelligence in Construction between Contemporary and Traditional Architecture

    Directory of Open Access Journals (Sweden)

    Khalid abdul wahab

    2016-10-01

    Full Text Available Authentic traditional architecture proved very well suited to the environmental and social conditions where it appeared, and it lasted for hundreds of years. This traditional architecture achieved intelligence in providing thermal comfort for its occupants through the intelligent use of building materials and intelligent planning and design, which took into consideration the climatic conditions and the aerodynamics of the whole city as one ecological system: from the cold breeze passing through its narrow streets until it enters the dwelling units and glides out through the wind catchers. This architecture was neglected and replaced by modern imported architecture, which collapsed within a few decades, not only in our region but all over the world. Nowadays, with the increasing awareness of the global environmental crisis, green or sustainable architecture appeared as a new type of architecture in the eighties of the last century. Today these types of buildings have begun to enter Arab countries in general, and the Arabian Gulf countries especially (for example Masdar City), where smart buildings with intelligent treatments can be seen replacing the modern high-rise towers with their glass facades. This leads to the general research problem, which concerns the shortage of knowledge about the intelligence in design and planning used in traditional architecture, and the specific research problem, which concerns the shortage of knowledge about the ability to use intelligent traditional architectural and planning concepts as a base for future sustainable modern buildings and cities, which is the aim of this research.

  13. The network architecture and site test of DCIS in Lungmen nuclear power station

    International Nuclear Information System (INIS)

    Lee, C. K.

    2006-01-01

    The Lungmen Nuclear Power Station (LMNPS) is located on the north-eastern seashore of Taiwan. LMNPS has two units, each generating 1350 megawatts. It is the first ABWR plant in Taiwan and is under construction now. Due to contractual arrangements, there are seven large I and C suppliers/designers: GE NUMAC, DRS, Invensys, GEIS, Hitachi, MHI, and Stone and Webster. The Distributed Control and Information System (DCIS) in Lungmen is fully integrated with state-of-the-art computer and network technology, with General Electric as the lead designer for the integration of the DCIS. This paper presents the network architecture and the site test of the DCIS. The network architectures are as follows: the GE NUMAC system adopts a point-to-point architecture; the DRS system adopts a ring-type architecture with the SCRAMNET protocol; the Invensys system adopts a 1-Gigabyte backbone mesh network with the Rapid Spanning Tree Protocol; GEIS adopts an Ethernet network with the EGD protocol; Hitachi adopts a ring-type network with a proprietary protocol; and MHI adopts an Ethernet network with UDP. Data links are used for connections between the different suppliers. The DCIS architecture supports plant automation, alarm prioritization and alarm suppression, and uniform MMI screens for the entire plant. The test program covering the integration of the different network architectures and the initial DCIS architecture setup for 161 kV energization is discussed, along with test tools for improving the site test schedule and lessons learned from FAT. Conclusions are given at the end of this paper. (authors)

  14. The network architecture and site test of DCIS in Lungmen nuclear power station

    Energy Technology Data Exchange (ETDEWEB)

    Lee, C. K. [Instrument and Control Section, Lungmen Nuclear Power Station, Taiwan Power Company, Taipei County Taiwan (China)

    2006-07-01

    The Lungmen Nuclear Power Station (LMNPS) is located on the north-eastern seashore of Taiwan. LMNPS has two units, each generating 1350 megawatts. It is the first ABWR plant in Taiwan and is under construction now. Due to contractual arrangements, there are seven large I and C suppliers/designers: GE NUMAC, DRS, Invensys, GEIS, Hitachi, MHI, and Stone and Webster. The Distributed Control and Information System (DCIS) in Lungmen is fully integrated with state-of-the-art computer and network technology, with General Electric as the lead designer for the integration of the DCIS. This paper presents the network architecture and the site test of the DCIS. The network architectures are as follows: the GE NUMAC system adopts a point-to-point architecture; the DRS system adopts a ring-type architecture with the SCRAMNET protocol; the Invensys system adopts a 1-Gigabyte backbone mesh network with the Rapid Spanning Tree Protocol; GEIS adopts an Ethernet network with the EGD protocol; Hitachi adopts a ring-type network with a proprietary protocol; and MHI adopts an Ethernet network with UDP. Data links are used for connections between the different suppliers. The DCIS architecture supports plant automation, alarm prioritization and alarm suppression, and uniform MMI screens for the entire plant. The test program covering the integration of the different network architectures and the initial DCIS architecture setup for 161 kV energization is discussed, along with test tools for improving the site test schedule and lessons learned from FAT. Conclusions are given at the end of this paper. (authors)

  15. Geometrical objects architecture and the mathematical sciences 1400-1800

    CERN Document Server

    2014-01-01

    This volume explores the mathematical character of architectural practice in diverse pre- and early modern contexts. It takes an explicitly interdisciplinary approach, which unites scholarship in early modern architecture with recent work in the history of science, in particular, on the role of practice in the scientific revolution. As a contribution to architectural history, the volume contextualizes design and construction in terms of contemporary mathematical knowledge, attendant forms of mathematical practice, and relevant social distinctions between the mathematical professions. As a contribution to the history of science, the volume presents a series of micro-historical studies that highlight issues of process, materiality, and knowledge production in specific, situated, practical contexts. Our approach sees the designer’s studio, the stone-yard, the drawing floor, and construction site not merely as places where the architectural object takes shape, but where mathematical knowledge itself is depl...

  16. Aggregation of Individual Sensing Units for Signal Accumulation: Conversion of Liquid-Phase Colorimetric Assay into Enhanced Surface-Tethered Electrochemical Analysis.

    Science.gov (United States)

    Wei, Tianxiang; Dong, Tingting; Wang, Zhaoyin; Bao, Jianchun; Tu, Wenwen; Dai, Zhihui

    2015-07-22

    A novel concept is proposed for converting liquid-phase colorimetric assay into enhanced surface-tethered electrochemical analysis, which is based on the analyte-induced formation of a network architecture of metal nanoparticles (MNs). In a proof-of-concept trial, thymine-functionalized silver nanoparticle (Ag-T) is designed as the sensing unit for Hg(2+) determination. Through a specific T-Hg(2+)-T coordination, the validation system based on functionalized sensing units not only can perform well in a colorimetric Hg(2+) assay, but also can be developed into a more sensitive and stable electrochemical Hg(2+) sensor. In electrochemical analysis, the simple principle of analyte-induced aggregation of MNs can be used as a dual signal amplification strategy for significantly improving the detection sensitivity. More importantly, those numerous and diverse colorimetric assays that rely on the target-induced aggregation of MNs can be augmented to satisfy the ambitious demands of sensitive analysis by converting them into electrochemical assays via this approach.

  17. FAST CALCULATION OF THE LOMB-SCARGLE PERIODOGRAM USING GRAPHICS PROCESSING UNITS

    International Nuclear Information System (INIS)

    Townsend, R. H. D.

    2010-01-01

    I introduce a new code for fast calculation of the Lomb-Scargle periodogram that leverages the computing power of graphics processing units (GPUs). After establishing a background to the newly emergent field of GPU computing, I discuss the code design and narrate key parts of its source. Benchmarking calculations indicate no significant differences in accuracy compared to an equivalent CPU-based code. However, the differences in performance are pronounced; running on a low-end GPU, the code can match eight CPU cores, and on a high-end GPU it is faster by a factor approaching 30. Applications of the code include analysis of long photometric time series obtained by ongoing satellite missions and upcoming ground-based monitoring facilities, and Monte Carlo simulation of periodogram statistical properties.
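
    The computation is embarrassingly parallel across trial frequencies, so a natural GPU mapping assigns one thread per frequency. The kernel below is a minimal CUDA sketch of that mapping using the standard Lomb-Scargle formula; it is not the author's code, and the names and launch configuration are illustrative. The input y is assumed to be mean-subtracted.

      __global__ void lomb_scargle(const float *t, const float *y, int n,
                                   const float *freq, float *power, int nf)
      {
          int j = blockIdx.x * blockDim.x + threadIdx.x;
          if (j >= nf) return;

          float w = 2.0f * 3.14159265f * freq[j];

          // Time offset tau from tan(2*w*tau) = sum(sin 2wt) / sum(cos 2wt)
          float s2 = 0.0f, c2 = 0.0f;
          for (int i = 0; i < n; ++i) {
              s2 += sinf(2.0f * w * t[i]);
              c2 += cosf(2.0f * w * t[i]);
          }
          float tau = 0.5f / w * atan2f(s2, c2);

          // Projections of the data onto the shifted cosine/sine basis
          float yc = 0.0f, ys = 0.0f, cc = 0.0f, ss = 0.0f;
          for (int i = 0; i < n; ++i) {
              float c = cosf(w * (t[i] - tau));
              float s = sinf(w * (t[i] - tau));
              yc += y[i] * c;  ys += y[i] * s;
              cc += c * c;     ss += s * s;
          }
          power[j] = 0.5f * (yc * yc / cc + ys * ys / ss);
      }

    A launch such as lomb_scargle<<<(nf + 255) / 256, 256>>>(...) evaluates all nf frequencies concurrently; the per-thread loops over the n observations are where a tuned implementation would add shared-memory staging of the time series.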

  18. Non metrizable topologies on Z with countable dual group.

    Directory of Open Access Journals (Sweden)

    Daniel de la Barrera Mayoral

    2017-04-01

    Full Text Available In this paper we give two families of non-metrizable topologies on the group of the integers having a countable dual group which is isomorphic to an infinite torsion subgroup of the unit circle in the complex plane. Both families are related to D-sequences, which are sequences of natural numbers such that each term divides the following one. The first family consists of locally quasi-convex group topologies. The second consists of complete topologies which are not locally quasi-convex. In order to study the dual groups for both families we need to make numerical considerations of independent interest.
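
    For reference, the two objects named in the abstract can be written out directly; the following is a routine restatement of the definitions (the notation for the subgroup is ours, not the paper's):

      \[
        (b_n)_{n\in\mathbb{N}} \subset \mathbb{N}, \qquad b_n \mid b_{n+1} \ \text{for all } n
        \qquad \text{(a D-sequence)},
      \]
      \[
        \mathbb{Z}(b_\infty) \;=\; \bigl\{\, e^{2\pi i k / b_n} \;:\; k \in \mathbb{Z},\ n \in \mathbb{N} \,\bigr\} \;\leq\; \mathbb{T}
        \qquad \text{(the torsion subgroup of the circle to which the dual group is isomorphic)}.
      \]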

  19. Improving Software Performance in the Compute Unified Device Architecture

    Directory of Open Access Journals (Sweden)

    Alexandru PIRJAN

    2010-01-01

    Full Text Available This paper analyzes several aspects regarding the improvement of software performance for applications written in the Compute Unified Device Architecture (CUDA). We address an issue of great importance when programming a CUDA application: the Graphics Processing Unit’s (GPU’s) memory management through transpose kernels. We also benchmark and evaluate the performance of progressively optimizing a matrix-transposition application in CUDA. One particular interest was to research how well the optimization techniques applied to software applications written in CUDA scale to the latest generation of general-purpose graphics processing units (GPGPUs), like the Fermi architecture implemented in the GTX480 and the previous architecture implemented in the GTX280. Lately, there has been a lot of interest in the literature in this type of optimization analysis, but none of the works so far (to the best of our knowledge) tried to validate whether the optimizations apply to a GPU from the latest Fermi architecture and how well the Fermi architecture scales to these software performance-improving techniques.
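
    The canonical form of such a transpose kernel stages a tile through shared memory so that both the global-memory read and the write are coalesced. The sketch below shows the technique in generic CUDA (the +1 padding avoids shared-memory bank conflicts); it illustrates the class of kernels benchmarked in the paper rather than reproducing them.

      #define TILE 32

      // Launch with dim3 block(TILE, TILE) and a grid covering the matrix.
      __global__ void transpose(float *out, const float *in, int width, int height)
      {
          __shared__ float tile[TILE][TILE + 1];   // +1 breaks bank conflicts

          int x = blockIdx.x * TILE + threadIdx.x;
          int y = blockIdx.y * TILE + threadIdx.y;
          if (x < width && y < height)
              tile[threadIdx.y][threadIdx.x] = in[y * width + x];

          __syncthreads();

          // Swap block indices so writes to the transposed matrix coalesce
          x = blockIdx.y * TILE + threadIdx.x;
          y = blockIdx.x * TILE + threadIdx.y;
          if (x < height && y < width)
              out[y * height + x] = tile[threadIdx.x][threadIdx.y];
      }

    Progressive optimization of exactly this pattern (naive copy, shared-memory tiling, padding) is the usual benchmark sequence for comparing architectures such as the GTX280 and the Fermi-based GTX480.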

  20. GAMER: A GRAPHIC PROCESSING UNIT ACCELERATED ADAPTIVE-MESH-REFINEMENT CODE FOR ASTROPHYSICS

    International Nuclear Information System (INIS)

    Schive, H.-Y.; Tsai, Y.-C.; Chiueh Tzihong

    2010-01-01

    We present the newly developed code, GPU-accelerated Adaptive-MEsh-Refinement code (GAMER), which adopts a novel approach in improving the performance of adaptive-mesh-refinement (AMR) astrophysical simulations by a large factor with the use of the graphic processing unit (GPU). The AMR implementation is based on a hierarchy of grid patches with an oct-tree data structure. We adopt a three-dimensional relaxing total variation diminishing scheme for the hydrodynamic solver and a multi-level relaxation scheme for the Poisson solver. Both solvers have been implemented in GPU, by which hundreds of patches can be advanced in parallel. The computational overhead associated with the data transfer between the CPU and GPU is carefully reduced by utilizing the capability of asynchronous memory copies in GPU, and the computing time of the ghost-zone values for each patch is diminished by overlapping it with the GPU computations. We demonstrate the accuracy of the code by performing several standard test problems in astrophysics. GAMER is a parallel code that can be run in a multi-GPU cluster system. We measure the performance of the code by performing purely baryonic cosmological simulations in different hardware implementations, in which detailed timing analyses provide comparison between the computations with and without GPU(s) acceleration. Maximum speed-up factors of 12.19 and 10.47 are demonstrated using one GPU with 4096³ effective resolution and 16 GPUs with 8192³ effective resolution, respectively.
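
    The copy/compute overlap described above follows a standard CUDA pattern: pinned host buffers, two streams, and a ping-pong over device buffers so that the transfer for one patch proceeds while the kernel for the previous patch runs. The sketch below is a generic illustration of that pattern, not GAMER source; advance_patch is a placeholder for the hydrodynamic or Poisson solver.

      #include <cuda_runtime.h>

      __global__ void advance_patch(float *p, int n)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) p[i] += 1.0f;   // placeholder for the real solver
      }

      // h_patch[k] must be pinned (cudaMallocHost) for copies to overlap.
      void advance_all(float **h_patch, int npatch, int n)
      {
          float *d_buf[2];
          cudaStream_t stream[2];
          for (int s = 0; s < 2; ++s) {
              cudaMalloc(&d_buf[s], n * sizeof(float));
              cudaStreamCreate(&stream[s]);
          }
          for (int k = 0; k < npatch; ++k) {
              int s = k & 1;   // ping-pong between the two streams
              cudaMemcpyAsync(d_buf[s], h_patch[k], n * sizeof(float),
                              cudaMemcpyHostToDevice, stream[s]);
              advance_patch<<<(n + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], n);
              cudaMemcpyAsync(h_patch[k], d_buf[s], n * sizeof(float),
                              cudaMemcpyDeviceToHost, stream[s]);
          }
          for (int s = 0; s < 2; ++s) {
              cudaStreamSynchronize(stream[s]);
              cudaStreamDestroy(stream[s]);
              cudaFree(d_buf[s]);
          }
      }

    Because operations within a stream execute in order, reusing d_buf[s] two patches later is safe; the overlap comes from the two streams advancing alternate patches concurrently.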