WorldWideScience

Sample records for fast multicore processor

  1. A fast band–Krylov eigensolver for macromolecular functional motion simulation on multicore architectures and graphics processors

    Energy Technology Data Exchange (ETDEWEB)

    Aliaga, José I., E-mail: aliaga@uji.es [Depto. Ingeniería y Ciencia de Computadores, Universitat Jaume I, Castellón (Spain); Alonso, Pedro [Departamento de Sistemas Informáticos y Computación, Universitat Politècnica de València (Spain); Badía, José M. [Depto. Ingeniería y Ciencia de Computadores, Universitat Jaume I, Castellón (Spain); Chacón, Pablo [Dept. Biological Chemical Physics, Rocasolano Physics and Chemistry Institute, CSIC, Madrid (Spain); Davidović, Davor [Rudjer Bošković Institute, Centar za Informatiku i Računarstvo – CIR, Zagreb (Croatia); López-Blanco, José R. [Dept. Biological Chemical Physics, Rocasolano Physics and Chemistry Institute, CSIC, Madrid (Spain); Quintana-Ortí, Enrique S. [Depto. Ingeniería y Ciencia de Computadores, Universitat Jaume I, Castellón (Spain)

    2016-03-15

    We introduce a new iterative Krylov subspace-based eigensolver for the simulation of macromolecular motions on desktop multithreaded platforms equipped with multicore processors and, possibly, a graphics accelerator (GPU). The method consists of two stages, with the original problem first reduced into a simpler band-structured form by means of a high-performance compute-intensive procedure. This is followed by a memory-intensive but low-cost Krylov iteration, which is off-loaded to be computed on the GPU by means of an efficient data-parallel kernel. The experimental results reveal the performance of the new eigensolver. Concretely, when applied to the simulation of macromolecules with a few thousands degrees of freedom and the number of eigenpairs to be computed is small to moderate, the new solver outperforms other methods implemented as part of high-performance numerical linear algebra packages for multithreaded architectures.

  2. A fast band–Krylov eigensolver for macromolecular functional motion simulation on multicore architectures and graphics processors

    International Nuclear Information System (INIS)

    Aliaga, José I.; Alonso, Pedro; Badía, José M.; Chacón, Pablo; Davidović, Davor; López-Blanco, José R.; Quintana-Ortí, Enrique S.

    2016-01-01

    We introduce a new iterative Krylov subspace-based eigensolver for the simulation of macromolecular motions on desktop multithreaded platforms equipped with multicore processors and, possibly, a graphics accelerator (GPU). The method consists of two stages, with the original problem first reduced into a simpler band-structured form by means of a high-performance compute-intensive procedure. This is followed by a memory-intensive but low-cost Krylov iteration, which is off-loaded to be computed on the GPU by means of an efficient data-parallel kernel. The experimental results reveal the performance of the new eigensolver. Concretely, when applied to the simulation of macromolecules with a few thousands degrees of freedom and the number of eigenpairs to be computed is small to moderate, the new solver outperforms other methods implemented as part of high-performance numerical linear algebra packages for multithreaded architectures.

  3. Multi-Core Processor Memory Contention Benchmark Analysis Case Study

    Science.gov (United States)

    Simon, Tyler; McGalliard, James

    2009-01-01

    Multi-core processors dominate current mainframe, server, and high performance computing (HPC) systems. This paper provides synthetic kernel and natural benchmark results from an HPC system at the NASA Goddard Space Flight Center that illustrate the performance impacts of multi-core (dual- and quad-core) vs. single core processor systems. Analysis of processor design, application source code, and synthetic and natural test results all indicate that multi-core processors can suffer from significant memory subsystem contention compared to similar single-core processors.

  4. On the effective parallel programming of multi-core processors

    NARCIS (Netherlands)

    Varbanescu, A.L.

    2010-01-01

    Multi-core processors are considered now the only feasible alternative to the large single-core processors which have become limited by technological aspects such as power consumption and heat dissipation. However, due to their inherent parallel structure and their diversity, multi-cores are

  5. A High Performance Multi-Core FPGA Implementation for 2D Pixel Clustering for the ATLAS Fast TracKer (FTK) Processor

    CERN Document Server

    Sotiropoulou, C-L; The ATLAS collaboration; Beretta, M; Gkaitatzis, S; Kordas, K; Nikolaidis, S; Petridou, C; Volpi, G

    2014-01-01

    The high performance multi-core 2D pixel clustering FPGA implementation used for the input system of the ATLAS Fast TracKer (FTK) processor is presented. The input system for the FTK processor will receive data from the Pixel and micro-strip detectors read out drivers (RODs) at 760Gbps, the full rate of level 1 triggers. Clustering is required as a method to reduce the high rate of the received data before further processing, as well as to determine the cluster centroid for obtaining obtain the best spatial measurement. Our implementation targets the pixel detectors and uses a 2D-clustering algorithm that takes advantage of a moving window technique to minimize the logic required for cluster identification. The design is fully generic and the cluster detection window size can be adjusted for optimizing the cluster identification process. Τhe implementation can be parallelized by instantiating multiple cores to identify different clusters independently thus exploiting more FPGA resources. This flexibility mak...

  6. Performance evaluation of throughput computing workloads using multi-core processors and graphics processors

    Science.gov (United States)

    Dave, Gaurav P.; Sureshkumar, N.; Blessy Trencia Lincy, S. S.

    2017-11-01

    Current trend in processor manufacturing focuses on multi-core architectures rather than increasing the clock speed for performance improvement. Graphic processors have become as commodity hardware for providing fast co-processing in computer systems. Developments in IoT, social networking web applications, big data created huge demand for data processing activities and such kind of throughput intensive applications inherently contains data level parallelism which is more suited for SIMD architecture based GPU. This paper reviews the architectural aspects of multi/many core processors and graphics processors. Different case studies are taken to compare performance of throughput computing applications using shared memory programming in OpenMP and CUDA API based programming.

  7. Concurrent and Accurate Short Read Mapping on Multicore Processors.

    Science.gov (United States)

    Martínez, Héctor; Tárraga, Joaquín; Medina, Ignacio; Barrachina, Sergio; Castillo, Maribel; Dopazo, Joaquín; Quintana-Ortí, Enrique S

    2015-01-01

    We introduce a parallel aligner with a work-flow organization for fast and accurate mapping of RNA sequences on servers equipped with multicore processors. Our software, HPG Aligner SA (HPG Aligner SA is an open-source application. The software is available at http://www.opencb.org, exploits a suffix array to rapidly map a large fraction of the RNA fragments (reads), as well as leverages the accuracy of the Smith-Waterman algorithm to deal with conflictive reads. The aligner is enhanced with a careful strategy to detect splice junctions based on an adaptive division of RNA reads into small segments (or seeds), which are then mapped onto a number of candidate alignment locations, providing crucial information for the successful alignment of the complete reads. The experimental results on a platform with Intel multicore technology report the parallel performance of HPG Aligner SA, on RNA reads of 100-400 nucleotides, which excels in execution time/sensitivity to state-of-the-art aligners such as TopHat 2+Bowtie 2, MapSplice, and STAR.

  8. Interactive high-resolution isosurface ray casting on multicore processors.

    Science.gov (United States)

    Wang, Qin; JaJa, Joseph

    2008-01-01

    We present a new method for the interactive rendering of isosurfaces using ray casting on multi-core processors. This method consists of a combination of an object-order traversal that coarsely identifies possible candidate 3D data blocks for each small set of contiguous pixels, and an isosurface ray casting strategy tailored for the resulting limited-size lists of candidate 3D data blocks. While static screen partitioning is widely used in the literature, our scheme performs dynamic allocation of groups of ray casting tasks to ensure almost equal loads among the different threads running on multi-cores while maintaining spatial locality. We also make careful use of memory management environment commonly present in multi-core processors. We test our system on a two-processor Clovertown platform, each consisting of a Quad-Core 1.86-GHz Intel Xeon Processor, for a number of widely different benchmarks. The detailed experimental results show that our system is efficient and scalable, and achieves high cache performance and excellent load balancing, resulting in an overall performance that is superior to any of the previous algorithms. In fact, we achieve an interactive isosurface rendering on a 1024(2) screen for all the datasets tested up to the maximum size of the main memory of our platform.

  9. Hardware Synchronization for Embedded Multi-Core Processors

    DEFF Research Database (Denmark)

    Stoif, Christian; Schoeberl, Martin; Liccardi, Benito

    2011-01-01

    Multi-core processors are about to conquer embedded systems — it is not the question of whether they are coming but how the architectures of the microcontrollers should look with respect to the strict requirements in the field. We present the step from one to multiple cores in this paper, establi......Multi-core processors are about to conquer embedded systems — it is not the question of whether they are coming but how the architectures of the microcontrollers should look with respect to the strict requirements in the field. We present the step from one to multiple cores in this paper...

  10. State-based Communication on Time-predictable Multicore Processors

    DEFF Research Database (Denmark)

    Sørensen, Rasmus Bo; Schoeberl, Martin; Sparsø, Jens

    2016-01-01

    Some real-time systems use a form of task-to-task communication called state-based or sample-based communication that does not impose any flow control among the communicating tasks. The concept is similar to a shared variable, where a reader may read the same value multiple times or may not read...... a given value at all. This paper explores time-predictable implementations of state-based communication in network-on-chip based multicore platforms through five algorithms. With the presented analysis of the implemented algorithms, the communicating tasks of one core can be scheduled independently...... of tasks on other cores. Assuming a specific time-predictable multicore processor, we evaluate how the read and write primitives of the five algorithms contribute to the worst-case execution time of the communicating tasks. Each of the five algorithms has specific capabilities that make them suitable...

  11. Real-Time Adaptive Lossless Hyperspectral Image Compression using CCSDS on Parallel GPGPU and Multicore Processor Systems

    Science.gov (United States)

    Hopson, Ben; Benkrid, Khaled; Keymeulen, Didier; Aranki, Nazeeh; Klimesh, Matt; Kiely, Aaron

    2012-01-01

    The proposed CCSDS (Consultative Committee for Space Data Systems) Lossless Hyperspectral Image Compression Algorithm was designed to facilitate a fast hardware implementation. This paper analyses that algorithm with regard to available parallelism and describes fast parallel implementations in software for GPGPU and Multicore CPU architectures. We show that careful software implementation, using hardware acceleration in the form of GPGPUs or even just multicore processors, can exceed the performance of existing hardware and software implementations by up to 11x and break the real-time barrier for the first time for a typical test application.

  12. Application of Advanced Multi-Core Processor Technologies to Oceanographic Research

    Science.gov (United States)

    2013-09-30

    1 DISTRIBUTION STATEMENT A. Approved for public release; distribution is unlimited. Application of Advanced Multi-Core Processor Technologies...STM32 NXP LPC series No Proprietary Microchip PIC32/DSPIC No > 500 mW; < 5 W ARM Cortex TI OMAP TI Sitara Broadcom BCM2835 Varies FPGA...state-of-the-art information processing architectures. OBJECTIVES Next-generation processor architectures (multi-core, multi-threaded) hold the

  13. Network Coding on Heterogeneous Multi-Core Processors for Wireless Sensor Networks

    Science.gov (United States)

    Kim, Deokho; Park, Karam; Ro, Won W.

    2011-01-01

    While network coding is well known for its efficiency and usefulness in wireless sensor networks, the excessive costs associated with decoding computation and complexity still hinder its adoption into practical use. On the other hand, high-performance microprocessors with heterogeneous multi-cores would be used as processing nodes of the wireless sensor networks in the near future. To this end, this paper introduces an efficient network coding algorithm developed for the heterogenous multi-core processors. The proposed idea is fully tested on one of the currently available heterogeneous multi-core processors referred to as the Cell Broadband Engine. PMID:22164053

  14. A lock circuit for a multi-core processor

    DEFF Research Database (Denmark)

    2015-01-01

    An integrated circuit comprising a multiple processor cores and a lock circuit that comprises a queue register with respective bits set or reset via respective, connections dedicated to respective processor cores, whereby the queue register identifies those among the multiple processor cores...... that are enqueued in the queue register. Furthermore, the integrated circuit comprises a current register and a selector circuit configured to select a processor core and identify that processor core by a value in the current register. A selected processor core is a prioritized processor core among the cores...... configured with an integrated circuit; and a silicon die configured with an integrated circuit....

  15. Scalable Parallelization of Skyline Computation for Multi-core Processors

    DEFF Research Database (Denmark)

    Chester, Sean; Sidlauskas, Darius; Assent, Ira

    2015-01-01

    The skyline is an important query operator for multi-criteria decision making. It reduces a dataset to only those points that offer optimal trade-offs of dimensions. In general, it is very expensive to compute. Recently, multi-core CPU algorithms have been proposed to accelerate the computation...... of the skyline. However, they do not sufficiently minimize dominance tests and so are not competitive with state-of-the-art sequential algorithms. In this paper, we introduce a novel multi-core skyline algorithm, Hybrid, which processes points in blocks. It maintains a shared, global skyline among all threads...

  16. Fast processor for dilepton triggers

    International Nuclear Information System (INIS)

    Katsanevas, S.; Kostarakis, P.; Baltrusaitis, R.

    1983-01-01

    We describe a fast trigger processor, developed for and used in Fermilab experiment E-537, for selecting high-mass dimuon events produced by negative pions and anti-protons. The processor finds candidate tracks by matching hit information received from drift chambers and scintillation counters, and determines their momenta. Invariant masses are calculated for all possible pairs of tracks and an event is accepted if any invariant mass is greater than some preselectable minimum mass. The whole process, accomplished within 5 to 10 microseconds, achieves up to a ten-fold reduction in trigger rate

  17. Architecture-Aware Optimization of an HEVC decoder on Asymmetric Multicore Processors

    OpenAIRE

    Rodríguez-Sánchez, Rafael; Quintana-Ortí, Enrique S.

    2016-01-01

    Low-power asymmetric multicore processors (AMPs) attract considerable attention due to their appealing performance-power ratio for energy-constrained environments. However, these processors pose a significant programming challenge due to the integration of cores with different performance capabilities, asking for an asymmetry-aware scheduling solution that carefully distributes the workload. The recent HEVC standard, which offers several high-level parallelization strategies, is an important ...

  18. Mathematical Methods and Algorithms of Mobile Parallel Computing on the Base of Multi-core Processors

    Directory of Open Access Journals (Sweden)

    Alexander B. Bakulev

    2012-11-01

    Full Text Available This article deals with mathematical models and algorithms, providing mobility of sequential programs parallel representation on the high-level language, presents formal model of operation environment processes management, based on the proposed model of programs parallel representation, presenting computation process on the base of multi-core processors.

  19. Pixel detector bias supply and control using embedded multicore processors

    CERN Document Server

    AUTHOR|(CDS)2099144; Akram Alomainy

    The aim of the project is to create a software controlled, open source, low footprint and low power high voltage bias supply and current monitor for a pixelated radiation sensor. The solution is based on the LT3905 integrated circuit and the multi-core XMOS xCore 200 microcontroller and it is intended to be used in a battery powered, mobile platform for educational settings.

  20. Tinuso: A processor architecture for a multi-core hardware simulation platform

    DEFF Research Database (Denmark)

    Schleuniger, Pascal; Karlsson, Sven

    2010-01-01

    Multi-core systems have the potential to improve performance, energy and cost properties of embedded systems but also require new design methods and tools to take advantage of the new architectures. Due to the limited accuracy and performance of pure software simulators, we are working on a cycle...... accurate hardware simulation platform. We have developed the Tinuso processor architecture for this platform. Tinuso is a processor architecture optimized for FPGA implementation. The instruction set makes use of predicated instructions and supports C/C++ and assembly language programming. It is designed...... to be easy extendable to maintain the exibility required for the research on multi-core systems. Tinuso contains a co-processor interface to connect to a network interface. This interface allow for communication over an on-chip network. A clock frequency estimation study on a deeply pipelined Tinuso...

  1. Support for the Logical Execution Time Model on a Time-predictable Multicore Processor

    DEFF Research Database (Denmark)

    Kluge, Florian; Schoeberl, Martin; Ungerer, Theo

    2016-01-01

    The logical execution time (LET) model increases the compositionality of real-time task sets. Removal or addition of tasks does not influence the communication behavior of other tasks. In this work, we extend a multicore operating system running on a time-predictable multicore processor to support...... the LET model. For communication between tasks we use message passing on a time-predictable network-on-chip to avoid the bottleneck of shared memory. We report our experiences and present results on the costs in terms of memory and execution time....

  2. Timing organization of a real-time multicore processor

    DEFF Research Database (Denmark)

    Schoeberl, Martin; Sparsø, Jens

    2017-01-01

    Real-time systems need a time-predictable computing platform. Computation, communication, and access to shared resources needs to be time-predictable. We use time division multiplexing to statically schedule all computation and communication resources, such as access to main memory or message...... passing over a network-on-chip. We use time-driven communication over an asynchronous network-on-chip to enable time division multiplexing even in a globally asynchronous, locally synchronous multicore architecture. Using time division multiplexing at all levels of the architecture yields in a time...

  3. Multi-Threaded Dense Linear Algebra Libraries for Low-Power Asymmetric Multicore Processors

    OpenAIRE

    Catalán, Sandra; Herrero, José R.; Igual, Francisco D.; Rodríguez-Sánchez, Rafael; Quintana-Ortí, Enrique S.

    2015-01-01

    Dense linear algebra libraries, such as BLAS and LAPACK, provide a relevant collection of numerical tools for many scientific and engineering applications. While there exist high performance implementations of the BLAS (and LAPACK) functionality for many current multi-threaded architectures,the adaption of these libraries for asymmetric multicore processors (AMPs)is still pending. In this paper we address this challenge by developing an asymmetry-aware implementation of the BLAS, based on the...

  4. Architecture-Aware Configuration and Scheduling of Matrix Multiplication on Asymmetric Multicore Processors

    OpenAIRE

    Catalán, Sandra; Igual, Francisco D.; Mayo, Rafael; Rodríguez-Sánchez, Rafael; Quintana-Ortí, Enrique S.

    2015-01-01

    Asymmetric multicore processors (AMPs) have recently emerged as an appealing technology for severely energy-constrained environments, especially in mobile appliances where heterogeneity in applications is mainstream. In addition, given the growing interest for low-power high performance computing, this type of architectures is also being investigated as a means to improve the throughput-per-Watt of complex scientific applications. In this paper, we design and embed several architecture-aware ...

  5. Climbing Mont Blanc - A Training Site for Energy Efficient Programming on Heterogeneous Multicore Processors

    OpenAIRE

    Natvig, Lasse; Follan, Torbjørn; Støa, Simen; Magnussen, Sindre; Guirado, Antonio Garcia

    2015-01-01

    Climbing Mont Blanc (CMB) is an open online judge used for training in energy efficient programming of state-of-the-art heterogeneous multicores. It uses an Odroid-XU3 board from Hardkernel with an Exynos Octa processor and integrated power sensors. This processor is three-way heterogeneous containing 14 different cores of three different types. The board currently accepts C and C++ programs, with support for OpenCL v1.1, OpenMP 4.0 and Pthreads. Programs submitted using the graphical user in...

  6. A Workload-Adaptive and Reconfigurable Bus Architecture for Multicore Processors

    Directory of Open Access Journals (Sweden)

    Shoaib Akram

    2010-01-01

    Full Text Available Interconnection networks for multicore processors are traditionally designed to serve a diversity of workloads. However, different workloads or even different execution phases of the same workload may benefit from different interconnect configurations. In this paper, we first motivate the need for workload-adaptive interconnection networks. Subsequently, we describe an interconnection network framework based on reconfigurable switches for use in medium-scale (up to 32 cores shared memory multicore processors. Our cost-effective reconfigurable interconnection network is implemented on a traditional shared bus interconnect with snoopy-based coherence, and it enables improved multicore performance. The proposed interconnect architecture distributes the cores of the processor into clusters with reconfigurable logic between clusters to support workload-adaptive policies for inter-cluster communication. Our interconnection scheme is complemented by interconnect-aware scheduling and additional interconnect optimizations which help boost the performance of multiprogramming and multithreaded workloads. We provide experimental results that show that the overall throughput of multiprogramming workloads (consisting of two and four programs can be improved by up to 60% with our configurable bus architecture. Similar gains can be achieved also for multithreaded applications as shown by further experiments. Finally, we present the performance sensitivity of the proposed interconnect architecture on shared memory bandwidth availability.

  7. Efficient Programming for Multicore Processor Heterogeneity: OpenMP versus OmpSs

    OpenAIRE

    Butko , Anastasiia; Bruguier , Florent; Gamatié , Abdoulaye; Sassatelli , Gilles

    2017-01-01

    International audience; ARM single-ISA heterogeneous multicore processors combine high-performance big cores with power-efficient small cores. They aim at achieving a suitable balance between performance and energy. How- ever, a main challenge is to program such architectures so as to efficiently exploit their features. In this paper, we study the impact on performance and energy trade-offs of single-ISA architecture according to OpenMP 3.0 and the OmpSs programming models. We consider differ...

  8. Self-powered information measuring wireless networks using the distribution of tasks within multicore processors

    Science.gov (United States)

    Zhuravska, Iryna M.; Koretska, Oleksandra O.; Musiyenko, Maksym P.; Surtel, Wojciech; Assembay, Azat; Kovalev, Vladimir; Tleshova, Akmaral

    2017-08-01

    The article contains basic approaches to develop the self-powered information measuring wireless networks (SPIM-WN) using the distribution of tasks within multicore processors critical applying based on the interaction of movable components - as in the direction of data transmission as wireless transfer of energy coming from polymetric sensors. Base mathematic model of scheduling tasks within multiprocessor systems was modernized to schedule and allocate tasks between cores of one-crystal computer (SoC) to increase energy efficiency SPIM-WN objects.

  9. ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors

    DEFF Research Database (Denmark)

    Liu, Weifeng; Vinter, Brian

    2014-01-01

    and its child nodes must be executed sequentially, and (2) heaps, even d-heaps (d-ary heaps or d-way heaps), cannot supply enough wide data parallelism to these processors. Recent research proposed more versatile asymmetric multicore processors (AMPs) that consist of two types of cores (latency......-oriented cores with high single-thread performance and throughput-oriented cores with wide vector processing capability), unified memory address space and faster synchronization mechanism among cores with different ISAs. To leverage the AMPs for the heap data structure, in this paper we propose ad......-heap, an efficient heap data structure that introduces an implicit bridge structure and properly apportions workloads to the two types of cores. We implement a batch k-selection algorithm and conduct experiments on simulated AMP environments composed of real CPUs and GPUs. In our experiments on two representative...

  10. RTEMS SMP and MTAPI for Efficient Multi-Core Space Applications on LEON3/LEON4 Processors

    Science.gov (United States)

    Cederman, Daniel; Hellstrom, Daniel; Sherrill, Joel; Bloom, Gedare; Patte, Mathieu; Zulianello, Marco

    2015-09-01

    This paper presents the final result of an European Space Agency (ESA) activity aimed at improving the software support for LEON processors used in SMP configurations. One of the benefits of using a multicore system in a SMP configuration is that in many instances it is possible to better utilize the available processing resources by load balancing between cores. This however comes with the cost of having to synchronize operations between cores, leading to increased complexity. While in an AMP system one can use multiple instances of operating systems that are only uni-processor capable, a SMP system requires the operating system to be written to support multicore systems. In this activity we have improved and extended the SMP support of the RTEMS real-time operating system and ensured that it fully supports the multicore capable LEON processors. The targeted hardware in the activity has been the GR712RC, a dual-core core LEON3FT processor, and the functional prototype of ESA's Next Generation Multiprocessor (NGMP), a quad core LEON4 processor. The final version of the NGMP is now available as a product under the name GR740. An implementation of the Multicore Task Management API (MTAPI) has been developed as part of this activity to aid in the parallelization of applications for RTEMS SMP. It allows for simplified development of parallel applications using the task-based programming model. An existing space application, the Gaia Video Processing Unit, has been ported to RTEMS SMP using the MTAPI implementation to demonstrate the feasibility and usefulness of multicore processors for space payload software. The activity is funded by ESA under contract 4000108560/13/NL/JK. Gedare Bloom is supported in part by NSF CNS-0934725.

  11. On developing B-spline registration algorithms for multi-core processors

    International Nuclear Information System (INIS)

    Shackleford, J A; Kandasamy, N; Sharp, G C

    2010-01-01

    Spline-based deformable registration methods are quite popular within the medical-imaging community due to their flexibility and robustness. However, they require a large amount of computing time to obtain adequate results. This paper makes two contributions towards accelerating B-spline-based registration. First, we propose a grid-alignment scheme and associated data structures that greatly reduce the complexity of the registration algorithm. Based on this grid-alignment scheme, we then develop highly data parallel designs for B-spline registration within the stream-processing model, suitable for implementation on multi-core processors such as graphics processing units (GPUs). Particular attention is focused on an optimal method for performing analytic gradient computations in a data parallel fashion. CPU and GPU versions are validated for execution time and registration quality. Performance results on large images show that our GPU algorithm achieves a speedup of 15 times over the single-threaded CPU implementation whereas our multi-core CPU algorithm achieves a speedup of 8 times over the single-threaded implementation. The CPU and GPU versions achieve near-identical registration quality in terms of RMS differences between the generated vector fields.

  12. A Scalable Multicore Architecture With Heterogeneous Memory Structures for Dynamic Neuromorphic Asynchronous Processors (DYNAPs).

    Science.gov (United States)

    Moradi, Saber; Qiao, Ning; Stefanini, Fabio; Indiveri, Giacomo

    2018-02-01

    Neuromorphic computing systems comprise networks of neurons that use asynchronous events for both computation and communication. This type of representation offers several advantages in terms of bandwidth and power consumption in neuromorphic electronic systems. However, managing the traffic of asynchronous events in large scale systems is a daunting task, both in terms of circuit complexity and memory requirements. Here, we present a novel routing methodology that employs both hierarchical and mesh routing strategies and combines heterogeneous memory structures for minimizing both memory requirements and latency, while maximizing programming flexibility to support a wide range of event-based neural network architectures, through parameter configuration. We validated the proposed scheme in a prototype multicore neuromorphic processor chip that employs hybrid analog/digital circuits for emulating synapse and neuron dynamics together with asynchronous digital circuits for managing the address-event traffic. We present a theoretical analysis of the proposed connectivity scheme, describe the methods and circuits used to implement such scheme, and characterize the prototype chip. Finally, we demonstrate the use of the neuromorphic processor with a convolutional neural network for the real-time classification of visual symbols being flashed to a dynamic vision sensor (DVS) at high speed.

  13. Power-Energy Simulation for Multi-Core Processors in Bench-marking

    Directory of Open Access Journals (Sweden)

    Mona A. Abou-Of

    2017-01-01

    Full Text Available At Microarchitectural level, multi-core processor, as a complex System on Chip, has sophisticated on-chip components including cores, shared caches, interconnects and system controllers such as memory and ethernet controllers. At technological level, architects should consider the device types forecast in the International Technology Roadmap for Semiconductors (ITRS. Energy simulation enables architects to study two important metrics simultaneously. Timing is a key element of the CPU performance that imposes constraints on the CPU target clock frequency. Power and the resulting heat impose more severe design constraints, such as core clustering, while semiconductor industry is providing more transistors in the die area in pace with Moore’s law. Energy simulators provide a solution for such serious challenge. Energy is modelled either by combining performance benchmarking tool with a power simulator or by an integrated framework of both performance simulator and power profiling system. This article presents and asses trade-offs between different architectures using four cores battery-powered mobile systems by running a custom-made and a standard benchmark tools. The experimental results assure the Energy/ Frequency convexity rule over a range of frequency settings on different number of enabled cores. The reported results show that increasing the number of cores has a great effect on increasing the power consumption. However, a minimum energy dissipation will occur at a lower frequency which reduces the power consumption. Despite that, increasing the number of cores will also increase the effective cores value which will reflect a better processor performance.

  14. MULTI-CORE AND OPTICAL PROCESSOR RELATED APPLICATIONS RESEARCH AT OAK RIDGE NATIONAL LABORATORY

    Energy Technology Data Exchange (ETDEWEB)

    Barhen, Jacob [ORNL; Kerekes, Ryan A [ORNL; ST Charles, Jesse Lee [ORNL; Buckner, Mark A [ORNL

    2008-01-01

    High-speed parallelization of common tasks holds great promise as a low-risk approach to achieving the significant increases in signal processing and computational performance required for next generation innovations in reconfigurable radio systems. Researchers at the Oak Ridge National Laboratory have been working on exploiting the parallelization offered by this emerging technology and applying it to a variety of problems. This paper will highlight recent experience with four different parallel processors applied to signal processing tasks that are directly relevant to signal processing required for SDR/CR waveforms. The first is the EnLight Optical Core Processor applied to matched filter (MF) correlation processing via fast Fourier transform (FFT) of broadband Dopplersensitive waveforms (DSW) using active sonar arrays for target tracking. The second is the IBM CELL Broadband Engine applied to 2-D discrete Fourier transform (DFT) kernel for image processing and frequency domain processing. And the third is the NVIDIA graphical processor applied to document feature clustering. EnLight Optical Core Processor. Optical processing is inherently capable of high-parallelism that can be translated to very high performance, low power dissipation computing. The EnLight 256 is a small form factor signal processing chip (5x5 cm2) with a digital optical core that is being developed by an Israeli startup company. As part of its evaluation of foreign technology, ORNL's Center for Engineering Science Advanced Research (CESAR) had access to a precursor EnLight 64 Alpha hardware for a preliminary assessment of capabilities in terms of large Fourier transforms for matched filter banks and on applications related to Doppler-sensitive waveforms. This processor is optimized for array operations, which it performs in fixed-point arithmetic at the rate of 16 TeraOPS at 8-bit precision. This is approximately 1000 times faster than the fastest DSP available today. The optical core

  15. MULTI-CORE AND OPTICAL PROCESSOR RELATED APPLICATIONS RESEARCH AT OAK RIDGE NATIONAL LABORATORY

    International Nuclear Information System (INIS)

    Barhen, Jacob; Kerekes, Ryan A.; St Charles, Jesse Lee; Buckner, Mark A.

    2008-01-01

    High-speed parallelization of common tasks holds great promise as a low-risk approach to achieving the significant increases in signal processing and computational performance required for next generation innovations in reconfigurable radio systems. Researchers at the Oak Ridge National Laboratory have been working on exploiting the parallelization offered by this emerging technology and applying it to a variety of problems. This paper will highlight recent experience with four different parallel processors applied to signal processing tasks that are directly relevant to signal processing required for SDR/CR waveforms. The first is the EnLight Optical Core Processor applied to matched filter (MF) correlation processing via fast Fourier transform (FFT) of broadband Dopplersensitive waveforms (DSW) using active sonar arrays for target tracking. The second is the IBM CELL Broadband Engine applied to 2-D discrete Fourier transform (DFT) kernel for image processing and frequency domain processing. And the third is the NVIDIA graphical processor applied to document feature clustering. EnLight Optical Core Processor. Optical processing is inherently capable of high-parallelism that can be translated to very high performance, low power dissipation computing. The EnLight 256 is a small form factor signal processing chip (5x5 cm2) with a digital optical core that is being developed by an Israeli startup company. As part of its evaluation of foreign technology, ORNL's Center for Engineering Science Advanced Research (CESAR) had access to a precursor EnLight 64 Alpha hardware for a preliminary assessment of capabilities in terms of large Fourier transforms for matched filter banks and on applications related to Doppler-sensitive waveforms. This processor is optimized for array operations, which it performs in fixed-point arithmetic at the rate of 16 TeraOPS at 8-bit precision. This is approximately 1000 times faster than the fastest DSP available today. The optical core

  16. Computational Particle Dynamic Simulations on Multicore Processors (CPDMu) Final Report Phase I

    Energy Technology Data Exchange (ETDEWEB)

    Schmalz, Mark S

    2011-07-24

    Statement of Problem - Department of Energy has many legacy codes for simulation of computational particle dynamics and computational fluid dynamics applications that are designed to run on sequential processors and are not easily parallelized. Emerging high-performance computing architectures employ massively parallel multicore architectures (e.g., graphics processing units) to increase throughput. Parallelization of legacy simulation codes is a high priority, to achieve compatibility, efficiency, accuracy, and extensibility. General Statement of Solution - A legacy simulation application designed for implementation on mainly-sequential processors has been represented as a graph G. Mathematical transformations, applied to G, produce a graph representation {und G} for a high-performance architecture. Key computational and data movement kernels of the application were analyzed/optimized for parallel execution using the mapping G {yields} {und G}, which can be performed semi-automatically. This approach is widely applicable to many types of high-performance computing systems, such as graphics processing units or clusters comprised of nodes that contain one or more such units. Phase I Accomplishments - Phase I research decomposed/profiled computational particle dynamics simulation code for rocket fuel combustion into low and high computational cost regions (respectively, mainly sequential and mainly parallel kernels), with analysis of space and time complexity. Using the research team's expertise in algorithm-to-architecture mappings, the high-cost kernels were transformed, parallelized, and implemented on Nvidia Fermi GPUs. Measured speedups (GPU with respect to single-core CPU) were approximately 20-32X for realistic model parameters, without final optimization. Error analysis showed no loss of computational accuracy. Commercial Applications and Other Benefits - The proposed research will constitute a breakthrough in solution of problems related to efficient

  17. Integral Fast Reactor fuel pin processor

    International Nuclear Information System (INIS)

    Levinskas, D.

    1993-01-01

    This report discusses the pin processor which receives metal alloy pins cast from recycled Integral Fast Reactor (IFR) fuel and prepares them for assembly into new IFR fuel elements. Either full length as-cast or precut pins are fed to the machine from a magazine, cut if necessary, and measured for length, weight, diameter and deviation from straightness. Accepted pins are loaded into cladding jackets located in a magazine, while rejects and cutting scraps are separated into trays. The magazines, trays, and the individual modules that perform the different machine functions are assembled and removed using remote manipulators and master-slaves

  18. 3D Seismic Imaging through Reverse-Time Migration on Homogeneous and Heterogeneous Multi-Core Processors

    Directory of Open Access Journals (Sweden)

    Mauricio Araya-Polo

    2009-01-01

    Full Text Available Reverse-Time Migration (RTM is a state-of-the-art technique in seismic acoustic imaging, because of the quality and integrity of the images it provides. Oil and gas companies trust RTM with crucial decisions on multi-million-dollar drilling investments. But RTM requires vastly more computational power than its predecessor techniques, and this has somewhat hindered its practical success. On the other hand, despite multi-core architectures promise to deliver unprecedented computational power, little attention has been devoted to mapping efficiently RTM to multi-cores. In this paper, we present a mapping of the RTM computational kernel to the IBM Cell/B.E. processor that reaches close-to-optimal performance. The kernel proves to be memory-bound and it achieves a 98% utilization of the peak memory bandwidth. Our Cell/B.E. implementation outperforms a traditional processor (PowerPC 970MP in terms of performance (with an 15.0× speedup and energy-efficiency (with a 10.0× increase in the GFlops/W delivered. Also, it is the fastest RTM implementation available to the best of our knowledge. These results increase the practical usability of RTM. Also, the RTM-Cell/B.E. combination proves to be a strong competitor in the seismic arena.

  19. The ATLAS fast tracker processor design

    CERN Document Server

    Volpi, Guido; Albicocco, Pietro; Alison, John; Ancu, Lucian Stefan; Anderson, James; Andari, Nansi; Andreani, Alessandro; Andreazza, Attilio; Annovi, Alberto; Antonelli, Mario; Asbah, Needa; Atkinson, Markus; Baines, J; Barberio, Elisabetta; Beccherle, Roberto; Beretta, Matteo; Biesuz, Nicolo Vladi; Blair, R E; Bogdan, Mircea; Boveia, Antonio; Britzger, Daniel; Bryant, Partick; Burghgrave, Blake; Calderini, Giovanni; Camplani, Alessandra; Cavaliere, Viviana; Cavasinni, Vincenzo; Chakraborty, Dhiman; Chang, Philip; Cheng, Yangyang; Citraro, Saverio; Citterio, Mauro; Crescioli, Francesco; Dawe, Noel; Dell'Orso, Mauro; Donati, Simone; Dondero, Paolo; Drake, G; Gadomski, Szymon; Gatta, Mauro; Gentsos, Christos; Giannetti, Paola; Gkaitatzis, Stamatios; Gramling, Johanna; Howarth, James William; Iizawa, Tomoya; Ilic, Nikolina; Jiang, Zihao; Kaji, Toshiaki; Kasten, Michael; Kawaguchi, Yoshimasa; Kim, Young Kee; Kimura, Naoki; Klimkovich, Tatsiana; Kolb, Mathis; Kordas, K; Krizka, Karol; Kubota, T; Lanza, Agostino; Li, Ho Ling; Liberali, Valentino; Lisovyi, Mykhailo; Liu, Lulu; Love, Jeremy; Luciano, Pierluigi; Luongo, Carmela; Magalotti, Daniel; Maznas, Ioannis; Meroni, Chiara; Mitani, Takashi; Nasimi, Hikmat; Negri, Andrea; Neroutsos, Panos; Neubauer, Mark; Nikolaidis, Spiridon; Okumura, Y; Pandini, Carlo; Petridou, Chariclia; Piendibene, Marco; Proudfoot, James; Rados, Petar Kevin; Roda, Chiara; Rossi, Enrico; Sakurai, Yuki; Sampsonidis, Dimitrios; Saxon, James; Schmitt, Stefan; Schoening, Andre; Shochet, Mel; Shoijaii, Jafar; Soltveit, Hans Kristian; Sotiropoulou, Calliope-Louisa; Stabile, Alberto; Swiatlowski, Maximilian J; Tang, Fukun; Taylor, Pierre Thor Elliot; Testa, Marianna; Tompkins, Lauren; Vercesi, V; Wang, Rui; Watari, Ryutaro; Zhang, Jianhong; Zeng, Jian Cong; Zou, Rui; Bertolucci, Federico

    2015-01-01

    The extended use of tracking information at the trigger level in the LHC is crucial for the trigger and data acquisition (TDAQ) system to fulfill its task. Precise and fast tracking is important to identify specific decay products of the Higgs boson or new phenomena, as well as to distinguish the contributions coming from the many collisions that occur at every bunch crossing. However, track reconstruction is among the most demanding tasks performed by the TDAQ computing farm; in fact, complete reconstruction at full Level-1 trigger accept rate (100 kHz) is not possible. In order to overcome this limitation, the ATLAS experiment is planning the installation of a dedicated processor, the Fast Tracker (FTK), which is aimed at achieving this goal. The FTK is a pipeline of high performance electronics, based on custom and commercial devices, which is expected to reconstruct, with high resolution, the trajectories of charged-particle tracks with a transverse momentum above 1 GeV, using the ATLAS inner tracker info...

  20. Fast Failure Recovery for Main-Memory DBMSs on Multicores

    OpenAIRE

    Wu, Yingjun; Guo, Wentian; Chan, Chee-Yong; Tan, Kian-Lee

    2016-01-01

    Main-memory database management systems (DBMS) can achieve excellent performance when processing massive volume of on-line transactions on modern multi-core machines. But existing durability schemes, namely, tuple-level and transaction-level logging-and-recovery mechanisms, either degrade the performance of transaction processing or slow down the process of failure recovery. In this paper, we show that, by exploiting application semantics, it is possible to achieve speedy failure recovery wit...

  1. NMR-MPar: A Fault-Tolerance Approach for Multi-Core and Many-Core Processors

    Directory of Open Access Journals (Sweden)

    Vanessa Vargas

    2018-03-01

    Full Text Available Multi-core and many-core processors are a promising solution to achieve high performance by maintaining a lower power consumption. However, the degree of miniaturization makes them more sensitive to soft-errors. To improve the system reliability, this work proposes a fault-tolerance approach based on redundancy and partitioning principles called N-Modular Redundancy and M-Partitions (NMR-MPar. By combining both principles, this approach allows multi-/many-core processors to perform critical functions in mixed-criticality systems. Benefiting from the capabilities of these devices, NMR-MPar creates different partitions that perform independent functions. For critical functions, it is proposed that N partitions with the same configuration participate of an N-modular redundancy system. In order to validate the approach, a case study is implemented on the KALRAY Multi-Purpose Processing Array (MPPA-256 many-core processor running two parallel benchmark applications. The traveling salesman problem and matrix multiplication applications were selected to test different device’s resources. The effectiveness of NMR-MPar is assessed by software-implemented fault-injection. For evaluation purposes, it is considered that the system is intended to be used in avionics. Results show the improvement of the application reliability by two orders of magnitude when implementing NMR-MPar on the system. Finally, this work opens the possibility to use massive parallelism for dependable applications in embedded systems.

  2. Fast track trigger processor for the OPAL detector at LEP

    Energy Technology Data Exchange (ETDEWEB)

    Carter, A A; Carter, J R; Ward, D R; Heuer, R D; Jaroslawski, S; Wagner, A

    1986-09-20

    A fast hardware track trigger processor being built for the OPAL experiment is described. The processor will analyse data from the central drift chambers of OPAL to determine whether any tracks come from the interaction region, and thereby eliminate background events. The processor will find tracks over a large angular range, vertical strokecos thetavertical stroke < or approx. 0.95. The design of the processor is described, together with a brief account of its hardware implementation for OPAL. The results of feasibility studies are also presented.

  3. Investigation of Large Scale Cortical Models on Clustered Multi-Core Processors

    Science.gov (United States)

    2013-02-01

    Playstation 3 with 6 available SPU cores outperforms the Intel Xeon processor (with 4 cores) by about 1.9 times for the HTM model and by 2.4 times...runtime breakdowns of the HTM and Dean models respectively on the Cell processor (on the Playstation 3) and the Intel Xeon processor ( 4 thread...YOUR FORM TO THE ABOVE ORGANIZATION. 1. REPORT DATE (DD-MM-YYYY) 2. REPORT TYPE 3. DATES COVERED (From - To) 4 . TITLE AND SUBTITLE 5a. CONTRACT NUMBER

  4. Numerical Solution of Diffusion Models in Biomedical Imaging on Multicore Processors

    Directory of Open Access Journals (Sweden)

    Luisa D'Amore

    2011-01-01

    Full Text Available In this paper, we consider nonlinear partial differential equations (PDEs of diffusion/advection type underlying most problems in image analysis. As case study, we address the segmentation of medical structures. We perform a comparative study of numerical algorithms arising from using the semi-implicit and the fully implicit discretization schemes. Comparison criteria take into account both the accuracy and the efficiency of the algorithms. As measure of accuracy, we consider the Hausdorff distance and the residuals of numerical solvers, while as measure of efficiency we consider convergence history, execution time, speedup, and parallel efficiency. This analysis is carried out in a multicore-based parallel computing environment.

  5. Memory-efficient optimization of Gyrokinetic particle-to-grid interpolation for multicore processors

    Energy Technology Data Exchange (ETDEWEB)

    Madduri, Kamesh [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Williams, Samuel [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Ethier, Stephane [Princeton Plasma Physics Lab. (PPPL), Princeton, NJ (United States); Oliker, Leonid [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Shalf, John [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Strohmaier, Erich [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Yelicky, Katherine [Univ. of California, Berkeley, CA (United States)

    2009-01-01

    We present multicore parallelization strategies for the particle-to-grid interpolation step in the Gyrokinetic Toroidal Code (GTC), a 3D particle-in-cell (PIC) application to study turbulent transport in magnetic-confinement fusion devices. Particle-grid interpolation is a known performance bottleneck in several PIC applications. In GTC, this step involves particles depositing charges to a 3D toroidal mesh, and multiple particles may contribute to the charge at a grid point. We design new parallel algorithms for the GTC charge deposition kernel, and analyze their performance on three leading multicore platforms. We implement thirteen different variants for this kernel and identify the best-performing ones given typical PIC parameters such as the grid size, number of particles per cell, and the GTC-specific particle Larmor radius variation. We find that our best strategies can be 2x faster than the reference optimized MPI implementation, and our analysis provides insight into desirable architectural features for high-performance PIC simulation codes.

  6. Matrix multiplication with a hypercube algorithm on multi-core processor cluster

    Directory of Open Access Journals (Sweden)

    José Crispín Zavala-Díaz

    2015-01-01

    Full Text Available Se analiza, modifica e implementa el algoritmo de multiplicación de matrices de Dekel, Nassimi y Sahani o hipercubo en un cluster de procesadores multi-core, donde el número de procesadores utilizado es menor al requerido por el algoritmo de n3. Se utilizan 23, 43 y 83 unidades procesadoras para multiplicar matrices de orden de magnitud de 10X10, 102X102 y 103X103. Los resultados del modelo matemático del algoritmo modificado y los obtenidos de la experimentación computacional muestran que es posible alcanzar rapidez y eficiencias paralelas aceptables, en función del número de unidades procesadoras utilizadas. También se muestra que la influencia del enlace externo de comunicación entre los nodos disminuye si se utiliza una combinación de los canales de comunicación disponibles entre los núcleos en un cluster multi-core.

  7. Parallel processor for fast event analysis

    International Nuclear Information System (INIS)

    Hensley, D.C.

    1983-01-01

    Current maximum data rates from the Spin Spectrometer of approx. 5000 events/s (up to 1.3 MBytes/s) and minimum analysis requiring at least 3000 operations/event require a CPU cycle time near 70 ns. In order to achieve an effective cycle time of 70 ns, a parallel processing device is proposed where up to 4 independent processors will be implemented in parallel. The individual processors are designed around the Am2910 Microsequencer, the AM29116 μP, and the Am29517 Multiplier. Satellite histogramming in a mass memory system will be managed by a commercial 16-bit μP system

  8. Early Student Support for Application of Advanced Multi-Core Processor Technologies to Oceanographic Research

    Science.gov (United States)

    2016-05-07

    REPORT DOCUMENTATION PAGE I . ... ... .. . ,...,.., ............. OMB No. 0704-0188 The public reporting burden for this collection of...Student Support for Appl ication of Advanced Multi- Core Processor N00014-12-1-0298 Technologies to Oceanographic Research Sb. GRANT NUMBER Sc...communications protocols (i.e. UART, I2C, and SPI), through the , ’ . handing off of the data to the server APis. By providing a common set of tools

  9. Fast data reconstructed method of Fourier transform imaging spectrometer based on multi-core CPU

    Science.gov (United States)

    Yu, Chunchao; Du, Debiao; Xia, Zongze; Song, Li; Zheng, Weijian; Yan, Min; Lei, Zhenggang

    2017-10-01

    Imaging spectrometer can gain two-dimensional space image and one-dimensional spectrum at the same time, which shows high utility in color and spectral measurements, the true color image synthesis, military reconnaissance and so on. In order to realize the fast reconstructed processing of the Fourier transform imaging spectrometer data, the paper designed the optimization reconstructed algorithm with OpenMP parallel calculating technology, which was further used for the optimization process for the HyperSpectral Imager of `HJ-1' Chinese satellite. The results show that the method based on multi-core parallel computing technology can control the multi-core CPU hardware resources competently and significantly enhance the calculation of the spectrum reconstruction processing efficiency. If the technology is applied to more cores workstation in parallel computing, it will be possible to complete Fourier transform imaging spectrometer real-time data processing with a single computer.

  10. How to harness the performance potential of current multi-core processors

    International Nuclear Information System (INIS)

    Jarp, Sverre; Lazzaro, Alfio; Leduc, Julien; Nowak, Andrzej

    2011-01-01

    Leakage currents have put a stop to the semiconductor industry's ability to increase processor frequency in order to enhance the performance of new microprocessors. Instead, we observe a slew of changes inside the micro-architecture with an aim of enhancing the performance. Several of these changes, however, do not translate into automatic speed improvements for the software. This paper discusses the increased complexity of modern microprocessors by separating out into dimensions each feature that impacts performance and mentions briefly ways of improving software, in particular that of the High Energy Physics community, to take full advantage.

  11. Recovery Act: Integrated DC-DC Conversion for Energy-Efficient Multicore Processors

    Energy Technology Data Exchange (ETDEWEB)

    Shepard, Kenneth L

    2013-03-31

    devices. These new approaches to scaled voltage regulation for computing devices also promise significant impact on electricity consumption in the United States and abroad by improving the efficiency of all computational platforms. In 2006, servers and datacenters in the United States consumed an estimated 61 billion kWh or about 1.5% of the nation's total energy consumption. Federal Government servers and data centers alone accounted for about 10 billion kWh, for a total annual energy cost of about $450 million. Based upon market growth and efficiency trends, estimates place current server and datacenter power consumption at nearly 85 billion kWh in the US and at almost 280 billion kWh worldwide. Similar estimates place national desktop, mobile and portable computing at 80 billion kWh combined. While national electricity utilization for computation amounts to only 4% of current usage, it is growing at a rate of about 10% a year with volume servers representing one of the largest growth segments due to the increasing utilization of cloud-based services. The percentage of power that is consumed by the processor in a server varies but can be as much as 30% of the total power utilization, with an additional 50% associated with heat removal. The approaches considered here should allow energy efficiency gains as high as 30% in processors for all computing platforms, from high-end servers to smart phones, resulting in a direct annual energy savings of almost 15 billion kWh nationally, and 50 billion kWh globally. The work developed here is being commercialized by the start-up venture, Ferric Semiconductor, which has already secured two Phase I SBIR grants to bring these technologies to the marketplace.

  12. A fast processor for di-lepton triggers

    CERN Document Server

    Kostarakis, P; Barsotti, E; Conetti, S; Cox, B; Enagonio, J; Haldeman, M; Haynes, W; Katsanevas, S; Kerns, C; Lebrun, P; Smith, H; Soszyniski, T; Stoffel, J; Treptow, K; Turkot, F; Wagner, R

    1981-01-01

    As a new application of the Fermilab ECL-CAMAC logic modules a fast trigger processor was developed for Fermilab experiment E-537, aiming to measure the higher mass di-muon production by antiprotons. The processor matches the hit information received from drift chambers and scintillation counters, to find candidate muon tracks and determine their directions and momenta. The tracks are then paired to compute an invariant mass: when the computed mass falls within the desired range, the event is accepted. The process is accomplished in times of 5 to 10 microseconds, while achieving a trigger rate reduction of up to a factor of ten. (5 refs).

  13. The fast tracker processor for hadron collider triggers

    CERN Document Server

    Annovi, A; Bardi, A; Carosi, R; Dell'Orso, Mauro; D'Onofrio, M; Giannetti, P; Iannaccone, G; Morsani, E; Pietri, M; Varotto, G

    2001-01-01

    Perspectives for precise and fast track reconstruction in future hadron collider experiments are addressed. We discuss the feasibility of a pipelined highly parallel processor dedicated to the implementation of a very fast tracking algorithm. The algorithm is based on the use of a large bank of pre-stored combinations of trajectory points, called patterns, for extremely complex tracking systems. The CMS experiment at LHC is used as a benchmark. Tracking data from the events selected by the level-1 trigger are sorted and filtered by the Fast Tracker processor at an input rate of 100 kHz. This data organization allows the level-2 trigger logic to reconstruct full resolution tracks with transverse momentum above a few GeV and search for secondary vertices within typical level-2 times. (15 refs).

  14. The fast tracker processor for hadronic collider triggers

    CERN Document Server

    Annovi, A; Bardi, A; Carosi, R; Dell'Orso, Mauro; D'Onofrio, M; Giannetti, P; Iannaccone, G; Morsani, F; Pietri, M; Varotto, G

    2000-01-01

    Perspective for precise and fast track reconstruction in future hadronic collider experiments are addressed. We discuss the feasibility of a pipelined highly parallelized processor dedicated to the implementation of a very fast algorithm. The algorithm is based on the use of a large bank of pre-stored combinations of trajectory points (patterns) for extremely complex tracking systems. The CMS experiment at LHC is used as a benchmark. Tracking data from the events selected by the level-1 trigger are sorted and filtered by the Fast Tracker processor at a rate of 100 kHz. This data organization allows the level-2 trigger logic to reconstruct full resolution traces with transverse momentum above few GeV and search secondary vertexes within typical level-2 times. 15 Refs.

  15. The Fast Tracker Real Time Processor

    CERN Document Server

    Annovi, A; The ATLAS collaboration

    2011-01-01

    As the LHC luminosity is ramped up to the SLHC Phase I level and beyond, the high rates, multiplicities, and energies of particles seen by the detectors will pose a unique challenge. Only a tiny fraction of the produced collisions can be stored on tape and immense real-time data reduction is needed. An effective trigger system must maintain high trigger efficiencies for the physics we are most interested in, and at the same time suppress the enormous QCD backgrounds. This requires massive computing power to minimize the online execution time of complex algorithms. A multi-level trigger is an effective solution for an otherwise impossible problem. The Fast Tracker (FTK)[1], is a proposed upgrade to the current ATLAS trigger system that will operate at full Level-1 output rates and provide high quality tracks reconstructed over the entire detector by the start of processing in Level-2. FTK solves the combinatorial challenge inherent to tracking by exploiting massive parallelism of associative memories [2] that ...

  16. Fundamentals of multicore software development

    CERN Document Server

    Pankratius, Victor; Tichy, Walter F

    2011-01-01

    With multicore processors now in every computer, server, and embedded device, the need for cost-effective, reliable parallel software has never been greater. By explaining key aspects of multicore programming, Fundamentals of Multicore Software Development helps software engineers understand parallel programming and master the multicore challenge. Accessible to newcomers to the field, the book captures the state of the art of multicore programming in computer science. It covers the fundamentals of multicore hardware, parallel design patterns, and parallel programming in C++, .NET, and Java. It

  17. A Fast CT Reconstruction Scheme for a General Multi-Core PC

    Directory of Open Access Journals (Sweden)

    Kai Zeng

    2007-01-01

    Full Text Available Expensive computational cost is a severe limitation in CT reconstruction for clinical applications that need real-time feedback. A primary example is bolus-chasing computed tomography (CT angiography (BCA that we have been developing for the past several years. To accelerate the reconstruction process using the filtered backprojection (FBP method, specialized hardware or graphics cards can be used. However, specialized hardware is expensive and not flexible. The graphics processing unit (GPU in a current graphic card can only reconstruct images in a reduced precision and is not easy to program. In this paper, an acceleration scheme is proposed based on a multi-core PC. In the proposed scheme, several techniques are integrated, including utilization of geometric symmetry, optimization of data structures, single-instruction multiple-data (SIMD processing, multithreaded computation, and an Intel C++ compilier. Our scheme maintains the original precision and involves no data exchange between the GPU and CPU. The merits of our scheme are demonstrated in numerical experiments against the traditional implementation. Our scheme achieves a speedup of about 40, which can be further improved by several folds using the latest quad-core processors.

  18. A fast CT reconstruction scheme for a general multi-core PC.

    Science.gov (United States)

    Zeng, Kai; Bai, Erwei; Wang, Ge

    2007-01-01

    Expensive computational cost is a severe limitation in CT reconstruction for clinical applications that need real-time feedback. A primary example is bolus-chasing computed tomography (CT) angiography (BCA) that we have been developing for the past several years. To accelerate the reconstruction process using the filtered backprojection (FBP) method, specialized hardware or graphics cards can be used. However, specialized hardware is expensive and not flexible. The graphics processing unit (GPU) in a current graphic card can only reconstruct images in a reduced precision and is not easy to program. In this paper, an acceleration scheme is proposed based on a multi-core PC. In the proposed scheme, several techniques are integrated, including utilization of geometric symmetry, optimization of data structures, single-instruction multiple-data (SIMD) processing, multithreaded computation, and an Intel C++ compilier. Our scheme maintains the original precision and involves no data exchange between the GPU and CPU. The merits of our scheme are demonstrated in numerical experiments against the traditional implementation. Our scheme achieves a speedup of about 40, which can be further improved by several folds using the latest quad-core processors.

  19. A fast inner product processor based on equal alignments

    Energy Technology Data Exchange (ETDEWEB)

    Smith, S.P.; Torng, H.C.

    1985-11-01

    Inner product computation is an important operation, invoked repeatedly in matrix multiplications. A high-speed inner product processor can be very useful (among many possible applications) in real-time signal processing. This paper presents the design of a fast inner product processor, with appreciably reduced latency and cost. The inner product processor is implemented with a tree of carry-propagate or carry-save adders; this structure is obtained with the incorporation of three innovations in the conventional multiply/add tree: The leaf-multipliers are expanded into adder subtrees, thus achieving an O(log Nb) latency, where N denotes the number of elements in a vector and b the number of bits in each element. The partial products, to be summed in producing an inner product, are reordered according to their ''minimum alignments.'' This reordering brings approximately a 20% savings in hardware-including adders and data paths. The reduction in adder widths also yields savings in carry propagation time for carry-propagate adders. For trees implemented with carry-save adders, the partial product reordering also serves to truncate the carry propagation chain in the final propagation stage by 2 log b - 1 positions, thus significantly reducing the latency further. A form of the Baugh and Wooley algorithm is adopted to implement two's complement notation with changes only in peripheral hardware.

  20. The Serial Link Processor for the Fast TracKer (FTK) processor at ATLAS

    CERN Document Server

    Biesuz, Nicolo Vladi; The ATLAS collaboration; Luciano, Pierluigi; Magalotti, Daniel; Rossi, Enrico

    2015-01-01

    The Associative Memory (AM) system of the Fast Tracker (FTK) processor has been designed to perform pattern matching using the hit information of the ATLAS experiment silicon tracker. The AM is the heart of FTK and is mainly based on the use of ASICs (AM chips) designed on purpose to execute pattern matching with a high degree of parallelism. It finds track candidates at low resolution that are seeds for a full resolution track fitting. To solve the very challenging data traffic problems inside FTK, multiple board and chip designs have been performed. The currently proposed solution is named the “Serial Link Processor” and is based on an extremely powerful network of 2 Gb/s serial links. This paper reports on the design of the Serial Link Processor consisting of two types of boards, the Local Associative Memory Board (LAMB), a mezzanine where the AM chips are mounted, and the Associative Memory Board (AMB), a 9U VME board which holds and exercises four LAMBs. We report on the performance of the intermedia...

  1. The Serial Link Processor for the Fast TracKer (FTK) processor at ATLAS

    CERN Document Server

    Biesuz, Nicolo Vladi; The ATLAS collaboration; Luciano, Pierluigi; Magalotti, Daniel; Rossi, Enrico

    2015-01-01

    The Associative Memory (AM) system of the Fast Tracker (FTK) processor has been designed to perform pattern matching using the hit information of the ATLAS experiment silicon tracker. The AM is the heart of FTK and is mainly based on the use of ASICs (AM chips) designed to execute pattern matching with a high degree of parallelism. The AM system finds track candidates at low resolution that are seeds for a full resolution track fitting. To solve the very challenging data traffic problems inside FTK, multiple board and chip designs have been performed. The currently proposed solution is named the “Serial Link Processor” and is based on an extremely powerful network of 828 2 Gbit/s serial links for a total in/out bandwidth of 56 Gb/s. This paper reports on the design of the Serial Link Processor consisting of two types of boards, the Local Associative Memory Board (LAMB), a mezzanine where the AM chips are mounted, and the Associative Memory Board (AMB), a 9U VME board which holds and exercises four LAMBs. ...

  2. Fast multi-core based multimodal registration of 2D cross-sections and 3D datasets

    Directory of Open Access Journals (Sweden)

    Pielot Rainer

    2010-01-01

    Full Text Available Abstract Background Solving bioinformatics tasks often requires extensive computational power. Recent trends in processor architecture combine multiple cores into a single chip to improve overall performance. The Cell Broadband Engine (CBE, a heterogeneous multi-core processor, provides power-efficient and cost-effective high-performance computing. One application area is image analysis and visualisation, in particular registration of 2D cross-sections into 3D image datasets. Such techniques can be used to put different image modalities into spatial correspondence, for example, 2D images of histological cuts into morphological 3D frameworks. Results We evaluate the CBE-driven PlayStation 3 as a high performance, cost-effective computing platform by adapting a multimodal alignment procedure to several characteristic hardware properties. The optimisations are based on partitioning, vectorisation, branch reducing and loop unrolling techniques with special attention to 32-bit multiplies and limited local storage on the computing units. We show how a typical image analysis and visualisation problem, the multimodal registration of 2D cross-sections and 3D datasets, benefits from the multi-core based implementation of the alignment algorithm. We discuss several CBE-based optimisation methods and compare our results to standard solutions. More information and the source code are available from http://cbe.ipk-gatersleben.de. Conclusions The results demonstrate that the CBE processor in a PlayStation 3 accelerates computational intensive multimodal registration, which is of great importance in biological/medical image processing. The PlayStation 3 as a low cost CBE-based platform offers an efficient option to conventional hardware to solve computational problems in image processing and bioinformatics.

  3. Fast multi-core based multimodal registration of 2D cross-sections and 3D datasets.

    Science.gov (United States)

    Scharfe, Michael; Pielot, Rainer; Schreiber, Falk

    2010-01-11

    Solving bioinformatics tasks often requires extensive computational power. Recent trends in processor architecture combine multiple cores into a single chip to improve overall performance. The Cell Broadband Engine (CBE), a heterogeneous multi-core processor, provides power-efficient and cost-effective high-performance computing. One application area is image analysis and visualisation, in particular registration of 2D cross-sections into 3D image datasets. Such techniques can be used to put different image modalities into spatial correspondence, for example, 2D images of histological cuts into morphological 3D frameworks. We evaluate the CBE-driven PlayStation 3 as a high performance, cost-effective computing platform by adapting a multimodal alignment procedure to several characteristic hardware properties. The optimisations are based on partitioning, vectorisation, branch reducing and loop unrolling techniques with special attention to 32-bit multiplies and limited local storage on the computing units. We show how a typical image analysis and visualisation problem, the multimodal registration of 2D cross-sections and 3D datasets, benefits from the multi-core based implementation of the alignment algorithm. We discuss several CBE-based optimisation methods and compare our results to standard solutions. More information and the source code are available from http://cbe.ipk-gatersleben.de. The results demonstrate that the CBE processor in a PlayStation 3 accelerates computational intensive multimodal registration, which is of great importance in biological/medical image processing. The PlayStation 3 as a low cost CBE-based platform offers an efficient option to conventional hardware to solve computational problems in image processing and bioinformatics.

  4. Fast digital processor for event selection according to particle number difference

    International Nuclear Information System (INIS)

    Basiladze, S.G.; Gus'kov, B.N.; Li Van Sun; Maksimov, A.N.; Parfenov, A.N.

    1978-01-01

    A fast digital processor for a magnetic spectrometer is described. It is used in experimental searches for charmed particles. The basic purpose of the processor is discriminating events in the difference of numbers of particles passing through two proportional chambers (PC). The processor consists of three units for detecting signals with PC, and a binary coder. The number of inputs of the processor is 32 for the first PC and 64 for the second. The difference in the number of particles discriminated is from 0 to 8. The resolution time is 180 ns. The processor is built in the CAMAC standard

  5. Multithreading for Embedded Reconfigurable Multicore Systems

    NARCIS (Netherlands)

    Zaykov, P.G.

    2014-01-01

    In this dissertation, we address the problem of performance efficient multithreading execution on heterogeneous multicore embedded systems. By heterogeneous multicore embedded systems we refer to those, which have real-time requirements and consist of processor tiles with General Purpose Processor

  6. Multithreading for embedded reconfigurable multicore systems

    NARCIS (Netherlands)

    Zaykov, P.G.

    2014-01-01

    In this dissertation, we address the problem of performance efficient multithreading execution on heterogeneous multicore embedded systems. By heterogeneous multicore embedded systems we refer to those, which have real-time requirements and consist of processor tiles with General Purpose Processor

  7. MORPION: a fast hardware processor for straight line finding in MWPC

    International Nuclear Information System (INIS)

    Mur, M.

    1980-02-01

    A fast hardware processor for straight line finding in MWPC has been built in Saclay and successfully operated in the NA3 experiment at CERN. We give the motivations to build this processor, and describe the hardware implementation of the line finding algorithm. Finally its use and performance in NA3 are described

  8. Fast parallel computation of polynomials using few processors

    DEFF Research Database (Denmark)

    Valiant, Leslie; Skyum, Sven

    1981-01-01

    It is shown that any multivariate polynomial that can be computed sequentially in C steps and has degree d can be computed in parallel in 0((log d) (log C + log d)) steps using only (Cd)0(1) processors....

  9. A fast track trigger processor for the OPAL detector at LEP

    International Nuclear Information System (INIS)

    Carter, A.A.; Jaroslawski, S.; Wagner, A.

    1986-01-01

    A fast hardware track trigger processor being built for the OPAL experiment is described. The processor will analyse data from the central drift chambers of OPAL to determine whether any tracks come from the interaction region, and thereby eliminate background events. The processor will find tracks over a large angular range, vertical strokecos thetavertical stroke < or approx. 0.95. The design of the processor is described, together with a brief account of its hardware implementation for OPAL. The results of feasibility studies are also presented. (orig.)

  10. Analytic processor model for fast design-space exploration

    NARCIS (Netherlands)

    Jongerius, R.; Mariani, G.; Anghel, A.; Dittmann, G.; Vermij, E.; Corporaal, H.

    2015-01-01

    In this paper, we propose an analytic model that takes as inputs a) a parametric microarchitecture-independent characterization of the target workload, and b) a hardware configuration of the core and the memory hierarchy, and returns as output an estimation of processor-core performance. To validate

  11. Fast Parallel Computation of Polynomials Using Few Processors

    DEFF Research Database (Denmark)

    Valiant, Leslie G.; Skyum, Sven; Berkowitz, S.

    1983-01-01

    It is shown that any multivariate polynomial of degree $d$ that can be computed sequentially in $C$ steps can be computed in parallel in $O((\\log d)(\\log C + \\log d))$ steps using only $(Cd)^{O(1)} $ processors....

  12. The Serial Link Processor for the Fast TracKer (FTK) processor at ATLAS

    CERN Document Server

    Andreani, A; The ATLAS collaboration; Beccherle, R; Beretta, M; Cipriani, R; Citraro, S; Citterio, M; Colombo, A; Crescioli, F; Dimas, D; Donati, S; Giannetti, P; Kordas, K; Lanza, A; Liberali, V; Luciano, P; Magalotti, D; Neroutsos, P; Nikolaidis, S; Piendibene, M; Sakellariou, A; Shojaii, S; Sotiropoulou, C-L; Stabile, A

    2014-01-01

    The Associative Memory (AM) system of the FTK processor has been designed to perform pattern matching using the hit information of the ATLAS silicon tracker. The AM is the heart of the FTK and it finds track candidates at low resolution that are seeds for a full resolution track fitting. To solve the very challenging data traffic problems inside the FTK, multiple designs and tests have been performed. The currently proposed solution is named the “Serial Link Processor” and is based on an extremely powerful network of 2 Gb/s serial links. This paper reports on the design of the Serial Link Processor consisting of the AM chip, an ASIC designed and optimized to perform pattern matching, and two types of boards, the Local Associative Memory Board (LAMB), a mezzanine where the AM chips are mounted, and the Associative Memory Board (AMB), a 9U VME board which holds and exercises four LAMBs. Special relevance will be given to the AMchip design that includes two custom cells optimized for low consumption. We repo...

  13. Multicore Considerations for Legacy Flight Software Migration

    Science.gov (United States)

    Vines, Kenneth; Day, Len

    2013-01-01

    In this paper we will discuss potential benefits and pitfalls when considering a migration from an existing single core code base to a multicore processor implementation. The results of this study present options that should be considered before migrating fault managers, device handlers and tasks with time-constrained requirements to a multicore flight software environment. Possible future multicore test bed demonstrations are also discussed.

  14. Image matrix processor for fast multi-dimensional computations

    Science.gov (United States)

    Roberson, George P.; Skeate, Michael F.

    1996-01-01

    An apparatus for multi-dimensional computation which comprises a computation engine, including a plurality of processing modules. The processing modules are configured in parallel and compute respective contributions to a computed multi-dimensional image of respective two dimensional data sets. A high-speed, parallel access storage system is provided which stores the multi-dimensional data sets, and a switching circuit routes the data among the processing modules in the computation engine and the storage system. A data acquisition port receives the two dimensional data sets representing projections through an image, for reconstruction algorithms such as encountered in computerized tomography. The processing modules include a programmable local host, by which they may be configured to execute a plurality of different types of multi-dimensional algorithms. The processing modules thus include an image manipulation processor, which includes a source cache, a target cache, a coefficient table, and control software for executing image transformation routines using data in the source cache and the coefficient table and loading resulting data in the target cache. The local host processor operates to load the source cache with a two dimensional data set, loads the coefficient table, and transfers resulting data out of the target cache to the storage system, or to another destination.

  15. Fast algorithms for coordinate processors in Galois field for multiplicity t = 4.5 and t > 5

    International Nuclear Information System (INIS)

    Nikityuk, N.M.

    1989-01-01

    Fast algorithms for solving the coordinate equations for special-purpose processors at multiplicity t = 4.5 and t > 5 are described. Block diagrams of coordinate processor for t 4 in Galois field GF(2 m ) is presented which is solved by a table method. Economical algorithms for solving the coordinate equations by serial methods at t > 5 are described. The algorithms and devices proposed could be applied when creating fast processors in high energy physics spectrometers. 9 refs.; 3 figs

  16. A general-purpose trigger processor system and its application to fast vertex trigger

    International Nuclear Information System (INIS)

    Hazumi, M.; Banas, E.; Natkaniec, Z.; Ostrowicz, W.

    1997-12-01

    A general-purpose hardware trigger system has been developed. The system comprises programmable trigger processors and pattern generator/samplers. The hardware design of the system is described. An application as a prototype of the very fast vertex trigger in an asymmetric B-factory at KEK is also explained. (author)

  17. A fast filter processor as a part of the trigger logic in an elastic scattering experiment

    International Nuclear Information System (INIS)

    Kenyon Gjerpe, I.

    1981-01-01

    A fast special purpose processor as a part of the trigger logic in an elastic scattering experiment is described. The decision to incorporate such a processor was taken because the trigger rate was estimated to be an order of magnitude higher than the date taking capability of the on-line minicomputer, a NORD 10. The processor is capable of checking the coplanarity and the opening angle of the two outgoing tracks within about 100 μs. This is done with a spatial resolution of 1 mm by using two points each track given by 3 MWPCs. For comparison this is two orders of magnitude faster than the same algorithm coded in assembly language on a PDP 11/40. The main contribution to this increased speed is due to extensive use of pipelining and parallelism. When running with the processor in the trigger, 75% more elastic events per incoming beam particle were collected, and 3 times as many elastic events per trigger were recorded on to tape for further in-depth analysis, than previously. Due to major improvements in the primary trigger logic this was less than the gain initially anticipated. A first version of the processor was designed and constructed in the CERN DD division by J. Joosten, M. Letheren and B. Martin under the supervision of C. Verkerk. The author was involved in the final design, construction and testing, and subsequently was responsible for the intergration, programming and running of the processor in the experiment. (orig.)

  18. Multicore technology architecture, reconfiguration, and modeling

    CERN Document Server

    Qadri, Muhammad Yasir

    2013-01-01

    The saturation of design complexity and clock frequencies for single-core processors has resulted in the emergence of multicore architectures as an alternative design paradigm. Nowadays, multicore/multithreaded computing systems are not only a de-facto standard for high-end applications, they are also gaining popularity in the field of embedded computing. The start of the multicore era has altered the concepts relating to almost all of the areas of computer architecture design, including core design, memory management, thread scheduling, application support, inter-processor communication, debu

  19. A Fast DCT Algorithm for Watermarking in Digital Signal Processor

    Directory of Open Access Journals (Sweden)

    S. E. Tsai

    2017-01-01

    Full Text Available Discrete cosine transform (DCT has been an international standard in Joint Photographic Experts Group (JPEG format to reduce the blocking effect in digital image compression. This paper proposes a fast discrete cosine transform (FDCT algorithm that utilizes the energy compactness and matrix sparseness properties in frequency domain to achieve higher computation performance. For a JPEG image of 8×8 block size in spatial domain, the algorithm decomposes the two-dimensional (2D DCT into one pair of one-dimensional (1D DCTs with transform computation in only 24 multiplications. The 2D spatial data is a linear combination of the base image obtained by the outer product of the column and row vectors of cosine functions so that inverse DCT is as efficient. Implementation of the FDCT algorithm shows that embedding a watermark image of 32 × 32 block pixel size in a 256 × 256 digital image can be completed in only 0.24 seconds and the extraction of watermark by inverse transform is within 0.21 seconds. The proposed FDCT algorithm is shown more efficient than many previous works in computation.

  20. XOP: A second generation fast processor for on-line use in high energy physics experiments

    International Nuclear Information System (INIS)

    Lingjaerde, T.

    1981-01-01

    Processors for trigger calculations and data compression in high energy physics are characterized by a high data input capability combined with fas execution of relatively simple routines. In order to achieve the required performance it is advantageous to replace the classical computer instruction-set by microcoded instructions, the various fields of which control the internal subunits in parallel. The fast processor called ESOP is based on such a principle: the different operations are handled step by step by dedicated optimized modules under control of a central instruction unit. Thus, the arithmetic operations, address calculations, conditional checking, loop counts and next instruction evaluation all overlap in time. Based upon the experience from ESOP the architecture of a new processor 'XOP' is beginning to take shape which will be faster and easier to use. In this context the most important innovations are: easy handling of operands in the arithmetic unit by means of three data buses and large data files, a powerful data addressing unit for easy handling of vectors, as well as single operands, and a very flexible logic for conditional branching. Input/output will be made transparent through the introduction of internal fast processors which will be used in conjunction with powerful firmware as a software debugging aid. (orig.)

  1. TMS320C25 Digital Signal Processor For 2-Dimensional Fast Fourier Transform Computation

    International Nuclear Information System (INIS)

    Ardisasmita, M. Syamsa

    1996-01-01

    The Fourier transform is one of the most important mathematical tool in signal processing and analysis, which converts information from the time/spatial domain into the frequency domain. Even with implementation of the Fast Fourier Transform algorithms in imaging data, the discrete Fourier transform execution consume a lot of time. Digital signal processors are designed specifically to perform computation intensive digital signal processing algorithms. By taking advantage of the advanced architecture. parallel processing, and dedicated digital signal processing (DSP) instruction sets. This device can execute million of DSP operations per second. The device architecture, characteristics and feature suitable for fast Fourier transform application and speed-up are discussed

  2. CMS multicore scheduling strategy

    International Nuclear Information System (INIS)

    Yzquierdo, Antonio Pérez-Calero; Hernández, Jose; Holzman, Burt; Majewski, Krista; McCrea, Alison

    2014-01-01

    In the next years, processor architectures based on much larger numbers of cores will be most likely the model to continue 'Moore's Law' style throughput gains. This not only results in many more jobs in parallel running the LHC Run 1 era monolithic applications, but also the memory requirements of these processes push the workernode architectures to the limit. One solution is parallelizing the application itself, through forking and memory sharing or through threaded frameworks. CMS is following all of these approaches and has a comprehensive strategy to schedule multicore jobs on the GRID based on the glideinWMS submission infrastructure. The main component of the scheduling strategy, a pilot-based model with dynamic partitioning of resources that allows the transition to multicore or whole-node scheduling without disallowing the use of single-core jobs, is described. This contribution also presents the experiences made with the proposed multicore scheduling schema and gives an outlook of further developments working towards the restart of the LHC in 2015.

  3. SAD PROCESSOR FOR MULTIPLE MACROBLOCK MATCHING IN FAST SEARCH VIDEO MOTION ESTIMATION

    Directory of Open Access Journals (Sweden)

    Nehal N. Shah

    2015-02-01

    Full Text Available Motion estimation is a very important but computationally complex task in video coding. Process of determining motion vectors based on the temporal correlation of consecutive frame is used for video compression. In order to reduce the computational complexity of motion estimation and maintain the quality of encoding during motion compensation, different fast search techniques are available. These block based motion estimation algorithms use the sum of absolute difference (SAD between corresponding macroblock in current frame and all the candidate macroblocks in the reference frame to identify best match. Existing implementations can perform SAD between two blocks using sequential or pipeline approach but performing multi operand SAD in single clock cycle with optimized recourses is state of art. In this paper various parallel architectures for computation of the fixed block size SAD is evaluated and fast parallel SAD architecture is proposed with optimized resources. Further SAD processor is described with 9 processing elements which can be configured for any existing fast search block matching algorithm. Proposed SAD processor consumes 7% fewer adders compared to existing implementation for one processing elements. Using nine PE it can process 84 HD frames per second in worse case which is good outcome for real time implementation. In average case architecture process 325 HD frames per second.

  4. Fast Optimal Replica Placement with Exhaustive Search Using Dynamically Reconfigurable Processor

    Directory of Open Access Journals (Sweden)

    Hidetoshi Takeshita

    2011-01-01

    Full Text Available This paper proposes a new replica placement algorithm that expands the exhaustive search limit with reasonable calculation time. It combines a new type of parallel data-flow processor with an architecture tuned for fast calculation. The replica placement problem is to find a replica-server set satisfying service constraints in a content delivery network (CDN. It is derived from the set cover problem which is known to be NP-hard. It is impractical to use exhaustive search to obtain optimal replica placement in large-scale networks, because calculation time increases with the number of combinations. To reduce calculation time, heuristic algorithms have been proposed, but it is known that no heuristic algorithm is assured of finding the optimal solution. The proposed algorithm suits parallel processing and pipeline execution and is implemented on DAPDNA-2, a dynamically reconfigurable processor. Experiments show that the proposed algorithm expands the exhaustive search limit by the factor of 18.8 compared to the conventional algorithm search limit running on a Neumann-type processor.

  5. A fast continuous magnetic field measurement system based on digital signal processors

    Energy Technology Data Exchange (ETDEWEB)

    Velev, G.V.; Carcagno, R.; DiMarco, J.; Kotelnikov, S.; Lamm, M.; Makulski, A.; /Fermilab; Maroussov, V.; /Purdue U.; Nehring, R.; Nogiec, J.; Orris, D.; /Fermilab; Poukhov,; Prakoshyn, F.; /Dubna, JINR; Schlabach, P.; Tompkins, J.C.; /Fermilab

    2005-09-01

    In order to study dynamic effects in accelerator magnets, such as the decay of the magnetic field during the dwell at injection and the rapid so-called ''snapback'' during the first few seconds of the resumption of the energy ramp, a fast continuous harmonics measurement system was required. A new magnetic field measurement system, based on the use of digital signal processors (DSP) and Analog to Digital (A/D) converters, was developed and prototyped at Fermilab. This system uses Pentek 6102 16 bit A/D converters and the Pentek 4288 DSP board with the SHARC ADSP-2106 family digital signal processor. It was designed to acquire multiple channels of data with a wide dynamic range of input signals, which are typically generated by a rotating coil probe. Data acquisition is performed under a RTOS, whereas processing and visualization are performed under a host computer. Firmware code was developed for the DSP to perform fast continuous readout of the A/D FIFO memory and integration over specified intervals, synchronized to the probe's rotation in the magnetic field. C, C++ and Java code was written to control the data acquisition devices and to process a continuous stream of data. The paper summarizes the characteristics of the system and presents the results of initial tests and measurements.

  6. A fast continuous magnetic field measurement system based on digital signal processors

    International Nuclear Information System (INIS)

    Velev, G.V.; Carcagno, R.; DiMarco, J.; Kotelnikov, S.; Lamm, M.; Makulski, A.; Maroussov, V.; Nehring, R.; Nogiec, J.; Orris, D.; Poukhov, O.; Prakoshyn, F.; Schlabach, P.; Tompkins, J.C.

    2005-01-01

    In order to study dynamic effects in accelerator magnets, such as the decay of the magnetic field during the dwell at injection and the rapid so-called ''snapback'' during the first few seconds of the resumption of the energy ramp, a fast continuous harmonics measurement system was required. A new magnetic field measurement system, based on the use of digital signal processors (DSP) and Analog to Digital (A/D) converters, was developed and prototyped at Fermilab. This system uses Pentek 6102 16 bit A/D converters and the Pentek 4288 DSP board with the SHARC ADSP-2106 family digital signal processor. It was designed to acquire multiple channels of data with a wide dynamic range of input signals, which are typically generated by a rotating coil probe. Data acquisition is performed under a RTOS, whereas processing and visualization are performed under a host computer. Firmware code was developed for the DSP to perform fast continuous readout of the A/D FIFO memory and integration over specified intervals, synchronized to the probe's rotation in the magnetic field. C, C++ and Java code was written to control the data acquisition devices and to process a continuous stream of data. The paper summarizes the characteristics of the system and presents the results of initial tests and measurements

  7. High-Level Design for Ultra-Fast Software Defined Radio Prototyping on Multi-Processors Heterogeneous Platforms

    OpenAIRE

    Moy , Christophe; Raulet , Mickaël

    2010-01-01

    International audience; The design of Software Defined Radio (SDR) equipments (terminals, base stations, etc.) is still very challenging. We propose here a design methodology for ultra-fast prototyping on heterogeneous platforms made of GPPs (General Purpose Processors), DSPs (Digital Signal Processors) and FPGAs (Field Programmable Gate Array). Lying on a component-based approach, the methodology mainly aims at automating as much as possible the design from an algorithmic validation to a mul...

  8. Associative Memory Design for the Fast TracKer Processor (FTK)at ATLAS

    CERN Document Server

    Annovi, A; The ATLAS collaboration; Beretta, M; Bossini, E; Crescioli, F; Dell'Orso, M; Giannetti, P; Hoff, J; Liu, T; Liberali, V; Sacco, I; Schoening, A; Soltveit, H K; Stabile, A; Tripiccione, R

    2011-01-01

    We describe a VLSI processor for pattern recognition based on Content Addressable Memory (CAM) architecture, optimized for on-line track finding in high-energy physics experiments. A large CAM bank stores all trajectories of interest and extracts the ones compatible with a given event. This task is naturally parallelized by a CAM architecture able to output identified trajectories, recognized among a huge amount of possible combinations, in just a few 100 MHz clock cycles. We have developed this device (called the AMchip03 processor), using 180 nm technology, for the Silicon Vertex Trigger (SVT) upgrade at CDF [1] using a standard-cell VLSI design methodology. We propose a new design that introduces a full custom CAM cell and takes advantage of 65 nm technology. The customized design maximizes the pattern density, minimizes the power consumption and implements the functionalities needed for the planned Fast Tracker (FTK) [2], an ATLAS trigger upgrade project at LHC. We introduce a new variable resolution patt...

  9. Advanced control system for the Integral Fast Reactor fuel pin processor

    International Nuclear Information System (INIS)

    Lau, L.D.; Randall, P.F.; Benedict, R.W.; Levinskas, D.

    1993-01-01

    A computerized control system has been developed for the remotely-operated fuel pin processor used in the Integral Fast Reactor Program, Fuel Cycle Facility (FCF). The pin processor remotely shears cast EBR- reactor fuel pins to length, inspects them for diameter, straightness, length, and weight, and then inserts acceptable pins into new sodium-loaded stainless-steel fuel element jackets. Two main components comprise the control system: (1) a programmable logic controller (PLC), together with various input/output modules and associated relay ladder-logic associated computer software. The PLC system controls the remote operation of the machine as directed by the OCS, and also monitors the machine operation to make operational data available to the OCS. The OCS allows operator control of the machine, provides nearly real-time viewing of the operational data, allows on-line changes of machine operational parameters, and records the collected data for each acceptable pin on a central data archiving computer. The two main components of the control system provide the operator with various levels of control ranging from manual operation to completely automatic operation by means of a graphic touch screen interface

  10. Performance of the AMBFTK board for the FastTracker processor for the ATLAS detector upgrade

    International Nuclear Information System (INIS)

    Alberti, F; Citterio, M; Liberali, V; Meroni, C; Andreani, A; Stabile, A; Annovi, A; Beretta, M; Crescioli, F; Dell'Orso, M; Piendibene, M; Volpi, G; Giannetti, P; Lanza, A; Magalotti, D; Sacco, I

    2013-01-01

    Modern experiments at hadron colliders search for extremely rare processes hidden in a very large background. As the experiment complexity and the accelerator backgrounds and luminosity increase we need increasingly complex and exclusive selections. The FastTracker (FTK) processor for the ATLAS experiment offers extremely powerful, very compact and low power consumption processing units for the future, which is essential for increased efficiency and purity in the Level 2 trigger selection through the intensive use of tracking. Pattern recognition is performed with Associative Memories (AM). The AMBFTK board and the AMchip04 integrated circuit have been designed specifically for this purpose. We report on the preliminary test results of the first prototypes of the AMBFTK board and of the AMchip04.

  11. XOP, a fast versatile processor, as a building block for parallel processing in high energy physics experiments

    International Nuclear Information System (INIS)

    Baehler, P.; Bosco, N.; Lingjaerde, T.; Ljuslin, C.; Van Praag, A.; Werner, P.

    1986-01-01

    The XOP processor has been designed for trigger calculation and data compression in high energy physics experiments. Therefore, emphasis has been placed upon fast execution and high input/output rate. The fast execution is achieved by a wide instruction word holding operations which are executed concurrently. Thus, the arithmetic operations, data address calculations, data accessing, condition checking, loop count checking and next instruction evaluation all overlap in time. In conventional micro-processors these operations are performed sequentially. In addition, the instruction set comprises not only the classical computer instructions, but also specialized instructions suitable for trigger calculations, such as bit search, population count, loose compare and vector instructions. In order to achieve a high input/output rate, each XOP ECLine interface board is equipped with an input and an output port which fulfil the LeCroy ECLine specifications. The autonomous input port allows a data rate of 40 Mbytes/sec, while the program controlled output port allows 20 Mbytes/sec. For Fastbus based systems a dual Fastbus master interface is under design which allows to build up a Fastbus multi-processor system. This design is being done in collaboration with LAPP Annecy for the CERN Lep L3 experiment. Their scheme comprises 4-5 XOP processors, each of them with a master interface on a data input segment and a master interface on a data output segment. This paper describes the structure of the XOP processor, the interface capabilities and the software development and debugging tools. (Auth.)

  12. CQPSO scheduling algorithm for heterogeneous multi-core DAG task model

    Science.gov (United States)

    Zhai, Wenzheng; Hu, Yue-Li; Ran, Feng

    2017-07-01

    Efficient task scheduling is critical to achieve high performance in a heterogeneous multi-core computing environment. The paper focuses on the heterogeneous multi-core directed acyclic graph (DAG) task model and proposes a novel task scheduling method based on an improved chaotic quantum-behaved particle swarm optimization (CQPSO) algorithm. A task priority scheduling list was built. A processor with minimum cumulative earliest finish time (EFT) was acted as the object of the first task assignment. The task precedence relationships were satisfied and the total execution time of all tasks was minimized. The experimental results show that the proposed algorithm has the advantage of optimization abilities, simple and feasible, fast convergence, and can be applied to the task scheduling optimization for other heterogeneous and distributed environment.

  13. The Associative Memory Serial Link Processor of the ALTAS Fast TracKer Processing System

    CERN Document Server

    Sotiropoulou, Calliope Louisa; The ATLAS collaboration

    2017-01-01

    The upgraded Trigger and Data Acquisition (TDAQ) system of the ATLAS experiment at the LHC will improve the capability of the detector to select the events with the greatest scientific potential. The Fast TracKer (FTK) is one of the ATLAS TDAQ upgrades that is presently under commissioning. FTK is a custom hardware system that feeds the High Level Trigger (HLT) with charged particle tracks reconstructed from hits in silicon detectors at the rate of 105 events per second. The main processing element of FTK is the Associative Memory (AM) system that is used to perform pattern matching with a high degree of parallelism. Its implementation is called the AM Board Serial Link Processor (AMBSLP) and it is a very efficient pattern matching machine that handles in parallel massive data samples. The AMBSLP consists of two types of boards: the Little Associative Memory Board (LAMB), a mezzanine where the AM chips are mounted, and the Associative Memory Board (AMB), a 9U VME motherboard that hosts four LAMB daughter-boar...

  14. The Associative Memory Serial Link Processor of the ALTAS Fast TracKer Processing System

    CERN Document Server

    Sotiropoulou, Calliope Louisa; The ATLAS collaboration

    2017-01-01

    The upgraded trigger system of the ATLAS experiment at LHC will improve the capability of the detectors to select the events with the greatest scientific potential. The FastTracker (FTK) is one of the ATLAS trigger upgrade that is presently under commissioning. FTK is a hardware system that feeds the High Level Trigger with charged particle tracks reconstructed from hits in silicon detectors at the rate of 105 events per second. Once a track candidate is found, the track reconstruction proceeds by matching low resolution hits to predefined patterns. Selected hits matching the predefined pattern are used in a second processing step for a more precise track fitting algorithm. The main processing element of FTK is the Associative Memory (AM) system that is used to perform pattern matching with high degree of parallelism. Its implementation is called the AM Board Serial Link Processor (AMBSLP) and it is a very efficient pattern matching machine that handles massively parallel data. The AMB SLP consists of two typ...

  15. Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures

    Science.gov (United States)

    Manolakos, Elias S.

    2015-01-01

    Fast increasing computational demand for all-to-all protein structures comparison (PSC) is a result of three confounding factors: rapidly expanding structural proteomics databases, high computational complexity of pairwise protein comparison algorithms, and the trend in the domain towards using multiple criteria for protein structures comparison (MCPSC) and combining results. We have developed a software framework that exploits many-core and multicore CPUs to implement efficient parallel MCPSC in modern processors based on three popular PSC methods, namely, TMalign, CE, and USM. We evaluate and compare the performance and efficiency of the two parallel MCPSC implementations using Intel's experimental many-core Single-Chip Cloud Computer (SCC) as well as Intel's Core i7 multicore processor. We show that the 48-core SCC is more efficient than the latest generation Core i7, achieving a speedup factor of 42 (efficiency of 0.9), making many-core processors an exciting emerging technology for large-scale structural proteomics. We compare and contrast the performance of the two processors on several datasets and also show that MCPSC outperforms its component methods in grouping related domains, achieving a high F-measure of 0.91 on the benchmark CK34 dataset. The software implementation for protein structure comparison using the three methods and combined MCPSC, along with the developed underlying rckskel algorithmic skeletons library, is available via GitHub. PMID:26605332

  16. Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures.

    Science.gov (United States)

    Sharma, Anuj; Manolakos, Elias S

    2015-01-01

    Fast increasing computational demand for all-to-all protein structures comparison (PSC) is a result of three confounding factors: rapidly expanding structural proteomics databases, high computational complexity of pairwise protein comparison algorithms, and the trend in the domain towards using multiple criteria for protein structures comparison (MCPSC) and combining results. We have developed a software framework that exploits many-core and multicore CPUs to implement efficient parallel MCPSC in modern processors based on three popular PSC methods, namely, TMalign, CE, and USM. We evaluate and compare the performance and efficiency of the two parallel MCPSC implementations using Intel's experimental many-core Single-Chip Cloud Computer (SCC) as well as Intel's Core i7 multicore processor. We show that the 48-core SCC is more efficient than the latest generation Core i7, achieving a speedup factor of 42 (efficiency of 0.9), making many-core processors an exciting emerging technology for large-scale structural proteomics. We compare and contrast the performance of the two processors on several datasets and also show that MCPSC outperforms its component methods in grouping related domains, achieving a high F-measure of 0.91 on the benchmark CK34 dataset. The software implementation for protein structure comparison using the three methods and combined MCPSC, along with the developed underlying rckskel algorithmic skeletons library, is available via GitHub.

  17. Summary of multi-core hardware and programming model investigations

    Energy Technology Data Exchange (ETDEWEB)

    Kelly, Suzanne Marie; Pedretti, Kevin Thomas Tauke; Levenhagen, Michael J.

    2008-05-01

    This report summarizes our investigations into multi-core processors and programming models for parallel scientific applications. The motivation for this study was to better understand the landscape of multi-core hardware, future trends, and the implications on system software for capability supercomputers. The results of this study are being used as input into the design of a new open-source light-weight kernel operating system being targeted at future capability supercomputers made up of multi-core processors. A goal of this effort is to create an agile system that is able to adapt to and efficiently support whatever multi-core hardware and programming models gain acceptance by the community.

  18. The EDRO board connected to the Associative Memory: a "Baby" FastTracKer processor for the ATLAS experiment

    CERN Document Server

    Annovi, A; Bevacqua, V; Cervigni, F; Crescioli, F; Fabbri, L; Giannetti, P; Giorgi, F; Magalotti, D; Negri, A; Piendibene, M; Roda, C; Sbarra, C; Villa, M; Vitillo, RA; Volpi, G

    2012-01-01

    The FastTracKer (FTK), a hardware dedicated processor, performs fast and precise online full track reconstruction at the ATLAS experiment, within an average latency of few dozens of microseconds. \\ It is made of two pipelined processors, the Associative Memory finding low precision tracks, and the Track Fitter refining the track quality with high precision fits. FTK has to face the Large Hadron Collider (LHC) Phase I luminosity. So, while the new processor requires the best of the available technology for tracking in high occupancy conditions, we want to use already existing prototypes to exercise soon the FTK functions in the new ATLAS environment. Few boards connected together constitute a "baby FTK" that will grow soon becoming the "vertical slice".\\ The vertical slice will cover a small projective wedge in the detector, but it will be functionally complete. It will provide a full test of the entire FTK data chain, in the laboratory first and on beam-on conditions after. It will require early development a...

  19. DIALIGN P: Fast pair-wise and multiple sequence alignment using parallel processors

    Directory of Open Access Journals (Sweden)

    Kaufmann Michael

    2004-09-01

    Full Text Available Abstract Background Parallel computing is frequently used to speed up computationally expensive tasks in Bioinformatics. Results Herein, a parallel version of the multi-alignment program DIALIGN is introduced. We propose two ways of dividing the program into independent sub-routines that can be run on different processors: (a pair-wise sequence alignments that are used as a first step to multiple alignment account for most of the CPU time in DIALIGN. Since alignments of different sequence pairs are completely independent of each other, they can be distributed to multiple processors without any effect on the resulting output alignments. (b For alignments of large genomic sequences, we use a heuristics by splitting up sequences into sub-sequences based on a previously introduced anchored alignment procedure. For our test sequences, this combined approach reduces the program running time of DIALIGN by up to 97%. Conclusions By distributing sub-routines to multiple processors, the running time of DIALIGN can be crucially improved. With these improvements, it is possible to apply the program in large-scale genomics and proteomics projects that were previously beyond its scope.

  20. The EDRO board connected to the Associative Memory: a "Baby" FastTracKer processor for the ATLAS experiment

    CERN Document Server

    Annovi, A; The ATLAS collaboration; Villa, M; Bevacqua, V; Vitillo, R A; Giorgi, F; Magalotti, D; Roda, C; Cervigni, F; Giannetti, P; Negri, A; Piendibene, M; Volpi, G; Fabbri, L; Sbarra, C

    2011-01-01

    The FastTracKer (FTK) is a dedicated hardware system able to perform online fast and precise track reconstruction of the full events at the Atlas experiment, within an average latency of few dozens of microseconds. It is made of two pipelined processors: the Associative Memory (AM), finding low precision tracks called "roads", and the Track Fitter (TF), refining the track quality with high precision fits. The FTK design [1] that works well at the Large Hadron Collider (LHC) Phase I luminosity requires the best of the available technology for tracking in high occupancy conditions. While the new processor is designed for the most demanding LHC conditions, we want to use already existing prototypes, part of them developed for the SLIM5 collaboration [2], to exercise the FTK functions in the new Atlas environment. During Laboratory tests, the EDRO board (Event Dispatch and Read-Out) receives on a clustering mezzanine (able to calculate the pixel and SCT cluster centroids) "fake" detector raw data on S-links from ...

  1. A highly efficient multi-core algorithm for clustering extremely large datasets

    Directory of Open Access Journals (Sweden)

    Kraus Johann M

    2010-04-01

    Full Text Available Abstract Background In recent years, the demand for computational power in computational biology has increased due to rapidly growing data sets from microarray and other high-throughput technologies. This demand is likely to increase. Standard algorithms for analyzing data, such as cluster algorithms, need to be parallelized for fast processing. Unfortunately, most approaches for parallelizing algorithms largely rely on network communication protocols connecting and requiring multiple computers. One answer to this problem is to utilize the intrinsic capabilities in current multi-core hardware to distribute the tasks among the different cores of one computer. Results We introduce a multi-core parallelization of the k-means and k-modes cluster algorithms based on the design principles of transactional memory for clustering gene expression microarray type data and categorial SNP data. Our new shared memory parallel algorithms show to be highly efficient. We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis employing repeated runs with slightly changed parameters. Computation speed of our Java based algorithm was increased by a factor of 10 for large data sets while preserving computational accuracy compared to single-core implementations and a recently published network based parallelization. Conclusions Most desktop computers and even notebooks provide at least dual-core processors. Our multi-core algorithms show that using modern algorithmic concepts, parallelization makes it possible to perform even such laborious tasks as cluster sensitivity and cluster number estimation on the laboratory computer.

  2. Case for a field-programmable gate array multicore hybrid machine for an image-processing application

    Science.gov (United States)

    Rakvic, Ryan N.; Ives, Robert W.; Lira, Javier; Molina, Carlos

    2011-01-01

    General purpose computer designers have recently begun adding cores to their processors in order to increase performance. For example, Intel has adopted a homogeneous quad-core processor as a base for general purpose computing. PlayStation3 (PS3) game consoles contain a multicore heterogeneous processor known as the Cell, which is designed to perform complex image processing algorithms at a high level. Can modern image-processing algorithms utilize these additional cores? On the other hand, modern advancements in configurable hardware, most notably field-programmable gate arrays (FPGAs) have created an interesting question for general purpose computer designers. Is there a reason to combine FPGAs with multicore processors to create an FPGA multicore hybrid general purpose computer? Iris matching, a repeatedly executed portion of a modern iris-recognition algorithm, is parallelized on an Intel-based homogeneous multicore Xeon system, a heterogeneous multicore Cell system, and an FPGA multicore hybrid system. Surprisingly, the cheaper PS3 slightly outperforms the Intel-based multicore on a core-for-core basis. However, both multicore systems are beaten by the FPGA multicore hybrid system by >50%.

  3. Fast track-finding trigger processor for the SLAC/LBL Mark II Detector

    International Nuclear Information System (INIS)

    Brafman, H.; Breidenbach, M.; Hettel, R.; Himel, T.; Horelick, D.

    1977-10-01

    The SLAC/LBL Mark II Magnetic Detector consists of various particle detectors arranged in cylindrical symmetry located in and around an axial magnetic field. A versatile, programmable secondary trigger processor was designed and built to find curved tracks in the detector. The system operates at a 10 MHz clock rate with a total processing time of 34 μsec and is used to ''trigger'' the data processing computer, thereby rejecting background and greatly improving the data acquisition aspects of the detector-computer combination

  4. The Serial Link Processor for the Fast TracKer (FTK) at ATLAS

    CERN Document Server

    Annovi, A; The ATLAS collaboration; Beccherle, R; Beretta, M; Biesuz, N; Billereau, W; Combe, J M; Citterio, M; Citraro, S; Crescioli, F; Dimas, D; Donati, S; Gentsos, C; Giannetti, P; Kordas, K; Lanza, A; Liberali, V; Luciano, P; Magalotti, D; Neroutsos, P; Nikolaidis, S; Piendibene, M; Rossi, E; Sakellariou, A; Shojaii, S; Sotiropoulou, C-L; Stabile, A; Vulliez, P

    2014-01-01

    The Associative Memory (AM) system of the FTK processor has been designed to perform pattern matching using the hit information of the ATLAS silicon tracker. The AM is the heart of the FTK and it finds track candidates at low resolution that are seeds for a full resolution track fitting. To solve the very challenging data traffic problem inside the FTK, multiple designs and tests have been performed. The currently proposed solution is named the “Serial Link Processor” and is based on an extremely powerful network of 2 Gb/s serial links. This paper reports on the design of the Serial Link Processor consisting of the AM chip, an ASIC designed and optimized to perform pattern matching, and two types of boards, the Local Associative Memory Board (LAMB), a mezzanine where the AM chips are mounted, and the Associative Memory Board (AMB), a 9U VME board which holds and exercises four LAMBs. We report also on the performance of a first prototype based on the use of a min@sic AM chip, a small but complete version ...

  5. A Parallel FPGA Implementation for Real-Time 2D Pixel Clustering for the ATLAS Fast TracKer Processor

    CERN Document Server

    Sotiropoulou, C-L; The ATLAS collaboration; Annovi, A; Beretta, M; Kordas, K; Nikolaidis, S; Petridou, C; Volpi, G

    2014-01-01

    The parallel 2D pixel clustering FPGA implementation used for the input system of the ATLAS Fast TracKer (FTK) processor is presented. The input system for the FTK processor will receive data from the Pixel and micro-strip detectors from inner ATLAS read out drivers (RODs) at full rate, for total of 760Gbs, as sent by the RODs after level-1 triggers. Clustering serves two purposes, the first is to reduce the high rate of the received data before further processing, the second is to determine the cluster centroid to obtain the best spatial measurement. For the pixel detectors the clustering is implemented by using a 2D-clustering algorithm that takes advantage of a moving window technique to minimize the logic required for cluster identification. The cluster detection window size can be adjusted for optimizing the cluster identification process. Additionally, the implementation can be parallelized by instantiating multiple cores to identify different clusters independently thus exploiting more FPGA resources. ...

  6. A Parallel FPGA Implementation for Real-Time 2D Pixel Clustering for the ATLAS Fast TracKer Processor

    CERN Document Server

    Sotiropoulou, C-L; The ATLAS collaboration; Annovi, A; Beretta, M; Kordas, K; Nikolaidis, S; Petridou, C; Volpi, G

    2014-01-01

    The parallel 2D pixel clustering FPGA implementation used for the input system of the ATLAS Fast TracKer (FTK) processor is presented. The input system for the FTK processor will receive data from the Pixel and micro-strip detectors from inner ATLAS read out drivers (RODs) at full rate, for total of 760Gbs, as sent by the RODs after level1 triggers. Clustering serves two purposes, the first is to reduce the high rate of the received data before further processing, the second is to determine the cluster centroid to obtain the best spatial measurement. For the pixel detectors the clustering is implemented by using a 2D-clustering algorithm that takes advantage of a moving window technique to minimize the logic required for cluster identification. The cluster detection window size can be adjusted for optimizing the cluster identification process. Additionally, the implementation can be parallelized by instantiating multiple cores to identify different clusters independently thus exploiting more FPGA resources. T...

  7. Design and development of microblaze processor based Remote Terminal Units for Fast Breeder Reactors

    International Nuclear Information System (INIS)

    Gour, Aditya; Santhanaraj, A.; Behera, R.P.; Murali, N.; Satyamurty, S.A.V.

    2013-01-01

    Remote Terminal Units (RTUs) are single board remote data acquisition and control systems that are widely used in FBRs during all states of plant operation. Distributed Digital Control System (DDCS) architecture is being followed for the plant control and operation, which mandates the need for multiple sockets support in TCPIP Ethernet communication in an embedded system. Existing RTUs are 89C51 microcontroller based systems where the TCPIP communication is done using Wiznet Module. These modules can support maximum of four sockets and are already obsolete from the market. In this paper a new RTU design is described where the complete digital logic of a board is implemented in one single FPGA device using Soft-core processor and EMAC controller with multiple socket support for the Ethernet communication. This makes design more reliable and immune to obsolescence. (author)

  8. Target Localization by Resolving the Time Synchronization Problem in Bistatic Radar Systems Using Space Fast-Time Adaptive Processor

    Directory of Open Access Journals (Sweden)

    D. Madurasinghe

    2009-01-01

    Full Text Available The proposed technique allows the radar receiver to accurately estimate the range of a large number of targets using a transmitter of opportunity as long as the location of the transmitter is known. The technique does not depend on the use of communication satellites or GPS systems, instead it relies on the availability of the direct transmit copy of the signal from the transmitter and the reflected paths off the various targets. An array-based space-fast time adaptive processor is implemented in order to estimate the path difference between the direct signal and the delayed signal, which bounces off the target. This procedure allows us to estimate the target distance as well as bearing.

  9. Silicon photonics for multicore fiber communication

    DEFF Research Database (Denmark)

    Ding, Yunhong; Kamchevska, Valerija; Dalgaard, Kjeld

    2016-01-01

    We review our recent work on silicon photonics for multicore fiber communication, including multicore fiber fan-in/fan-out, multicore fiber switches towards reconfigurable optical add/drop multiplexers. We also present multicore fiber based quantum communication using silicon devices.......We review our recent work on silicon photonics for multicore fiber communication, including multicore fiber fan-in/fan-out, multicore fiber switches towards reconfigurable optical add/drop multiplexers. We also present multicore fiber based quantum communication using silicon devices....

  10. Evaluating Multicore Algorithms on the Unified Memory Model

    Directory of Open Access Journals (Sweden)

    John E. Savage

    2009-01-01

    Full Text Available One of the challenges to achieving good performance on multicore architectures is the effective utilization of the underlying memory hierarchy. While this is an issue for single-core architectures, it is a critical problem for multicore chips. In this paper, we formulate the unified multicore model (UMM to help understand the fundamental limits on cache performance on these architectures. The UMM seamlessly handles different types of multiple-core processors with varying degrees of cache sharing at different levels. We demonstrate that our model can be used to study a variety of multicore architectures on a variety of applications. In particular, we use it to analyze an option pricing problem using the trinomial model and develop an algorithm for it that has near-optimal memory traffic between cache levels. We have implemented the algorithm on a two Quad-Core Intel Xeon 5310 1.6 GHz processors (8 cores. It achieves a peak performance of 19.5 GFLOPs, which is 38% of the theoretical peak of the multicore system. We demonstrate that our algorithm outperforms compiler-optimized and auto-parallelized code by a factor of up to 7.5.

  11. Evolution of CMS Workload Management Towards Multicore Job Support

    Energy Technology Data Exchange (ETDEWEB)

    Perez-Calero Yzquierdo, A. [Madrid, CIEMAT; Hernández, J. M. [Madrid, CIEMAT; Khan, F. A. [Quaid-i-Azam U.; Letts, J. [UC, San Diego; Majewski, K. [Fermilab; Rodrigues, A. M. [Fermilab; McCrea, A. [UC, San Diego; Vaandering, E. [Fermilab

    2015-12-23

    The successful exploitation of multicore processor architectures is a key element of the LHC distributed computing system in the coming era of the LHC Run 2. High-pileup complex-collision events represent a challenge for the traditional sequential programming in terms of memory and processing time budget. The CMS data production and processing framework is introducing the parallel execution of the reconstruction and simulation algorithms to overcome these limitations. CMS plans to execute multicore jobs while still supporting singlecore processing for other tasks difficult to parallelize, such as user analysis. The CMS strategy for job management thus aims at integrating single and multicore job scheduling across the Grid. This is accomplished by employing multicore pilots with internal dynamic partitioning of the allocated resources, capable of running payloads of various core counts simultaneously. An extensive test programme has been conducted to enable multicore scheduling with the various local batch systems available at CMS sites, with the focus on the Tier-0 and Tier-1s, responsible during 2015 of the prompt data reconstruction. Scale tests have been run to analyse the performance of this scheduling strategy and ensure an efficient use of the distributed resources. This paper presents the evolution of the CMS job management and resource provisioning systems in order to support this hybrid scheduling model, as well as its deployment and performance tests, which will enable CMS to transition to a multicore production model for the second LHC run.

  12. Parallel optimization of IDW interpolation algorithm on multicore platform

    Science.gov (United States)

    Guan, Xuefeng; Wu, Huayi

    2009-10-01

    Due to increasing power consumption, heat dissipation, and other physical issues, the architecture of central processing unit (CPU) has been turning to multicore rapidly in recent years. Multicore processor is packaged with multiple processor cores in the same chip, which not only offers increased performance, but also presents significant challenges to application developers. As a matter of fact, in GIS field most of current GIS algorithms were implemented serially and could not best exploit the parallelism potential on such multicore platforms. In this paper, we choose Inverse Distance Weighted spatial interpolation algorithm (IDW) as an example to study how to optimize current serial GIS algorithms on multicore platform in order to maximize performance speedup. With the help of OpenMP, threading methodology is introduced to split and share the whole interpolation work among processor cores. After parallel optimization, execution time of interpolation algorithm is greatly reduced and good performance speedup is achieved. For example, performance speedup on Intel Xeon 5310 is 1.943 with 2 execution threads and 3.695 with 4 execution threads respectively. An additional output comparison between pre-optimization and post-optimization is carried out and shows that parallel optimization does to affect final interpolation result.

  13. Low pain vs no pain multicore Haskells

    DEFF Research Database (Denmark)

    Aswad, Mustafa; Trinder, Phil; Al Zain, Abdallah

    2010-01-01

    the programming effort eachlanguage requires with the parallel performance delivered. The study uses 15’typical’ programs to compare a ’no pain’, i.e. entirely implicit, parallel languagewith three ’low pain’, i.e. semi-explicit languages.The parallel Haskell implementations use different versions of GHC......Multicores are becoming the dominant processor technology andfunctional languages are theoretically well suited to exploit them. In practice,however, implementing effective high level parallel functional languages is extremelychallenging.This paper is the first programming and performance...

  14. Using Multi-Core Systems for Rover Autonomy

    Science.gov (United States)

    Clement, Brad; Estlin, Tara; Bornstein, Benjamin; Springer, Paul; Anderson, Robert C.

    2010-01-01

    Task Objectives are: (1) Develop and demonstrate key capabilities for rover long-range science operations using multi-core computing, (a) Adapt three rover technologies to execute on SOA multi-core processor (b) Illustrate performance improvements achieved (c) Demonstrate adapted capabilities with rover hardware, (2) Targeting three high-level autonomy technologies (a) Two for onboard data analysis (b) One for onboard command sequencing/planning, (3) Technologies identified as enabling for future missions, (4)Benefits will be measured along several metrics: (a) Execution time / Power requirements (b) Number of data products processed per unit time (c) Solution quality

  15. Multicore systems on-chip practical software/hardware design

    CERN Document Server

    Abdallah, Abderazek Ben

    2013-01-01

    System on chips designs have evolved from fairly simple unicore, single memory designs to complex heterogeneous multicore SoC architectures consisting of a large number of IP blocks on the same silicon. To meet high computational demands posed by latest consumer electronic devices, most current systems are based on such paradigm, which represents a real revolution in many aspects in computing.The attraction of multicore processing for power reduction is compelling. By splitting a set of tasks among multiple processor cores, the operating frequency necessary for each core can be reduced, allowi

  16. The Associative Memory Serial Link Processor for the Fast TracKer (FTK) at ATLAS

    International Nuclear Information System (INIS)

    Andreani, A; Citterio, M; Liberali, V; Annovi, A; Beretta, M; Beccherle, R; Crescioli, F; Biesuz, N; Billereau, W; Combe, J M; Cipriani, R; Citraro, S; Donati, S; Giannetti, P; Luciano, P; Colombo, A; Dimas, D; Gentsos, C; Kordas, K; Lanza, A

    2014-01-01

    The Fast TracKer (FTK) is an extremely powerful and very compact processing unit, essential for efficient Level 2 trigger selection in future high-energy physics experiments at the LHC. FTK employs Associative Memories (AM) to perform pattern recognition; input and output data are transmitted over serial links at 2 Gbit/s to reduce routing congestion at the board level. Prototypes of the AM chip and of the AM board have been manufactured and tested, in preparation of the imminent design of the final version

  17. The Fast Tracker Real Time Processor: high quality real-time tracking at ATLAS

    CERN Document Server

    Stabile, A; The ATLAS collaboration

    2011-01-01

    As the LHC luminosity is ramped up to the design level of 1x1034 cm−2 s−1 and beyond, the high rates, multiplicities, and energies of particles seen by the detectors will pose a unique challenge. Only a tiny fraction of the produced collisions can be stored on tape and immense real-time data reduction is needed. An effective trigger system must maintain high trigger efficiencies for the most important physics and at the same time suppress the enormous QCD backgrounds. This requires massive computing power to minimize the online execution time of complex algorithms. A multi-level trigger is an effective solution for an otherwise impossible problem. The Fast Tracker (FTK)[1], [2] is a proposed upgrade to the current ATLAS trigger system that will operate at full Level-1 output rates and provide high quality tracks reconstructed over the entire detector by the start of processing in Level-2. FTK is a dedicated Super Computer based on a mixture of advanced technologies. The architecture broadly employs powerf...

  18. A parallel FPGA implementation for real-time 2D pixel clustering for the ATLAS Fast Tracker Processor

    International Nuclear Information System (INIS)

    Sotiropoulou, C L; Gkaitatzis, S; Kordas, K; Nikolaidis, S; Petridou, C; Annovi, A; Beretta, M; Volpi, G

    2014-01-01

    The parallel 2D pixel clustering FPGA implementation used for the input system of the ATLAS Fast TracKer (FTK) processor is presented. The input system for the FTK processor will receive data from the Pixel and micro-strip detectors from inner ATLAS read out drivers (RODs) at full rate, for total of 760Gbs, as sent by the RODs after level-1 triggers. Clustering serves two purposes, the first is to reduce the high rate of the received data before further processing, the second is to determine the cluster centroid to obtain the best spatial measurement. For the pixel detectors the clustering is implemented by using a 2D-clustering algorithm that takes advantage of a moving window technique to minimize the logic required for cluster identification. The cluster detection window size can be adjusted for optimizing the cluster identification process. Additionally, the implementation can be parallelized by instantiating multiple cores to identify different clusters independently thus exploiting more FPGA resources. This flexibility makes the implementation suitable for a variety of demanding image processing applications. The implementation is robust against bit errors in the input data stream and drops all data that cannot be identified. In the unlikely event of missing control words, the implementation will ensure stable data processing by inserting the missing control words in the data stream. The 2D pixel clustering implementation is developed and tested in both single flow and parallel versions. The first parallel version with 16 parallel cluster identification engines is presented. The input data from the RODs are received through S-Links and the processing units that follow the clustering implementation also require a single data stream, therefore data parallelizing (demultiplexing) and serializing (multiplexing) modules are introduced in order to accommodate the parallelized version and restore the data stream afterwards. The results of the first hardware tests of

  19. Exploring performance and power properties of modern multicore chips via simple machine models

    OpenAIRE

    Hager, Georg; Treibig, Jan; Habich, Johannes; Wellein, Gerhard

    2012-01-01

    Modern multicore chips show complex behavior with respect to performance and power. Starting with the Intel Sandy Bridge processor, it has become possible to directly measure the power dissipation of a CPU chip and correlate this data with the performance properties of the running code. Going beyond a simple bottleneck analysis, we employ the recently published Execution-Cache-Memory (ECM) model to describe the single- and multi-core performance of streaming kernels. The model refines the wel...

  20. High-Efficient Parallel CAVLC Encoders on Heterogeneous Multicore Architectures

    Directory of Open Access Journals (Sweden)

    H. Y. Su

    2012-04-01

    Full Text Available This article presents two high-efficient parallel realizations of the context-based adaptive variable length coding (CAVLC based on heterogeneous multicore processors. By optimizing the architecture of the CAVLC encoder, three kinds of dependences are eliminated or weaken, including the context-based data dependence, the memory accessing dependence and the control dependence. The CAVLC pipeline is divided into three stages: two scans, coding, and lag packing, and be implemented on two typical heterogeneous multicore architectures. One is a block-based SIMD parallel CAVLC encoder on multicore stream processor STORM. The other is a component-oriented SIMT parallel encoder on massively parallel architecture GPU. Both of them exploited rich data-level parallelism. Experiments results show that compared with the CPU version, more than 70 times of speedup can be obtained for STORM and over 50 times for GPU. The implementation of encoder on STORM can make a real-time processing for 1080p @30fps and GPU-based version can satisfy the requirements for 720p real-time encoding. The throughput of the presented CAVLC encoders is more than 10 times higher than that of published software encoders on DSP and multicore platforms.

  1. CMS readiness for multi-core workload scheduling

    Science.gov (United States)

    Perez-Calero Yzquierdo, A.; Balcas, J.; Hernandez, J.; Aftab Khan, F.; Letts, J.; Mason, D.; Verguilov, V.

    2017-10-01

    In the present run of the LHC, CMS data reconstruction and simulation algorithms benefit greatly from being executed as multiple threads running on several processor cores. The complexity of the Run 2 events requires parallelization of the code to reduce the memory-per- core footprint constraining serial execution programs, thus optimizing the exploitation of present multi-core processor architectures. The allocation of computing resources for multi-core tasks, however, becomes a complex problem in itself. The CMS workload submission infrastructure employs multi-slot partitionable pilots, built on HTCondor and GlideinWMS native features, to enable scheduling of single and multi-core jobs simultaneously. This provides a solution for the scheduling problem in a uniform way across grid sites running a diversity of gateways to compute resources and batch system technologies. This paper presents this strategy and the tools on which it has been implemented. The experience of managing multi-core resources at the Tier-0 and Tier-1 sites during 2015, along with the deployment phase to Tier-2 sites during early 2016 is reported. The process of performance monitoring and optimization to achieve efficient and flexible use of the resources is also described.

  2. CMS Readiness for Multi-Core Workload Scheduling

    Energy Technology Data Exchange (ETDEWEB)

    Perez-Calero Yzquierdo, A. [Madrid, CIEMAT; Balcas, J. [Caltech; Hernandez, J. [Madrid, CIEMAT; Aftab Khan, F. [NCP, Islamabad; Letts, J. [UC, San Diego; Mason, D. [Fermilab; Verguilov, V. [CLMI, Sofia

    2017-11-22

    In the present run of the LHC, CMS data reconstruction and simulation algorithms benefit greatly from being executed as multiple threads running on several processor cores. The complexity of the Run 2 events requires parallelization of the code to reduce the memory-per- core footprint constraining serial execution programs, thus optimizing the exploitation of present multi-core processor architectures. The allocation of computing resources for multi-core tasks, however, becomes a complex problem in itself. The CMS workload submission infrastructure employs multi-slot partitionable pilots, built on HTCondor and GlideinWMS native features, to enable scheduling of single and multi-core jobs simultaneously. This provides a solution for the scheduling problem in a uniform way across grid sites running a diversity of gateways to compute resources and batch system technologies. This paper presents this strategy and the tools on which it has been implemented. The experience of managing multi-core resources at the Tier-0 and Tier-1 sites during 2015, along with the deployment phase to Tier-2 sites during early 2016 is reported. The process of performance monitoring and optimization to achieve efficient and flexible use of the resources is also described.

  3. Multicore Challenges and Benefits for High Performance Scientific Computing

    Directory of Open Access Journals (Sweden)

    Ida M.B. Nielsen

    2008-01-01

    Full Text Available Until recently, performance gains in processors were achieved largely by improvements in clock speeds and instruction level parallelism. Thus, applications could obtain performance increases with relatively minor changes by upgrading to the latest generation of computing hardware. Currently, however, processor performance improvements are realized by using multicore technology and hardware support for multiple threads within each core, and taking full advantage of this technology to improve the performance of applications requires exposure of extreme levels of software parallelism. We will here discuss the architecture of parallel computers constructed from many multicore chips as well as techniques for managing the complexity of programming such computers, including the hybrid message-passing/multi-threading programming model. We will illustrate these ideas with a hybrid distributed memory matrix multiply and a quantum chemistry algorithm for energy computation using Møller–Plesset perturbation theory.

  4. High-density multicore fibers

    DEFF Research Database (Denmark)

    Takenaga, K.; Matsuo, S.; Saitoh, K.

    2016-01-01

    High-density single-mode multicore fibers were designed and fabricated. A heterogeneous 30-core fiber realized a low crosstalk of −55 dB. A quasi-single-mode homogeneous 31-core fiber attained the highest core count as a single-mode multicore fiber.......High-density single-mode multicore fibers were designed and fabricated. A heterogeneous 30-core fiber realized a low crosstalk of −55 dB. A quasi-single-mode homogeneous 31-core fiber attained the highest core count as a single-mode multicore fiber....

  5. Fuzzy logic based power-efficient real-time multi-core system

    CERN Document Server

    Ahmed, Jameel; Najam, Shaheryar; Najam, Zohaib

    2017-01-01

    This book focuses on identifying the performance challenges involved in computer architectures, optimal configuration settings and analysing their impact on the performance of multi-core architectures. Proposing a power and throughput-aware fuzzy-logic-based reconfiguration for Multi-Processor Systems on Chip (MPSoCs) in both simulation and real-time environments, it is divided into two major parts. The first part deals with the simulation-based power and throughput-aware fuzzy logic reconfiguration for multi-core architectures, presenting the results of a detailed analysis on the factors impacting the power consumption and performance of MPSoCs. In turn, the second part highlights the real-time implementation of fuzzy-logic-based power-efficient reconfigurable multi-core architectures for Intel and Leone3 processors. .

  6. Real-Time Audio Processing on the T-CREST Multicore Platform

    DEFF Research Database (Denmark)

    Ausin, Daniel Sanz; Pezzarossa, Luca; Schoeberl, Martin

    2017-01-01

    of the audio signal. This paper presents a real-time multicore audio processing system based on the T-CREST platform. T-CREST is a time-predictable multicore processor for real-time embedded systems. Multiple audio effect tasks have been implemented, which can be connected together in different configurations...... forming sequential and parallel effect chains, and using a network-onchip for intercommunication between processors. The evaluation of the system shows that real-time processing of multiple effect configurations is possible, and that the estimation and control of latency ensures real-time behavior.......Multicore platforms are nowadays widely used for audio processing applications, due to the improvement of computational power that they provide. However, some of these systems are not optimized for temporally constrained environments, which often leads to an undesired increase in the latency...

  7. Energy Efficient Multi-Core Processing

    Directory of Open Access Journals (Sweden)

    Charles Leech

    2014-06-01

    Full Text Available This paper evaluates the present state of the art of energy-efficient embedded processor design techniques and demonstrates, how small, variable-architecture embedded processors may exploit a run-time minimal architectural synthesis technique to achieve greater energy and area efficiency whilst maintaining performance. The picoMIPS architecture is presented, inspired by the MIPS, as an example of a minimal and energy efficient processor. The picoMIPS is a variablearchitecture RISC microprocessor with an application-specific minimised instruction set. Each implementation will contain only the necessary datapath elements in order to maximise area efficiency. Due to the relationship between logic gate count and power consumption, energy efficiency is also maximised in the processor therefore the system is designed to perform a specific task in the most efficient processor-based form. The principles of the picoMIPS processor are illustrated with an example of the discrete cosine transform (DCT and inverse DCT (IDCT algorithms implemented in a multi-core context to demonstrate the concept of minimal architecture synthesis and how it can be used to produce an application specific, energy efficient processor.

  8. Using the automata processor for fast pattern recognition in high energy physics experiments—A proof of concept

    Energy Technology Data Exchange (ETDEWEB)

    Wang, Michael H.L.S., E-mail: mwang@fnal.gov [Fermi National Accelerator Laboratory, Batavia, IL 60510 (United States); Cancelo, Gustavo; Green, Christopher [Fermi National Accelerator Laboratory, Batavia, IL 60510 (United States); Guo, Deyuan; Wang, Ke [University of Virginia, Charlottesville, VA 22904 (United States); Zmuda, Ted [Fermi National Accelerator Laboratory, Batavia, IL 60510 (United States)

    2016-10-01

    We explore the Micron Automata Processor (AP) as a suitable commodity technology that can address the growing computational needs of pattern recognition in High Energy Physics (HEP) experiments. A toy detector model is developed for which an electron track confirmation trigger based on the Micron AP serves as a test case. Although primarily meant for high speed text-based searches, we demonstrate a proof of concept for the use of the Micron AP in a HEP trigger application.

  9. Optimization of multi-phase compressible lattice Boltzmann codes on massively parallel multi-core systems

    NARCIS (Netherlands)

    Biferale, L.; Mantovani, F.; Pivanti, M.; Pozzati, F.; Sbragaglia, M.; Schifano, S.F.; Toschi, F.; Tripiccione, R.

    2011-01-01

    We develop a Lattice Boltzmann code for computational fluid-dynamics and optimize it for massively parallel systems based on multi-core processors. Our code describes 2D multi-phase compressible flows. We analyze the performance bottlenecks that we find as we gradually expose a larger fraction of

  10. Multicore Programming Challenges

    Science.gov (United States)

    Perrone, Michael

    The computer industry is facing fundamental challenges that are driving a major change in the design of computer processors. Due to restrictions imposed by quantum physics, one historical path to higher computer processor performance - by increased clock frequency - has come to an end. Increasing clock frequency now leads to power consumption costs that are too high to justify. As a result, we have seen in recent years that the processor frequencies have peaked and are receding from their high point. At the same time, competitive market conditions are giving business advantage to those companies that can field new streaming applications, handle larger data sets, and update their models to market conditions faster. The desire for newer, faster and larger is driving continued demand for higher computer performance.

  11. Concurrent Programming in Mac OS X and iOS Unleash Multicore Performance with Grand Central Dispatch

    CERN Document Server

    Nahavandipoor, Vandad

    2011-01-01

    Now that multicore processors are coming to mobile devices, wouldn't it be great to take advantage of all those cores without having to manage threads? This concise book shows you how to use Apple's Grand Central Dispatch (GCD) to simplify programming on multicore iOS devices and Mac OS X. Managing your application's resources on more than one core isn't easy, but it's vital. Apps that use only one core in a multicore environment will slow to a crawl. If you know how to program with Cocoa or Cocoa Touch, this guide will get you started with GCD right away, with many examples to help you writ

  12. Smart Multicore Embedded Systems

    DEFF Research Database (Denmark)

    This book provides a single-source reference to the state-of-the-art of high-level programming models and compilation tool-chains for embedded system platforms. The authors address challenges faced by programmers developing software to implement parallel applications in embedded systems, where very...... specificities of various embedded systems from different industries. Parallel programming tool-chains are described that take as input parameters both the application and the platform model, then determine relevant transformations and mapping decisions on the concrete platform, minimizing user intervention...... and hiding the difficulties related to the correct and efficient use of memory hierarchy and low level code generation. Describes tools and programming models for multicore embedded systems Emphasizes throughout performance per watt scalability Discusses realistic limits of software parallelization Enables...

  13. Multicore Education through Simulation

    Science.gov (United States)

    Ozturk, O.

    2011-01-01

    A project-oriented course for advanced undergraduate and graduate students is described for simulating multiple processor cores. Simics, a free simulator for academia, was utilized to enable students to explore computer architecture, operating systems, and hardware/software cosimulation. Motivation for including this course in the curriculum is…

  14. Parallelizing Compiler Framework and API for Power Reduction and Software Productivity of Real-Time Heterogeneous Multicores

    Science.gov (United States)

    Hayashi, Akihiro; Wada, Yasutaka; Watanabe, Takeshi; Sekiguchi, Takeshi; Mase, Masayoshi; Shirako, Jun; Kimura, Keiji; Kasahara, Hironori

    Heterogeneous multicores have been attracting much attention to attain high performance keeping power consumption low in wide spread of areas. However, heterogeneous multicores force programmers very difficult programming. The long application program development period lowers product competitiveness. In order to overcome such a situation, this paper proposes a compilation framework which bridges a gap between programmers and heterogeneous multicores. In particular, this paper describes the compilation framework based on OSCAR compiler. It realizes coarse grain task parallel processing, data transfer using a DMA controller, power reduction control from user programs with DVFS and clock gating on various heterogeneous multicores from different vendors. This paper also evaluates processing performance and the power reduction by the proposed framework on a newly developed 15 core heterogeneous multicore chip named RP-X integrating 8 general purpose processor cores and 3 types of accelerator cores which was developed by Renesas Electronics, Hitachi, Tokyo Institute of Technology and Waseda University. The framework attains speedups up to 32x for an optical flow program with eight general purpose processor cores and four DRP(Dynamically Reconfigurable Processor) accelerator cores against sequential execution by a single processor core and 80% of power reduction for the real-time AAC encoding.

  15. Reducing Competitive Cache Misses in Modern Processor Architectures

    OpenAIRE

    Prisagjanec, Milcho; Mitrevski, Pece

    2017-01-01

    The increasing number of threads inside the cores of a multicore processor, and competitive access to the shared cache memory, become the main reasons for an increased number of competitive cache misses and performance decline. Inevitably, the development of modern processor architectures leads to an increased number of cache misses. In this paper, we make an attempt to implement a technique for decreasing the number of competitive cache misses in the first level of cache memory. This tec...

  16. Multi-core System Architecture for Safety-critical Control Applications

    DEFF Research Database (Denmark)

    Li, Gang

    and size, and high power consumption. Increasing the frequency of a processor is becoming painful now due to the explosive power consumption. Furthermore, components integrated into a single-core processor have to be certified to the highest SIL, due to that no isolation is provided in a traditional single...... certification cost. Meanwhile, hardware platforms with improved processing power are required to execute the applications of larger size. To tackle the two issues mentioned above, the state of the art approaches are using more Electronic Control Units (ECU) in a federated architecture or increasing......-core processor. A promising alternative to improve processing power and provide isolation is to adopt a multi-core architecture with on-chip isolation. In general, a specific multi-core architecture can facilitate the development and certification of safety-related systems, due to its physical isolation between...

  17. Modern multicore and manycore architectures: Modelling, optimisation and benchmarking a multiblock CFD code

    Science.gov (United States)

    Hadade, Ioan; di Mare, Luca

    2016-08-01

    Modern multicore and manycore processors exhibit multiple levels of parallelism through a wide range of architectural features such as SIMD for data parallel execution or threads for core parallelism. The exploitation of multi-level parallelism is therefore crucial for achieving superior performance on current and future processors. This paper presents the performance tuning of a multiblock CFD solver on Intel SandyBridge and Haswell multicore CPUs and the Intel Xeon Phi Knights Corner coprocessor. Code optimisations have been applied on two computational kernels exhibiting different computational patterns: the update of flow variables and the evaluation of the Roe numerical fluxes. We discuss at great length the code transformations required for achieving efficient SIMD computations for both kernels across the selected devices including SIMD shuffles and transpositions for flux stencil computations and global memory transformations. Core parallelism is expressed through threading based on a number of domain decomposition techniques together with optimisations pertaining to alleviating NUMA effects found in multi-socket compute nodes. Results are correlated with the Roofline performance model in order to assert their efficiency for each distinct architecture. We report significant speedups for single thread execution across both kernels: 2-5X on the multicore CPUs and 14-23X on the Xeon Phi coprocessor. Computations at full node and chip concurrency deliver a factor of three speedup on the multicore processors and up to 24X on the Xeon Phi manycore coprocessor.

  18. Modeling the Power Variability of Core Speed Scaling on Homogeneous Multicore Systems

    Directory of Open Access Journals (Sweden)

    Zhihui Du

    2017-01-01

    Full Text Available We describe a family of power models that can capture the nonuniform power effects of speed scaling among homogeneous cores on multicore processors. These models depart from traditional ones, which assume that individual cores contribute to power consumption as independent entities. In our approach, we remove this independence assumption and employ statistical variables of core speed (average speed and the dispersion of the core speeds to capture the comprehensive heterogeneous impact of subtle interactions among the underlying hardware. We systematically explore the model family, deriving basic and refined models that give progressively better fits, and analyze them in detail. The proposed methodology provides an easy way to build power models to reflect the realistic workings of current multicore processors more accurately. Moreover, unlike the existing lower-level power models that require knowledge of microarchitectural details of the CPU cores and the last level cache to capture core interdependency, ours are easier to use and scalable to emerging and future multicore architectures with more cores. These attributes make the models particularly useful to system users or algorithm designers who need a quick way to estimate power consumption. We evaluate the family of models on contemporary x86 multicore processors using the SPEC2006 benchmarks. Our best model yields an average predicted error as low as 5%.

  19. Synthetic Aperture Sequential Beamforming implemented on multi-core platforms

    DEFF Research Database (Denmark)

    Kjeldsen, Thomas; Lassen, Lee; Hemmsen, Martin Christian

    2014-01-01

    This paper compares several computational ap- proaches to Synthetic Aperture Sequential Beamforming (SASB) targeting consumer level parallel processors such as multi-core CPUs and GPUs. The proposed implementations demonstrate that ultrasound imaging using SASB can be executed in real- time with ...... per second) on an Intel Core i7 2600 CPU with an AMD HD7850 and a NVIDIA GTX680 GPU. The fastest CPU and GPU implementations use 14% and 1.3% of the real-time budget of 62 ms/frame, respectively. The maximum achieved processing rate is 1265 frames/s....

  20. Improving the throughput of the AES algorithm with multicore processors

    OpenAIRE

    Barnes, A.; Fernando, R.; Mettananda, K.; Ragel, R. G.

    2014-01-01

    AES, Advanced Encryption Standard, can be considered the most widely used modern symmetric key encryption standard. To encrypt/decrypt a file using the AES algorithm, the file must undergo a set of complex computational steps. Therefore a software implementation of AES algorithm would be slow and consume large amount of time to complete. The immense increase of both stored and transferred data in the recent years had made this problem even more daunting when the need to encrypt/decrypt such d...

  1. Decimal Engine for Energy-Efficient Multicore Processors

    DEFF Research Database (Denmark)

    Nannarelli, Alberto

    2014-01-01

    propose a hybrid BFP/DFP engine to perform BFP division and DFP addition, multiplication and division. The main purpose of this engine is to offload the binary floating-point units for this type of operations and reduce the latency for decimal operations, and power and temperature for the whole die....

  2. A polyphase filter for GPUs and multi-core processors

    NARCIS (Netherlands)

    van der Veldt, K.; van Nieuwpoort, R.; Varbanescu, A.L.; Jesshope, C.

    2012-01-01

    Software radio telescopes are a new development in radio astronomy. Rather than using expensive dishes, they form distributed sensor networks of tens of thousands of simple receivers. Signals are processed in software instead of custom-built hardware, taking advantage of the flexibility that

  3. High-Rate Data Input using the Tilera Multicore Processor

    Data.gov (United States)

    National Aeronautics and Space Administration — Solve the existing problem of handling the on-board, real time, memory intensive processing of the Gb/s data stream of the scientific instrument.  Investigate the...

  4. A multicore processor for time-critical applications

    DEFF Research Database (Denmark)

    Schoeberl, Martin; Pezzarossa, Luca; Sparsø, Jens

    2018-01-01

    This article presents T-CREST many-core architecture specially designed and optimized for time-critical systems. —Tulika Mitra, National University of Singapore —Jürgen Teich, Friedrich-Alexander-Universität Erlangen-Nürnber —Lothar Thiele, ETH Zurich.......This article presents T-CREST many-core architecture specially designed and optimized for time-critical systems. —Tulika Mitra, National University of Singapore —Jürgen Teich, Friedrich-Alexander-Universität Erlangen-Nürnber —Lothar Thiele, ETH Zurich....

  5. A high performance load balance strategy for real-time multicore systems.

    Science.gov (United States)

    Cho, Keng-Mao; Tsai, Chun-Wei; Chiu, Yi-Shiuan; Yang, Chu-Sing

    2014-01-01

    Finding ways to distribute workloads to each processor core and efficiently reduce power consumption is of vital importance, especially for real-time systems. In this paper, a novel scheduling algorithm is proposed for real-time multicore systems to balance the computation loads and save power. The developed algorithm simultaneously considers multiple criteria, a novel factor, and task deadline, and is called power and deadline-aware multicore scheduling (PDAMS). Experiment results show that the proposed algorithm can greatly reduce energy consumption by up to 54.2% and the deadline times missed, as compared to the other scheduling algorithms outlined in this paper.

  6. A High Performance Load Balance Strategy for Real-Time Multicore Systems

    Directory of Open Access Journals (Sweden)

    Keng-Mao Cho

    2014-01-01

    Full Text Available Finding ways to distribute workloads to each processor core and efficiently reduce power consumption is of vital importance, especially for real-time systems. In this paper, a novel scheduling algorithm is proposed for real-time multicore systems to balance the computation loads and save power. The developed algorithm simultaneously considers multiple criteria, a novel factor, and task deadline, and is called power and deadline-aware multicore scheduling (PDAMS. Experiment results show that the proposed algorithm can greatly reduce energy consumption by up to 54.2% and the deadline times missed, as compared to the other scheduling algorithms outlined in this paper.

  7. Use of FPGA embedded processors for fast cluster reconstruction in the NA62 liquid krypton electromagnetic calorimeter

    Science.gov (United States)

    Badoni, D.; Bizzarri, M.; Bonaiuto, V.; Checcucci, B.; De Simone, N.; Federici, L.; Fucci, A.; Paoluzzi, G.; Papi, A.; Piccini, M.; Salamon, A.; Salina, G.; Santovetti, E.; Sargeni, F.; Venditti, S.

    2014-01-01

    The goal of the NA62 experiment at the CERN SPS is the measurement of the Branching Ratio of the very rare kaon decay K+→π+ ν bar nu with a 10% accuracy by collecting 100 events in two years of data taking. An efficient photon veto system is needed to reject the K+→π+ π0 background and a liquid krypton electromagnetic calorimeter will be used for this purpose in the 1-10 mrad angular region. The L0 trigger system for the calorimeter consists of a peak reconstruction algorithm implemented on FPGA by using a mixed parallel architecture based on soft core Altera NIOS II embedded processors together with custom VHDL modules. This solution allows an efficient and flexible reconstruction of the energy-deposition peak. The system will be totally composed of 36 TEL62 boards, 108 mezzanine cards and 215 high-performance FPGAs. We describe the design, current status and the results of the first performance tests.

  8. Use of FPGA embedded processors for fast cluster reconstruction in the NA62 liquid krypton electromagnetic calorimeter

    International Nuclear Information System (INIS)

    Badoni, D; Fucci, A; Paoluzzi, G; Salamon, A; Salina, G; Bizzarri, M; Bonaiuto, V; Simone, N De; Federici, L; Sargeni, F; Checcucci, B; Papi, A; Piccini, M; Santovetti, E; Venditti, S

    2014-01-01

    The goal of the NA62 experiment at the CERN SPS is the measurement of the Branching Ratio of the very rare kaon decay K + →π +  ν ν-bar with a 10% accuracy by collecting 100 events in two years of data taking. An efficient photon veto system is needed to reject the K + →π +  π 0 background and a liquid krypton electromagnetic calorimeter will be used for this purpose in the 1-10 mrad angular region. The L0 trigger system for the calorimeter consists of a peak reconstruction algorithm implemented on FPGA by using a mixed parallel architecture based on soft core Altera NIOS II embedded processors together with custom VHDL modules. This solution allows an efficient and flexible reconstruction of the energy-deposition peak. The system will be totally composed of 36 TEL62 boards, 108 mezzanine cards and 215 high-performance FPGAs. We describe the design, current status and the results of the first performance tests

  9. FAST: framework for heterogeneous medical image computing and visualization.

    Science.gov (United States)

    Smistad, Erik; Bozorgi, Mohammadmehdi; Lindseth, Frank

    2015-11-01

    Computer systems are becoming increasingly heterogeneous in the sense that they consist of different processors, such as multi-core CPUs and graphic processing units. As the amount of medical image data increases, it is crucial to exploit the computational power of these processors. However, this is currently difficult due to several factors, such as driver errors, processor differences, and the need for low-level memory handling. This paper presents a novel FrAmework for heterogeneouS medical image compuTing and visualization (FAST). The framework aims to make it easier to simultaneously process and visualize medical images efficiently on heterogeneous systems. FAST uses common image processing programming paradigms and hides the details of memory handling from the user, while enabling the use of all processors and cores on a system. The framework is open-source, cross-platform and available online. Code examples and performance measurements are presented to show the simplicity and efficiency of FAST. The results are compared to the insight toolkit (ITK) and the visualization toolkit (VTK) and show that the presented framework is faster with up to 20 times speedup on several common medical imaging algorithms. FAST enables efficient medical image computing and visualization on heterogeneous systems. Code examples and performance evaluations have demonstrated that the toolkit is both easy to use and performs better than existing frameworks, such as ITK and VTK.

  10. Multicore-Optimized Wavefront Diamond Blocking for Optimizing Stencil Updates

    KAUST Repository

    Malas, T.

    2015-07-02

    The importance of stencil-based algorithms in computational science has focused attention on optimized parallel implementations for multilevel cache-based processors. Temporal blocking schemes leverage the large bandwidth and low latency of caches to accelerate stencil updates and approach theoretical peak performance. A key ingredient is the reduction of data traffic across slow data paths, especially the main memory interface. In this work we combine the ideas of multicore wavefront temporal blocking and diamond tiling to arrive at stencil update schemes that show large reductions in memory pressure compared to existing approaches. The resulting schemes show performance advantages in bandwidth-starved situations, which are exacerbated by the high bytes per lattice update case of variable coefficients. Our thread groups concept provides a controllable trade-off between concurrency and memory usage, shifting the pressure between the memory interface and the CPU. We present performance results on a contemporary Intel processor.

  11. Multicore-Optimized Wavefront Diamond Blocking for Optimizing Stencil Updates

    KAUST Repository

    Malas, T.; Hager, G.; Ltaief, Hatem; Stengel, H.; Wellein, G.; Keyes, David E.

    2015-01-01

    The importance of stencil-based algorithms in computational science has focused attention on optimized parallel implementations for multilevel cache-based processors. Temporal blocking schemes leverage the large bandwidth and low latency of caches to accelerate stencil updates and approach theoretical peak performance. A key ingredient is the reduction of data traffic across slow data paths, especially the main memory interface. In this work we combine the ideas of multicore wavefront temporal blocking and diamond tiling to arrive at stencil update schemes that show large reductions in memory pressure compared to existing approaches. The resulting schemes show performance advantages in bandwidth-starved situations, which are exacerbated by the high bytes per lattice update case of variable coefficients. Our thread groups concept provides a controllable trade-off between concurrency and memory usage, shifting the pressure between the memory interface and the CPU. We present performance results on a contemporary Intel processor.

  12. Dealing with BIG Data - Exploiting the Potential of Multicore Parallelism and Auto-Tuning

    CERN Multimedia

    CERN. Geneva

    2012-01-01

    Physics experiments nowadays produce tremendous amounts of data that require sophisticated analyses in order to gain new insights. At such large scale, scientists are facing non-trivial software engineering problems in addition to the physics problems. Ubiquitous multicore processors and GPGPUs have turned almost any computer into a parallel machine and have pushed compute clusters and clouds to become multicore-based and more heterogenous. These developments complicate the exploitation of various types of parallelism within different layers of hardware and software. As a consequence, manual performance tuning is non-intuitive and tedious due to the large search space spanned by numerous inter-related tuning parameters. This talk addresses these challenges at CERN and discusses how to leverage multicore parallelization techniques in this context. It presents recent advances in automatic performance tuning to algorithmically find sweet spots with good performance. The talk also presents results from empiri...

  13. High-Performance Parallel and Stream Processing of X-ray Microdiffraction Data on Multicores

    International Nuclear Information System (INIS)

    Bauer, Michael A; McIntyre, Stewart; Xie Yuzhen; Biem, Alain; Tamura, Nobumichi

    2012-01-01

    We present the design and implementation of a high-performance system for processing synchrotron X-ray microdiffraction (XRD) data in IBM InfoSphere Streams on multicore processors. We report on the parallel and stream processing techniques that we use to harvest the power of clusters of multicores to analyze hundreds of gigabytes of synchrotron XRD data in order to reveal the microtexture of polycrystalline materials. The timing to process one XRD image using one pipeline is about ten times faster than the best C program at present. With the support of InfoSphere Streams platform, our software is able to be scaled up to operate on clusters of multi-cores for processing multiple images concurrently. This system provides a high-performance processing kernel to achieve near real-time data analysis of image data from synchrotron experiments.

  14. The Fast Tracker real time processor and its impact on the muon isolation, tau & b jet online selections at Atlas

    CERN Document Server

    Crescioli, F; The ATLAS collaboration

    2010-01-01

    As the LHC luminosity is ramped up to the design level of 3x10^{34} cm^{-2} s^{-1} and beyond, the high rates, multipicities, and energies of particles seen by the detectors will pose a unique challenge. Only a tiny fraction of the produced collisions can be stored on tape and immense real-time data reduction is needed. An effective trigger system must maintain high trigger efficiencies for the physics we are most interested in, and at the same time suppress the enormous QCD backgrounds. This requires massive computing power to minimize the online execution time of complex algorithms. A multi-level trigger is an effective solution for an otherwise impossible problem. The Fast Tracker (FTK) is a proposed upgrade to the ATLAS trigger system that will operate at full Level-1 output rates and provide high quality tracks reconstructed over the entire detector by the start of processing in Level-2. FTK solves the combinatorial challenge inherent to tracking by exploiting the massive parallelism of associative memor...

  15. LDRD final report : a lightweight operating system for multi-core capability class supercomputers.

    Energy Technology Data Exchange (ETDEWEB)

    Kelly, Suzanne Marie; Hudson, Trammell B. (OS Research); Ferreira, Kurt Brian; Bridges, Patrick G. (University of New Mexico); Pedretti, Kevin Thomas Tauke; Levenhagen, Michael J.; Brightwell, Ronald Brian

    2010-09-01

    The two primary objectives of this LDRD project were to create a lightweight kernel (LWK) operating system(OS) designed to take maximum advantage of multi-core processors, and to leverage the virtualization capabilities in modern multi-core processors to create a more flexible and adaptable LWK environment. The most significant technical accomplishments of this project were the development of the Kitten lightweight kernel, the co-development of the SMARTMAP intra-node memory mapping technique, and the development and demonstration of a scalable virtualization environment for HPC. Each of these topics is presented in this report by the inclusion of a published or submitted research paper. The results of this project are being leveraged by several ongoing and new research projects.

  16. Fast decision algorithms in low-power embedded processors for quality-of-service based connectivity of mobile sensors in heterogeneous wireless sensor networks.

    Science.gov (United States)

    Jaraíz-Simón, María D; Gómez-Pulido, Juan A; Vega-Rodríguez, Miguel A; Sánchez-Pérez, Juan M

    2012-01-01

    When a mobile wireless sensor is moving along heterogeneous wireless sensor networks, it can be under the coverage of more than one network many times. In these situations, the Vertical Handoff process can happen, where the mobile sensor decides to change its connection from a network to the best network among the available ones according to their quality of service characteristics. A fitness function is used for the handoff decision, being desirable to minimize it. This is an optimization problem which consists of the adjustment of a set of weights for the quality of service. Solving this problem efficiently is relevant to heterogeneous wireless sensor networks in many advanced applications. Numerous works can be found in the literature dealing with the vertical handoff decision, although they all suffer from the same shortfall: a non-comparable efficiency. Therefore, the aim of this work is twofold: first, to develop a fast decision algorithm that explores the entire space of possible combinations of weights, searching that one that minimizes the fitness function; and second, to design and implement a system on chip architecture based on reconfigurable hardware and embedded processors to achieve several goals necessary for competitive mobile terminals: good performance, low power consumption, low economic cost, and small area integration.

  17. Multicore Performance of Block Algebraic Iterative Reconstruction Methods

    DEFF Research Database (Denmark)

    Sørensen, Hans Henrik B.; Hansen, Per Christian

    2014-01-01

    Algebraic iterative methods are routinely used for solving the ill-posed sparse linear systems arising in tomographic image reconstruction. Here we consider the algebraic reconstruction technique (ART) and the simultaneous iterative reconstruction techniques (SIRT), both of which rely on semiconv......Algebraic iterative methods are routinely used for solving the ill-posed sparse linear systems arising in tomographic image reconstruction. Here we consider the algebraic reconstruction technique (ART) and the simultaneous iterative reconstruction techniques (SIRT), both of which rely...... on semiconvergence. Block versions of these methods, based on a partitioning of the linear system, are able to combine the fast semiconvergence of ART with the better multicore properties of SIRT. These block methods separate into two classes: those that, in each iteration, access the blocks in a sequential manner...... a fixed relaxation parameter in each method, namely, the one that leads to the fastest semiconvergence. Computational results show that for multicore computers, the sequential approach is preferable....

  18. High performance parallelism pearls 2 multicore and many-core programming approaches

    CERN Document Server

    Jeffers, Jim

    2015-01-01

    High Performance Parallelism Pearls Volume 2 offers another set of examples that demonstrate how to leverage parallelism. Similar to Volume 1, the techniques included here explain how to use processors and coprocessors with the same programming - illustrating the most effective ways to combine Xeon Phi coprocessors with Xeon and other multicore processors. The book includes examples of successful programming efforts, drawn from across industries and domains such as biomed, genetics, finance, manufacturing, imaging, and more. Each chapter in this edited work includes detailed explanations of t

  19. Energy Efficient Real-Time Scheduling Using DPM on Mobile Sensors with a Uniform Multi-Cores

    Directory of Open Access Journals (Sweden)

    Youngmin Kim

    2017-12-01

    Full Text Available In wireless sensor networks (WSNs, sensor nodes are deployed for collecting and analyzing data. These nodes use limited energy batteries for easy deployment and low cost. The use of limited energy batteries is closely related to the lifetime of the sensor nodes when using wireless sensor networks. Efficient-energy management is important to extending the lifetime of the sensor nodes. Most effort for improving power efficiency in tiny sensor nodes has focused mainly on reducing the power consumed during data transmission. However, recent emergence of sensor nodes equipped with multi-cores strongly requires attention to be given to the problem of reducing power consumption in multi-cores. In this paper, we propose an energy efficient scheduling method for sensor nodes supporting a uniform multi-cores. We extend the proposed T-Ler plane based scheduling for global optimal scheduling of a uniform multi-cores and multi-processors to enable power management using dynamic power management. In the proposed approach, processor selection for a scheduling and mapping method between the tasks and processors is proposed to efficiently utilize dynamic power management. Experiments show the effectiveness of the proposed approach compared to other existing methods.

  20. Pseudo-Random Number Generators for Vector Processors and Multicore Processors

    DEFF Research Database (Denmark)

    Fog, Agner

    2015-01-01

    Large scale Monte Carlo applications need a good pseudo-random number generator capable of utilizing both the vector processing capabilities and multiprocessing capabilities of modern computers in order to get the maximum performance. The requirements for such a generator are discussed. New ways...

  1. Special purpose processors for high energy physics applications

    International Nuclear Information System (INIS)

    Verkerk, C.

    1978-01-01

    The review on the subject of hardware processors from very fast decision logic for the split field magnet facility at CERN, to a point-finding processor used to relieve the data-acquisition minicomputer from the task of monitoring the SPS experiment is given. Block diagrams of decision making processor, point-finding processor, complanarity and opening angle processor and programmable track selector module are presented and discussed. The applications of fully programmable but slower processor on the one hand, and very fast and programmable decision logic on the other hand are given in this review

  2. Efficient Aho-Corasick String Matching on Emerging Multicore Architectures

    Energy Technology Data Exchange (ETDEWEB)

    Tumeo, Antonino; Villa, Oreste; Secchi, Simone; Chavarría-Miranda, Daniel

    2013-12-12

    String matching algorithms are critical to several scientific fields. Beside text processing and databases, emerging applications such as DNA protein sequence analysis, data mining, information security software, antivirus, ma- chine learning, all exploit string matching algorithms [3]. All these applica- tions usually process large quantity of textual data, require high performance and/or predictable execution times. Among all the string matching algorithms, one of the most studied, especially for text processing and security applica- tions, is the Aho-Corasick algorithm. 1 2 Book title goes here Aho-Corasick is an exact, multi-pattern string matching algorithm which performs the search in a time linearly proportional to the length of the input text independently from pattern set size. However, depending on the imple- mentation, when the number of patterns increase, the memory occupation may raise drastically. In turn, this can lead to significant variability in the performance, due to the memory access times and the caching effects. This is a significant concern for many mission critical applications and modern high performance architectures. For example, security applications such as Network Intrusion Detection Systems (NIDS), must be able to scan network traffic against very large dictionaries in real time. Modern Ethernet links reach up to 10 Gbps, and malicious threats are already well over 1 million, and expo- nentially growing [28]. When performing the search, a NIDS should not slow down the network, or let network packets pass unchecked. Nevertheless, on the current state-of-the-art cache based processors, there may be a large per- formance variability when dealing with big dictionaries and inputs that have different frequencies of matching patterns. In particular, when few patterns are matched and they are all in the cache, the procedure is fast. Instead, when they are not in the cache, often because many patterns are matched and the caches are

  3. Low pain vs no pain multi-core Haskells

    DEFF Research Database (Denmark)

    Aswad, Mustafa; Trinder, Phil; Al Zain, Abyd

    2011-01-01

    to compare a ‘no pain’, i.e. entirely implicit, parallel implementation with three ‘low pain’, i.e. semi-explicit, language implementations. We report detailed studies comparing the parallel performance delivered. The comparative performance metric is speedup which normalises against sequential performance...... the parallelism. The results of the study are encouraging and, on occasion, surprising. We find that fully implicit parallelism as implemented in FDIP cannot yet compete with semi-explicit parallel approaches. Semi-explicit parallelism shows encouraging speedup for many of the programs in the test suite......Multicore and NUMA architectures are becoming the dominant processor technology and functional languages are theoretically well suited to exploit them. In practice, however, implementing effective high level parallel functional languages is extremely challenging. This paper is a systematic...

  4. Hybrid programming model for implicit PDE simulations on multicore architectures

    KAUST Repository

    Kaushik, Dinesh; Keyes, David E.; Balay, Satish; Smith, Barry F.

    2011-01-01

    The complexity of programming modern multicore processor based clusters is rapidly rising, with GPUs adding further demand for fine-grained parallelism. This paper analyzes the performance of the hybrid (MPI+OpenMP) programming model in the context of an implicit unstructured mesh CFD code. At the implementation level, the effects of cache locality, update management, work division, and synchronization frequency are studied. The hybrid model presents interesting algorithmic opportunities as well: the convergence of linear system solver is quicker than the pure MPI case since the parallel preconditioner stays stronger when hybrid model is used. This implies significant savings in the cost of communication and synchronization (explicit and implicit). Even though OpenMP based parallelism is easier to implement (with in a subdomain assigned to one MPI process for simplicity), getting good performance needs attention to data partitioning issues similar to those in the message-passing case. © 2011 Springer-Verlag.

  5. Energy-aware Thread and Data Management in Heterogeneous Multi-core, Multi-memory Systems

    Energy Technology Data Exchange (ETDEWEB)

    Su, Chun-Yi [Virginia Polytechnic Inst. and State Univ. (Virginia Tech), Blacksburg, VA (United States)

    2014-12-16

    By 2004, microprocessor design focused on multicore scaling—increasing the number of cores per die in each generation—as the primary strategy for improving performance. These multicore processors typically equip multiple memory subsystems to improve data throughput. In addition, these systems employ heterogeneous processors such as GPUs and heterogeneous memories like non-volatile memory to improve performance, capacity, and energy efficiency. With the increasing volume of hardware resources and system complexity caused by heterogeneity, future systems will require intelligent ways to manage hardware resources. Early research to improve performance and energy efficiency on heterogeneous, multi-core, multi-memory systems focused on tuning a single primitive or at best a few primitives in the systems. The key limitation of past efforts is their lack of a holistic approach to resource management that balances the tradeoff between performance and energy consumption. In addition, the shift from simple, homogeneous systems to these heterogeneous, multicore, multi-memory systems requires in-depth understanding of efficient resource management for scalable execution, including new models that capture the interchange between performance and energy, smarter resource management strategies, and novel low-level performance/energy tuning primitives and runtime systems. Tuning an application to control available resources efficiently has become a daunting challenge; managing resources in automation is still a dark art since the tradeoffs among programming, energy, and performance remain insufficiently understood. In this dissertation, I have developed theories, models, and resource management techniques to enable energy-efficient execution of parallel applications through thread and data management in these heterogeneous multi-core, multi-memory systems. I study the effect of dynamic concurrent throttling on the performance and energy of multi-core, non-uniform memory access

  6. Trigger and decision processors

    International Nuclear Information System (INIS)

    Franke, G.

    1980-11-01

    In recent years there have been many attempts in high energy physics to make trigger and decision processes faster and more sophisticated. This became necessary due to a permanent increase of the number of sensitive detector elements in wire chambers and calorimeters, and in fact it was possible because of the fast developments in integrated circuits technique. In this paper the present situation will be reviewed. The discussion will be mainly focussed upon event filtering by pure software methods and - rather hardware related - microprogrammable processors as well as random access memory triggers. (orig.)

  7. Multicore in production: advantages and limits of the multiprocess approach in the ATLAS experiment

    International Nuclear Information System (INIS)

    Binet, S; Calafiura, P; Lavrijsen, W; Leggett, C; Tatarkhanov, M; Tsulaia, V; Jha, M K; Lesny, D; Severini, H; Smith, D; Snyder, S; VanGemmeren, P; Washbrook, A

    2012-01-01

    The shared memory architecture of multicore CPUs provides HEP developers with the opportunity to reduce the memory footprint of their applications by sharing memory pages between the cores in a processor. ATLAS pioneered the multi-process approach to parallelize HEP applications. Using Linux fork() and the Copy On Write mechanism we implemented a simple event task farm, which allowed us to achieve sharing of almost 80% of memory pages among event worker processes for certain types of reconstruction jobs with negligible CPU overhead. By leaving the task of managing shared memory pages to the operating system, we have been able to parallelize large reconstruction and simulation applications originally written to be run in a single thread of execution with little to no change to the application code. The process of validating AthenaMP for production took ten months of concentrated effort and is expected to continue for several more months. Besides validating the software itself, an important and time-consuming aspect of running multicore applications in production was to configure the ATLAS distributed production system to handle multicore jobs. This entailed defining multicore batch queues, where the unit resource is not a core, but a whole computing node; monitoring the output of many event workers; and adapting the job definition layer to handle computing resources with different event throughputs. We will present scalability and memory usage studies, based on data gathered both on dedicated hardware and at the CERN Computer Center.

  8. Parallel Task Processing on a Multicore Platform in a PC-based Control System for Parallel Kinematics

    Directory of Open Access Journals (Sweden)

    Harald Michalik

    2009-02-01

    Full Text Available Multicore platforms are such that have one physical processor chip with multiple cores interconnected via a chip level bus. Because they deliver a greater computing power through concurrency, offer greater system density multicore platforms provide best qualifications to address the performance bottleneck encountered in PC-based control systems for parallel kinematic robots with heavy CPU-load. Heavy load control tasks are generated by new control approaches that include features like singularity prediction, structure control algorithms, vision data integration and similar tasks. In this paper we introduce the parallel task scheduling extension of a communication architecture specially tailored for the development of PC-based control of parallel kinematics. The Sche-duling is specially designed for the processing on a multicore platform. It breaks down the serial task processing of the robot control cycle and extends it with parallel task processing paths in order to enhance the overall control performance.

  9. Scheduling Two-Sided Transformations Using Tile Algorithms on Multicore Architectures

    Directory of Open Access Journals (Sweden)

    Hatem Ltaief

    2010-01-01

    Full Text Available The objective of this paper is to describe, in the context of multicore architectures, three different scheduler implementations for the two-sided linear algebra transformations, in particular the Hessenberg and Bidiagonal reductions which are the first steps for the standard eigenvalue problems and the singular value decompositions respectively. State-of-the-art dense linear algebra softwares, such as the LAPACK and ScaLAPACK libraries, suffer performance losses on multicore processors due to their inability to fully exploit thread-level parallelism. At the same time the fine-grain dataflow model gains popularity as a paradigm for programming multicore architectures. Buttari et al. (Parellel Comput. Syst. Appl. 35 (2009, 38–53 introduced the concept of tile algorithms in which parallelism is no longer hidden inside Basic Linear Algebra Subprograms but is brought to the fore to yield much better performance. Along with efficient scheduling mechanisms for data-driven execution, these tile two-sided reductions achieve high performance computing by reaching up to 75% of the DGEMM peak on a 12000×12000 matrix with 16 Intel Tigerton 2.4 GHz processors. The main drawback of the tile algorithms approach for two-sided transformations is that the full reduction cannot be obtained in one stage. Other methods have to be considered to further reduce the band matrices to the required forms.

  10. A Fault Detection Mechanism in a Data-flow Scheduled Multithreaded Processor

    NARCIS (Netherlands)

    Fu, J.; Yang, Q.; Poss, R.; Jesshope, C.R.; Zhang, C.

    2014-01-01

    This paper designs and implements the Redundant Multi-Threading (RMT) in a Data-flow scheduled MultiThreaded (DMT) multicore processor, called Data-flow scheduled Redundant Multi-Threading (DRMT). Meanwhile, It presents Asynchronous Output Comparison (AOC) for RMT techniques to avoid fault detection

  11. Power consumption in multicore fibre networks

    DEFF Research Database (Denmark)

    Nooruzzaman, Md; Jain, Saurabh; Jung, Yongmin

    2017-01-01

    We study potential energy savings in MCF-based networks compared to SMF-based ones in a Pan-European network topology based on the power consumption of recently fabricated cladding-pumped multi-core optical fibre amplifiers....

  12. Multi-core fiber undersea transmission systems

    DEFF Research Database (Denmark)

    Nooruzzaman, Md; Morioka, Toshio

    2017-01-01

    Various potential architectures of branching units for multi-core fiber undersea transmission systems are presented. It is also investigated how different architectures of branching unit influence the number of fibers and those of inline components.......Various potential architectures of branching units for multi-core fiber undersea transmission systems are presented. It is also investigated how different architectures of branching unit influence the number of fibers and those of inline components....

  13. Accelerating 3D Elastic Wave Equations on Knights Landing based Intel Xeon Phi processors

    Science.gov (United States)

    Sourouri, Mohammed; Birger Raknes, Espen

    2017-04-01

    In advanced imaging methods like reverse-time migration (RTM) and full waveform inversion (FWI) the elastic wave equation (EWE) is numerically solved many times to create the seismic image or the elastic parameter model update. Thus, it is essential to optimize the solution time for solving the EWE as this will have a major impact on the total computational cost in running RTM or FWI. From a computational point of view applications implementing EWEs are associated with two major challenges. The first challenge is the amount of memory-bound computations involved, while the second challenge is the execution of such computations over very large datasets. So far, multi-core processors have not been able to tackle these two challenges, which eventually led to the adoption of accelerators such as Graphics Processing Units (GPUs). Compared to conventional CPUs, GPUs are densely populated with many floating-point units and fast memory, a type of architecture that has proven to map well to many scientific computations. Despite its architectural advantages, full-scale adoption of accelerators has yet to materialize. First, accelerators require a significant programming effort imposed by programming models such as CUDA or OpenCL. Second, accelerators come with a limited amount of memory, which also require explicit data transfers between the CPU and the accelerator over the slow PCI bus. The second generation of the Xeon Phi processor based on the Knights Landing (KNL) architecture, promises the computational capabilities of an accelerator but require the same programming effort as traditional multi-core processors. The high computational performance is realized through many integrated cores (number of cores and tiles and memory varies with the model) organized in tiles that are connected via a 2D mesh based interconnect. In contrary to accelerators, KNL is a self-hosted system, meaning explicit data transfers over the PCI bus are no longer required. However, like most

  14. The LASS hardware processor

    International Nuclear Information System (INIS)

    Kunz, P.F.

    1976-01-01

    The problems of data analysis with hardware processors are reviewed and a description is given of a programmable processor. This processor, the 168/E, has been designed for use in the LASS multi-processor system; it has an execution speed comparable to the IBM 370/168 and uses the subset of IBM 370 instructions appropriate to the LASS analysis task. (Auth.)

  15. Multicore and GPU algorithms for Nussinov RNA folding

    Science.gov (United States)

    2014-01-01

    Background One segment of a RNA sequence might be paired with another segment of the same RNA sequence due to the force of hydrogen bonds. This two-dimensional structure is called the RNA sequence's secondary structure. Several algorithms have been proposed to predict an RNA sequence's secondary structure. These algorithms are referred to as RNA folding algorithms. Results We develop cache efficient, multicore, and GPU algorithms for RNA folding using Nussinov's algorithm. Conclusions Our cache efficient algorithm provides a speedup between 1.6 and 3.0 relative to a naive straightforward single core code. The multicore version of the cache efficient single core algorithm provides a speedup, relative to the naive single core algorithm, between 7.5 and 14.0 on a 6 core hyperthreaded CPU. Our GPU algorithm for the NVIDIA C2050 is up to 1582 times as fast as the naive single core algorithm and between 5.1 and 11.2 times as fast as the fastest previously known GPU algorithm for Nussinov RNA folding. PMID:25082539

  16. XL-100S microprogrammable processor

    International Nuclear Information System (INIS)

    Gorbunov, N.V.; Guzik, Z.; Sutulin, V.A.; Forytski, A.

    1983-01-01

    The XL-100S microprogrammable processor providing the multiprocessor operation mode in the XL system crate is described. The processor meets the EUR 6500 CAMAC standards, address up to 4 Mbyte memory, and interacts with 7 CAMAC branchas. Eight external requests initiate operations preset by a sequence of microcommands in a memory of the capacity up to 64 kwords of 32-Git. The microprocessor architecture allows one to emulate commands of the majority of mini- or micro-computers, including floating point operations. The XL-100S processor may be used in various branches of experimental physics: for physical experiment apparatus control, fast selection of useful physical events, organization of the of input/output operations, organization of direct assess to memory included, etc. The Am2900 microprocessor set is used as an elementary base. The device is made in the form of a single width CAMAC module

  17. Fast and Accurate Simulation of the Cray XMT Multithreaded Supercomputer

    Energy Technology Data Exchange (ETDEWEB)

    Villa, Oreste; Tumeo, Antonino; Secchi, Simone; Manzano Franco, Joseph B.

    2012-12-31

    Irregular applications, such as data mining and analysis or graph-based computations, show unpredictable memory/network access patterns and control structures. Highly multithreaded architectures with large processor counts, like the Cray MTA-1, MTA-2 and XMT, appear to address their requirements better than commodity clusters. However, the research on highly multithreaded systems is currently limited by the lack of adequate architectural simulation infrastructures due to issues such as size of the machines, memory footprint, simulation speed, accuracy and customization. At the same time, Shared-memory MultiProcessors (SMPs) with multi-core processors have become an attractive platform to simulate large scale machines. In this paper, we introduce a cycle-level simulator of the highly multithreaded Cray XMT supercomputer. The simulator runs unmodified XMT applications. We discuss how we tackled the challenges posed by its development, detailing the techniques introduced to make the simulation as fast as possible while maintaining a high accuracy. By mapping XMT processors (ThreadStorm with 128 hardware threads) to host computing cores, the simulation speed remains constant as the number of simulated processors increases, up to the number of available host cores. The simulator supports zero-overhead switching among different accuracy levels at run-time and includes a network model that takes into account contention. On a modern 48-core SMP host, our infrastructure simulates a large set of irregular applications 500 to 2000 times slower than real time when compared to a 128-processor XMT, while remaining within 10\\% of accuracy. Emulation is only from 25 to 200 times slower than real time.

  18. Evaluating the scalability of HEP software and multi-core hardware

    CERN Document Server

    Jarp, S; Leduc, J; Nowak, A

    2011-01-01

    As researchers have reached the practical limits of processor performance improvements by frequency scaling, it is clear that the future of computing lies in the effective utilization of parallel and multi-core architectures. Since this significant change in computing is well underway, it is vital for HEP programmers to understand the scalability of their software on modern hardware and the opportunities for potential improvements. This work aims to quantify the benefit of new mainstream architectures to the HEP community through practical benchmarking on recent hardware solutions, including the usage of parallelized HEP applications.

  19. A Heterogeneous Multi-core Architecture with a Hardware Kernel for Control Systems

    DEFF Research Database (Denmark)

    Li, Gang; Guan, Wei; Sierszecki, Krzysztof

    2012-01-01

    Rapid industrialisation has resulted in a demand for improved embedded control systems with features such as predictability, high processing performance and low power consumption. Software kernel implementation on a single processor is becoming more difficult to satisfy those constraints....... This paper presents a multi-core architecture incorporating a hardware kernel on FPGAs, intended for high performance applications in control engineering domain. First, the hardware kernel is investigated on the basis of a component-based real-time kernel HARTEX (Hard Real-Time Executive for Control Systems...

  20. Solving Matrix Equations on Multi-Core and Many-Core Architectures

    Directory of Open Access Journals (Sweden)

    Peter Benner

    2013-11-01

    Full Text Available We address the numerical solution of Lyapunov, algebraic and differential Riccati equations, via the matrix sign function, on platforms equipped with general-purpose multicore processors and, optionally, one or more graphics processing units (GPUs. In particular, we review the solvers for these equations, as well as the underlying methods, analyze their concurrency and scalability and provide details on their parallel implementation. Our experimental results show that this class of hardware provides sufficient computational power to tackle large-scale problems, which only a few years ago would have required a cluster of computers.

  1. Safety-critical Java on a time-predictable processor

    DEFF Research Database (Denmark)

    Korsholm, Stephan E.; Schoeberl, Martin; Puffitsch, Wolfgang

    2015-01-01

    For real-time systems the whole execution stack needs to be time-predictable and analyzable for the worst-case execution time (WCET). This paper presents a time-predictable platform for safety-critical Java. The platform consists of (1) the Patmos processor, which is a time-predictable processor......; (2) a C compiler for Patmos with support for WCET analysis; (3) the HVM, which is a Java-to-C compiler; (4) the HVM-SCJ implementation which supports SCJ Level 0, 1, and 2 (for both single and multicore platforms); and (5) a WCET analysis tool. We show that real-time Java programs translated to C...... and compiled to a Patmos binary can be analyzed by the AbsInt aiT WCET analysis tool. To the best of our knowledge the presented system is the second WCET analyzable real-time Java system; and the first one on top of a RISC processor....

  2. Online track processor for the CDF upgrade

    International Nuclear Information System (INIS)

    Thomson, E. J.

    2002-01-01

    A trigger track processor, called the eXtremely Fast Tracker (XFT), has been designed for the CDF upgrade. This processor identifies high transverse momentum (> 1.5 GeV/c) charged particles in the new central outer tracking chamber for CDF II. The XFT design is highly parallel to handle the input rate of 183 Gbits/s and output rate of 44 Gbits/s. The processor is pipelined and reports the result for a new event every 132 ns. The processor uses three stages: hit classification, segment finding, and segment linking. The pattern recognition algorithms for the three stages are implemented in programmable logic devices (PLDs) which allow in-situ modification of the algorithm at any time. The PLDs reside on three different types of modules. The complete system has been installed and commissioned at CDF II. An overview of the track processor and performance in CDF Run II are presented

  3. Compiling the functional data-parallel language SaC for Microgrids of Self-Adaptive Virtual Processors

    NARCIS (Netherlands)

    Grelck, C.; Herhut, S.; Jesshope, C.; Joslin, C.; Lankamp, M.; Scholz, S.-B.; Shafarenko, A.

    2009-01-01

    We present preliminary results from compiling the high-level, functional and data-parallel programming language SaC into a novel multi-core design: Microgrids of Self-Adaptive Virtual Processors (SVPs). The side-effect free nature of SaC in conjunction with its data-parallel foundation make it an

  4. Two-processor automatized system to control fast measurements of the magnetic field index of the JINR 10 GeV proton synchrotron

    International Nuclear Information System (INIS)

    Chernykh, E.V.

    1981-01-01

    A two-processor system comprizing a hard-wired module and ES-1010 computer to control measurements of the magnetic field index of the JINR 10 GeV proton synchrotron is described. The system comprises the control module, a computer interface and a parallel branch driver residing in CAMAC system crate. The control module controls analogue multiplexer and analogue-to-digital converter through their front panels and writes down the information into a buffer memory module through the CAMAC highway. The computer controls the system, reads the information into core memory, writes down it on a magnetic tape, processes it and outputs n=f(r) plots on TV monitor and printer. The system provides the measurement up to 100 points during a magnetic field rise and minimal time of measurement 50 μs [ru

  5. Parallelization of the preconditioned IDR solver for modern multicore computer systems

    Science.gov (United States)

    Bessonov, O. A.; Fedoseyev, A. I.

    2012-10-01

    This paper present the analysis, parallelization and optimization approach for the large sparse matrix solver CNSPACK for modern multicore microprocessors. CNSPACK is an advanced solver successfully used for coupled solution of stiff problems arising in multiphysics applications such as CFD, semiconductor transport, kinetic and quantum problems. It employs iterative IDR algorithm with ILU preconditioning (user chosen ILU preconditioning order). CNSPACK has been successfully used during last decade for solving problems in several application areas, including fluid dynamics and semiconductor device simulation. However, there was a dramatic change in processor architectures and computer system organization in recent years. Due to this, performance criteria and methods have been revisited, together with involving the parallelization of the solver and preconditioner using Open MP environment. Results of the successful implementation for efficient parallelization are presented for the most advances computer system (Intel Core i7-9xx or two-processor Xeon 55xx/56xx).

  6. Probabilistic programmable quantum processors

    International Nuclear Information System (INIS)

    Buzek, V.; Ziman, M.; Hillery, M.

    2004-01-01

    We analyze how to improve performance of probabilistic programmable quantum processors. We show how the probability of success of the probabilistic processor can be enhanced by using the processor in loops. In addition, we show that an arbitrary SU(2) transformations of qubits can be encoded in program state of a universal programmable probabilistic quantum processor. The probability of success of this processor can be enhanced by a systematic correction of errors via conditional loops. Finally, we show that all our results can be generalized also for qudits. (Abstract Copyright [2004], Wiley Periodicals, Inc.)

  7. Reconfigurable Multicore Architectures for Streaming Applications

    NARCIS (Netherlands)

    Smit, Gerardus Johannes Maria; Kokkeler, Andre B.J.; Rauwerda, G.K.; Jacobs, J.W.M.; Nicolescu, G.; Mosterman, P.J.

    2009-01-01

    This chapter addresses reconfigurable heterogenous and homogeneous multicore system-on-chip (SoC) platforms for streaming digital signal processing applications, also called DSP applications. In streaming DSP applications, computations can be specified as a data flow graph with streams of data items

  8. Optimization of the coherence function estimation for multi-core central processing unit

    Science.gov (United States)

    Cheremnov, A. G.; Faerman, V. A.; Avramchuk, V. S.

    2017-02-01

    The paper considers use of parallel processing on multi-core central processing unit for optimization of the coherence function evaluation arising in digital signal processing. Coherence function along with other methods of spectral analysis is commonly used for vibration diagnosis of rotating machinery and its particular nodes. An algorithm is given for the function evaluation for signals represented with digital samples. The algorithm is analyzed for its software implementation and computational problems. Optimization measures are described, including algorithmic, architecture and compiler optimization, their results are assessed for multi-core processors from different manufacturers. Thus, speeding-up of the parallel execution with respect to sequential execution was studied and results are presented for Intel Core i7-4720HQ и AMD FX-9590 processors. The results show comparatively high efficiency of the optimization measures taken. In particular, acceleration indicators and average CPU utilization have been significantly improved, showing high degree of parallelism of the constructed calculating functions. The developed software underwent state registration and will be used as a part of a software and hardware solution for rotating machinery fault diagnosis and pipeline leak location with acoustic correlation method.

  9. A Heterogeneous Multi-core Architecture with a Hardware Kernel for Control Systems

    DEFF Research Database (Denmark)

    Li, Gang; Guan, Wei; Sierszecki, Krzysztof

    2012-01-01

    Rapid industrialisation has resulted in a demand for improved embedded control systems with features such as predictability, high processing performance and low power consumption. Software kernel implementation on a single processor is becoming more difficult to satisfy those constraints. This pa......Rapid industrialisation has resulted in a demand for improved embedded control systems with features such as predictability, high processing performance and low power consumption. Software kernel implementation on a single processor is becoming more difficult to satisfy those constraints......). Second, a heterogeneous multi-core architecture is investigated, focusing on its performance in relation to hard real-time constraints and predictable behavior. Third, the hardware implementation of HARTEX is designated to support the heterogeneous multi-core architecture. This hardware kernel has...... several advantages over a similar kernel implemented in software: higher-speed processing capability, parallel computation, and separation between the kernel itself and the applications being run. A microbenchmark has been used to compare the hardware kernel with the software kernel, and compare...

  10. High performance deformable image registration algorithms for manycore processors

    CERN Document Server

    Shackleford, James; Sharp, Gregory

    2013-01-01

    High Performance Deformable Image Registration Algorithms for Manycore Processors develops highly data-parallel image registration algorithms suitable for use on modern multi-core architectures, including graphics processing units (GPUs). Focusing on deformable registration, we show how to develop data-parallel versions of the registration algorithm suitable for execution on the GPU. Image registration is the process of aligning two or more images into a common coordinate frame and is a fundamental step to be able to compare or fuse data obtained from different sensor measurements. E

  11. Multicore in Production: Advantages and Limits of the Multiprocess Approach

    CERN Document Server

    Binet, S; The ATLAS collaboration; Lavrijsen, W; Leggett, Ch; Lesny, D; Jha, M K; Severini, H; Smith, D; Snyder, S; Tatarkhanov, M; Tsulaia, V; van Gemmeren, P; Washbrook, A

    2011-01-01

    The shared memory architecture of multicore CPUs provides HENP developers with the opportunity to reduce the memory footprint of their applications by sharing memory pages between the cores in a processor. ATLAS pioneered the multi-process approach to parallelizing HENP applications. Using Linux fork() and the Copy On Write mechanism we implemented a simple event task farm which allows to share up to 50% memory pages among event worker processes with negligible CPU overhead. By leaving the task of managing shared memory pages to the operating system, we have been able to run in parallel large reconstruction and simulation applications originally written to be run in a single thread of execution with little to no change to the application code. In spite of this, the process of validating athena multi-process for production took ten months of concentrated effort and is expected to continue for several more months. In general terms, we had two classes of problems in the multi-process port: merging the output fil...

  12. Median and Morphological Specialized Processors for a Real-Time Image Data Processing

    Directory of Open Access Journals (Sweden)

    Kazimierz Wiatr

    2002-01-01

    Full Text Available This paper presents the considerations on selecting a multiprocessor MISD architecture for fast implementation of the vision image processing. Using the author′s earlier experience with real-time systems, implementing of specialized hardware processors based on the programmable FPGA systems has been proposed in the pipeline architecture. In particular, the following processors are presented: median filter and morphological processor. The structure of a universal reconfigurable processor developed has been proposed as well. Experimental results are presented as delays on LCA level implementation for median filter, morphological processor, convolution processor, look-up-table processor, logic processor and histogram processor. These times compare with delays in general purpose processor and DSP processor.

  13. Embedded Processor Laboratory

    Data.gov (United States)

    Federal Laboratory Consortium — The Embedded Processor Laboratory provides the means to design, develop, fabricate, and test embedded computers for missile guidance electronics systems in support...

  14. Multithreading in vector processors

    Science.gov (United States)

    Evangelinos, Constantinos; Kim, Changhoan; Nair, Ravi

    2018-01-16

    In one embodiment, a system includes a processor having a vector processing mode and a multithreading mode. The processor is configured to operate on one thread per cycle in the multithreading mode. The processor includes a program counter register having a plurality of program counters, and the program counter register is vectorized. Each program counter in the program counter register represents a distinct corresponding thread of a plurality of threads. The processor is configured to execute the plurality of threads by activating the plurality of program counters in a round robin cycle.

  15. Scalable High-Performance Parallel Design for Network Intrusion Detection Systems on Many-Core Processors

    OpenAIRE

    Jiang, Hayang; Xie, Gaogang; Salamatian, Kavé; Mathy, Laurent

    2013-01-01

    Network Intrusion Detection Systems (NIDSes) face significant challenges coming from the relentless network link speed growth and increasing complexity of threats. Both hardware accelerated and parallel software-based NIDS solutions, based on commodity multi-core and GPU processors, have been proposed to overcome these challenges. Network Intrusion Detection Systems (NIDSes) face significant challenges coming from the relentless network link speed growth and increasing complexity of threats. ...

  16. Operating system concepts for embedded multicores

    OpenAIRE

    Horst, Oliver; Schmidt, Adriaan

    2014-01-01

    Currently we can see an increasing adoption of multi-core platforms in the area of embedded systems. While these new hardware platforms offer the potential to satisfy the ever increasing demand for computational power, they pose considerable challenges with regard to software development. This affects the application software itself, but also the system design and architecture. Here, we address the consequences for operating system architecture in embedded systems. After dis-cussing current a...

  17. HardwareSoftware Co-design for Heterogeneous Multi-core Platforms The hArtes Toolchain

    CERN Document Server

    2012-01-01

    This book describes the results and outcome of the FP6 project, known as hArtes, which focuses on the development of an integrated tool chain targeting a heterogeneous multi core platform comprising of a general purpose processor (ARM or powerPC), a DSP (the diopsis) and an FPGA. The tool chain takes existing source code and proposes transformations and mappings such that legacy code can easily be ported to a modern, multi-core platform. Benefits of the hArtes approach, described in this book, include: Uses a familiar programming paradigm: hArtes proposes a familiar programming paradigm which is compatible with the widely used programming practice, irrespective of the target platform. Enables users to view multiple cores as a single processor: the hArtes approach abstracts away the heterogeneity as well as the multi-core aspect of the underlying hardware so the developer can view the platform as consisting of a single, general purpose processor. Facilitates easy porting of existing applications: hArtes provid...

  18. The FastTrack Real Time Processor and Its Impact on Muon Isolation, Tau and b-Jet Online Selections at ATLAS

    CERN Document Server

    Crescioli, F; The ATLAS collaboration; Zhang, J; Boveia, A; Bevacqua, V; Cheng, Y; Canelli, F; Bogdan, M; Dell'Orso, M; Bossini, E; Citterio, M; Dunford, M; Drake, G; Beretta, M; Genat, JF; Annovi, A; Kim, YK; Kimura, N; Andreazza, A; Kapliy, A; Kasten, M; Piendibene, M; Negri, A; Meroni, C; Giannetti, P; Melachrinos, C; Hoff, J; Liberali, V; McCarn, A; Neubauer, M; Tang, F; Shochet, M; Stabile, A; Sartori, L; Sabatini, F; Proudfoot, J; Riva, M; Liu, T; Punzi, G; Vercesi, V; Tuggle, J; Todri, A; Tripiccione, R; Lanza, A; Wu, J; Yorita, K; Volpi, G; Vitullo, R.A; Sacco, I

    2010-01-01

    As the LHC luminosity is ramped up to 31034 cm−2 s−1 and beyond, the high rates, multiplicities, and energies of particles seen by the detectors will pose a unique challenge. Only a tiny fraction of the produced collisions can be stored on tape and immense real-time data reduction is needed. An effective trigger system must maintain high trigger efficiencies for the physics we are most interested in, and at the same time suppress the enormous QCD backgrounds. This requires massive computing power to minimize the online execution time of complex algorithms. A multi-level trigger is an effective solution for an otherwise impossible problem. The Fast Tracker (FTK) is a proposed upgrade to the current ATLAS trigger system that will operate at full Level-1 output rates and provide high quality tracks reconstructed over the entire detector by the start of processing in Level-2. FTK solves the combinatorial challenge inherent to tracking by exploiting massive parallelism of associative memories that can compa...

  19. The Fast Tracker Real Time Processor and Its Impact on the Muon Isolation, Tau amp; b-Jet Online Selections at ATLAS

    CERN Document Server

    Giannetti, P; The ATLAS collaboration; Crescioli, F

    2011-01-01

    As the LHC luminosity is ramped up to the SLHC Phase I level and beyond, the high rates, multiplicities, and energies of particles seen by the detectors will pose a unique challenge. Only a tiny fraction of the produced collisions can be stored on tape and immense real-time data reduction is needed. An effective trigger system must maintain high trigger efficiencies for the physics we are most interested in, and at the same time suppress the enormous QCD backgrounds. This requires massive computing power to minimize the online execution time of complex algorithms. A multi-level trigger is an effective solution for an otherwise impossible problem. The Fast Tracker (FTK)[1], is a proposed upgrade to the current ATLAS trigger system that will operate at full Level-1 output rates and provide high quality tracks reconstructed over the entire detector by the start of processing in Level-2. FTK solves the combinatorial challenge inherent to tracking by exploiting massive parallelism of associative memories [2] that ...

  20. Integrated fuel processor development

    International Nuclear Information System (INIS)

    Ahmed, S.; Pereira, C.; Lee, S. H. D.; Krumpelt, M.

    2001-01-01

    The Department of Energy's Office of Advanced Automotive Technologies has been supporting the development of fuel-flexible fuel processors at Argonne National Laboratory. These fuel processors will enable fuel cell vehicles to operate on fuels available through the existing infrastructure. The constraints of on-board space and weight require that these fuel processors be designed to be compact and lightweight, while meeting the performance targets for efficiency and gas quality needed for the fuel cell. This paper discusses the performance of a prototype fuel processor that has been designed and fabricated to operate with liquid fuels, such as gasoline, ethanol, methanol, etc. Rated for a capacity of 10 kWe (one-fifth of that needed for a car), the prototype fuel processor integrates the unit operations (vaporization, heat exchange, etc.) and processes (reforming, water-gas shift, preferential oxidation reactions, etc.) necessary to produce the hydrogen-rich gas (reformate) that will fuel the polymer electrolyte fuel cell stacks. The fuel processor work is being complemented by analytical and fundamental research. With the ultimate objective of meeting on-board fuel processor goals, these studies include: modeling fuel cell systems to identify design and operating features; evaluating alternative fuel processing options; and developing appropriate catalysts and materials. Issues and outstanding challenges that need to be overcome in order to develop practical, on-board devices are discussed

  1. Real-Time Operating Systems for Multicore Embedded Systems

    OpenAIRE

    Tomiyama, Hiroyuki; Honda, Shinya; Takada, Hiroaki

    2008-01-01

    Multicore systems-on-chip have become popular inthe design of embedded systems in order to simultaneously achieve high performance and low power consumption. On the software side, real-time operating systems are necessary in orderto handle growing complexity of embedded software. This paper describes requirements, design principles and implementation techniques for real-time operating systems to be used inasymmetric multicore systems.

  2. Effective particle magnetic moment of multi-core particles

    NARCIS (Netherlands)

    Ahrentorp, F.; Astalan, A.; Blomgren, J.; Jonasson, C.; Wetterskog, E.; Svedlindh, P.; Lak, A.; Ludwig, F.; Van IJzendoorn, L.J.; Westphal, F.; Grüttner, C.; Gehrke, N.; Gustafsson, S.; Olsson, E.; Johansson, C.

    2015-01-01

    In this study we investigate the magnetic behavior of magnetic multi-core particles and the differences in the magnetic properties of multi-core and single-core nanoparticles and correlate the results with the nanostructure of the different particles as determined from transmission electron

  3. A 96-channel FPGA-based Time-to-Digital Converter (TDC) and fast trigger processor module with multi-hit capability and pipeline

    International Nuclear Information System (INIS)

    Bogdan, Mircea; Frisch, Henry; Heintz, Mary; Paramonov, Alexander; Sanders, Harold; Chappa, Steve; DeMaat, Robert; Klein, Rod; Miao, Ting; Wilson, Peter; Phillips, Thomas J.

    2005-01-01

    We describe an field-programmable gate arrays based (FPGA), 96-channel, Time-to-Digital converter (TDC) and trigger logic board intended for use with the Central Outer Tracker (COT) [T. Affolder et al., Nucl. Instr. and Meth. A 526 (2004) 249] in the CDF Experiment [The CDF-II detector is described in the CDF Technical Design Report (TDR), FERMILAB-Pub-96/390-E. The TDC described here is intended as a further upgrade beyond that described in the TDR] at the Fermilab Tevatron. The COT system is digitized and read out by 315 TDC cards, each serving 96 wires of the chamber. The TDC is physically configured as a 9U VME card. The functionality is almost entirely programmed in firmware in two Altera Stratix FPGAs. The special capabilities of this device are the availability of 840MHz LVDS inputs, multiple phase-locked clock modules, and abundant memory. The TDC system operates with an input resolution of 1.2ns, a minimum input pulse width of 4.8ns and a minimum separation of 4.8ns between pulses. Each input can accept up to 7 hits per collision. The time-to-digital conversion is done by first sampling each of the 96 inputs in 1.2-ns bins and filling a circular memory; the memory addresses of logical transitions (edges) in the input data are then translated into the time of arrival and width of the COT pulses. Memory pipelines with a depth of 5.5μs allow deadtime-less operation in the first-level trigger; the data are multiple-buffered to diminish deadtime in the second-level trigger. The complete process of edge-detection and filling of buffers for readout takes 12μs. The TDC VME interface allows a 64-bit Chain Block Transfer of multiple boards in a crate with transfer-rates up to 47Mbytes/s. The TDC module also produces prompt trigger data every Tevatron crossing via a deadtimeless fast logic path that can be easily reprogrammed. The trigger bits are clocked onto the P3 VME backplane connector with a 22-ns clock for transmission to the trigger. The full TDC design and

  4. FAST

    DEFF Research Database (Denmark)

    Zuidmeer-Jongejan, Laurian; Fernandez-Rivas, Montserrat; Poulsen, Lars K.

    2012-01-01

    ABSTRACT: The FAST project (Food Allergy Specific Immunotherapy) aims at the development of safe and effective treatment of food allergies, targeting prevalent, persistent and severe allergy to fish and peach. Classical allergen-specific immunotherapy (SIT), using subcutaneous injections with aqu...

  5. Logistic Fuel Processor Development

    National Research Council Canada - National Science Library

    Salavani, Reza

    2004-01-01

    ... to light gases then steam reform the light gases into hydrogen rich stream. This report documents the efforts in developing a fuel processor capable of providing hydrogen to a 3kW fuel cell stack...

  6. 3081/E processor

    International Nuclear Information System (INIS)

    Kunz, P.F.; Gravina, M.; Oxoby, G.

    1984-04-01

    The 3081/E project was formed to prepare a much improved IBM mainframe emulator for the future. Its design is based on a large amount of experience in using the 168/E processor to increase available CPU power in both online and offline environments. The processor will be at least equal to the execution speed of a 370/168 and up to 1.5 times faster for heavy floating point code. A single processor will thus be at least four times more powerful than the VAX 11/780, and five processors on a system would equal at least the performance of the IBM 3081K. With its large memory space and simple but flexible high speed interface, the 3081/E is well suited for the online and offline needs of high energy physics in the future

  7. Logistic Fuel Processor Development

    National Research Council Canada - National Science Library

    Salavani, Reza

    2004-01-01

    The Air Base Technologies Division of the Air Force Research Laboratory has developed a logistic fuel processor that removes the sulfur content of the fuel and in the process converts logistic fuel...

  8. Adaptive signal processor

    Energy Technology Data Exchange (ETDEWEB)

    Walz, H.V.

    1980-07-01

    An experimental, general purpose adaptive signal processor system has been developed, utilizing a quantized (clipped) version of the Widrow-Hoff least-mean-square adaptive algorithm developed by Moschner. The system accommodates 64 adaptive weight channels with 8-bit resolution for each weight. Internal weight update arithmetic is performed with 16-bit resolution, and the system error signal is measured with 12-bit resolution. An adapt cycle of adjusting all 64 weight channels is accomplished in 8 ..mu..sec. Hardware of the signal processor utilizes primarily Schottky-TTL type integrated circuits. A prototype system with 24 weight channels has been constructed and tested. This report presents details of the system design and describes basic experiments performed with the prototype signal processor. Finally some system configurations and applications for this adaptive signal processor are discussed.

  9. Adaptive signal processor

    International Nuclear Information System (INIS)

    Walz, H.V.

    1980-07-01

    An experimental, general purpose adaptive signal processor system has been developed, utilizing a quantized (clipped) version of the Widrow-Hoff least-mean-square adaptive algorithm developed by Moschner. The system accommodates 64 adaptive weight channels with 8-bit resolution for each weight. Internal weight update arithmetic is performed with 16-bit resolution, and the system error signal is measured with 12-bit resolution. An adapt cycle of adjusting all 64 weight channels is accomplished in 8 μsec. Hardware of the signal processor utilizes primarily Schottky-TTL type integrated circuits. A prototype system with 24 weight channels has been constructed and tested. This report presents details of the system design and describes basic experiments performed with the prototype signal processor. Finally some system configurations and applications for this adaptive signal processor are discussed

  10. Efficient Co-Simulation of Multicore Systems

    DEFF Research Database (Denmark)

    Brock-Nannestad, Laust; Karlsson, Sven

    2011-01-01

    the hardware state of a multicore design while it is running on an FPGA. With minimal changes to the design and using only the built-in JTAG programming and debug- ging facilities, we describe how to transfer the state from an FPGA to a simulator. We also show how the state can be transferred back from...... the simulator to FPGA. Given that the design runs in real-time on the FPGA, the end result is speed improvements of orders of magnitude over traditional pure software simulation....

  11. Application Scalability and Performance on Multicore Architectures

    National Research Council Canada - National Science Library

    Simon, Tyler A; Cable, Sam B; Mahmoodi, Mahin

    2007-01-01

    The US Army Engineer Research and Development Center Major Shared Resource Center has recently upgraded its Cray XT3 system from single-core to dual-core AMD Opteron processors and has procured a quad...

  12. Array processor architecture

    Science.gov (United States)

    Barnes, George H. (Inventor); Lundstrom, Stephen F. (Inventor); Shafer, Philip E. (Inventor)

    1983-01-01

    A high speed parallel array data processing architecture fashioned under a computational envelope approach includes a data base memory for secondary storage of programs and data, and a plurality of memory modules interconnected to a plurality of processing modules by a connection network of the Omega gender. Programs and data are fed from the data base memory to the plurality of memory modules and from hence the programs are fed through the connection network to the array of processors (one copy of each program for each processor). Execution of the programs occur with the processors operating normally quite independently of each other in a multiprocessing fashion. For data dependent operations and other suitable operations, all processors are instructed to finish one given task or program branch before all are instructed to proceed in parallel processing fashion on the next instruction. Even when functioning in the parallel processing mode however, the processors are not locked-step but execute their own copy of the program individually unless or until another overall processor array synchronization instruction is issued.

  13. T-L Plane Abstraction-Based Energy-Efficient Real-Time Scheduling for Multi-Core Wireless Sensors

    Directory of Open Access Journals (Sweden)

    Youngmin Kim

    2016-07-01

    Full Text Available Energy efficiency is considered as a critical requirement for wireless sensor networks. As more wireless sensor nodes are equipped with multi-cores, there are emerging needs for energy-efficient real-time scheduling algorithms. The T-L plane-based scheme is known to be an optimal global scheduling technique for periodic real-time tasks on multi-cores. Unfortunately, there has been a scarcity of studies on extending T-L plane-based scheduling algorithms to exploit energy-saving techniques. In this paper, we propose a new T-L plane-based algorithm enabling energy-efficient real-time scheduling on multi-core sensor nodes with dynamic power management (DPM. Our approach addresses the overhead of processor mode transitions and reduces fragmentations of the idle time, which are inherent in T-L plane-based algorithms. Our experimental results show the effectiveness of the proposed algorithm compared to other energy-aware scheduling methods on T-L plane abstraction.

  14. An Energy-Aware Runtime Management of Multi-Core Sensory Swarms

    Directory of Open Access Journals (Sweden)

    Sungchan Kim

    2017-08-01

    Full Text Available In sensory swarms, minimizing energy consumption under performance constraint is one of the key objectives. One possible approach to this problem is to monitor application workload that is subject to change at runtime, and to adjust system configuration adaptively to satisfy the performance goal. As today’s sensory swarms are usually implemented using multi-core processors with adjustable clock frequency, we propose to monitor the CPU workload periodically and adjust the task-to-core allocation or clock frequency in an energy-efficient way in response to the workload variations. In doing so, we present an online heuristic that determines the most energy-efficient adjustment that satisfies the performance requirement. The proposed method is based on a simple yet effective energy model that is built upon performance prediction using IPC (instructions per cycle measured online and power equation derived empirically. The use of IPC accounts for memory intensities of a given workload, enabling the accurate prediction of execution time. Hence, the model allows us to rapidly and accurately estimate the effect of the two control knobs, clock frequency adjustment and core allocation. The experiments show that the proposed technique delivers considerable energy saving of up to 45%compared to the state-of-the-art multi-core energy management technique.

  15. Functional unit for a processor

    NARCIS (Netherlands)

    Rohani, A.; Kerkhoff, Hans G.

    2013-01-01

    The invention relates to a functional unit for a processor, such as a Very Large Instruction Word Processor. The invention further relates to a processor comprising at least one such functional unit. The invention further relates to a functional unit and processor capable of mitigating the effect of

  16. Real time processor for array speckle interferometry

    Science.gov (United States)

    Chin, Gordon; Florez, Jose; Borelli, Renan; Fong, Wai; Miko, Joseph; Trujillo, Carlos

    1989-02-01

    The authors are constructing a real-time processor to acquire image frames, perform array flat-fielding, execute a 64 x 64 element two-dimensional complex FFT (fast Fourier transform) and average the power spectrum, all within the 25 ms coherence time for speckles at near-IR (infrared) wavelength. The processor will be a compact unit controlled by a PC with real-time display and data storage capability. This will provide the ability to optimize observations and obtain results on the telescope rather than waiting several weeks before the data can be analyzed and viewed with offline methods. The image acquisition and processing, design criteria, and processor architecture are described.

  17. The UA1 upgrade calorimeter trigger processor

    International Nuclear Information System (INIS)

    Bains, M.; Charleton, D.; Ellis, N.; Garvey, J.; Gregory, J.; Jimack, M.P.; Jovanovic, P.; Kenyon, I.R.; Baird, S.A.; Campbell, D.; Cawthraw, M.; Coughlan, J.; Flynn, P.; Galagedera, S.; Grayer, G.; Halsall, R.; Shah, T.P.; Stephens, R.; Biddulph, P.; Eisenhandler, E.; Fensome, I.F.; Landon, M.; Robinson, D.; Oliver, J.; Sumorok, K.

    1990-01-01

    The increased luminosity of the improved CERN Collider and the more subtle signals of second-generation collider physics demand increasingly sophisticated triggering. We have built a new first-level trigger processor designed to use the excellent granularity of the UA1 upgrade calorimeter. This device is entirely digital and handles events in 1.5 μs, thus introducing no dead time. Its most novel feature is fast two-dimensional electromagnetic cluster-finding with the possibility of demanding an isolated shower of limited penetration. The processor allows multiple combinations of triggers on electromagnetic showers, hadronic jets and energy sums, including a total-energy veto of multiple interactions and a full vector sum of missing transverse energy. This hard-wired processor is about five times more powerful than its predecessor, and makes extensive use of pipelining techniques. It was used extensively in the 1988 and 1989 runs of the CERN Collider. (orig.)

  18. The UA1 upgrade calorimeter trigger processor

    International Nuclear Information System (INIS)

    Bains, N.; Baird, S.A.; Biddulph, P.

    1990-01-01

    The increased luminosity of the improved CERN Collider and the more subtle signals of second-generation collider physics demand increasingly sophisticated triggering. We have built a new first-level trigger processor designed to use the excellent granularity of the UA1 upgrade calorimeter. This device is entirely digital and handles events in 1.5 μs, thus introducing no deadtime. Its most novel feature is fast two-dimensional electromagnetic cluster-finding with the possibility of demanding an isolated shower of limited penetration. The processor allows multiple combinations of triggers on electromagnetic showers, hadronic jets and energy sums, including a total-energy veto of multiple interactions and a full vector sum of missing transverse energy. This hard-wired processor is about five times more powerful than its predecessor, and makes extensive use of pipelining techniques. It was used extensively in the 1988 and 1989 runs of the CERN Collider. (author)

  19. Program Execution on Reconfigurable Multicore Architectures

    Directory of Open Access Journals (Sweden)

    Sanjiva Prasad

    2016-06-01

    Full Text Available Based on the two observations that diverse applications perform better on different multicore architectures, and that different phases of an application may have vastly different resource requirements, Pal et al. proposed a novel reconfigurable hardware approach for executing multithreaded programs. Instead of mapping a concurrent program to a fixed architecture, the architecture adaptively reconfigures itself to meet the application's concurrency and communication requirements, yielding significant improvements in performance. Based on our earlier abstract operational framework for multicore execution with hierarchical memory structures, we describe execution of multithreaded programs on reconfigurable architectures that support a variety of clustered configurations. Such reconfiguration may not preserve the semantics of programs due to the possible introduction of race conditions arising from concurrent accesses to shared memory by threads running on the different cores. We present an intuitive partial ordering notion on the cluster configurations, and show that the semantics of multithreaded programs is always preserved for reconfigurations "upward" in that ordering, whereas semantics preservation for arbitrary reconfigurations can be guaranteed for well-synchronised programs. We further show that a simple approximate notion of efficiency of execution on the different configurations can be obtained using the notion of amortised bisimulations, and extend it to dynamic reconfiguration.

  20. The UA1 trigger processor

    International Nuclear Information System (INIS)

    Grayer, G.H.

    1981-01-01

    Experiment UA1 is a large multi-purpose spectrometer at the CERN proton-antiproton collider, scheduled for late 1981. The principal trigger is formed on the basis of the energy deposition in calorimeters. A trigger decision taken in under 2.4 microseconds can avoid dead time losses due to the bunched nature of the beam. To achieve this we have built fast 8-bit charge to digital converters followed by two identical digital processors tailored to the experiment. The outputs of groups of the 2440 photomultipliers in the calorimeters are summed to form a total of 288 input channels to the ADCs. A look-up table in RAM is used to convert the digitised photomultiplier signals to energy in one processor, combinations of input channels, and also counts the number of clusters with electromagnetic or hadronic energy above pre-determined levels. Up to twelve combinations of these conditions, together with external information, may be combined in coincidence or in veto to form the final trigger. Provision has been made for testing using simulated data in an off-line mode, and sampling real data when on-line. (orig.)

  1. Cross talk analysis in multicore optical fibers by supermode theory.

    Science.gov (United States)

    Szostkiewicz, Lukasz; Napierala, Marek; Ziolowicz, Anna; Pytel, Anna; Tenderenda, Tadeusz; Nasilowski, Tomasz

    2016-08-15

    We discuss the theoretical aspects of core-to-core power transfer in multicore fibers relying on supermode theory. Based on a dual core fiber model, we investigate the consequences of this approach, such as the influence of initial excitation conditions on cross talk. Supermode interpretation of power coupling proves to be intuitive and thus may lead to new concepts of multicore fiber-based devices. As a conclusion, we propose a definition of a uniform cross talk parameter that describes multicore fiber design.

  2. Design issues for numerical libraries on scalable multicore architectures

    International Nuclear Information System (INIS)

    Heroux, M A

    2008-01-01

    Future generations of scalable computers will rely on multicore nodes for a significant portion of overall system performance. At present, most applications and libraries cannot exploit multiple cores beyond running addition MPI processes per node. In this paper we discuss important multicore architecture issues, programming models, algorithms requirements and software design related to effective use of scalable multicore computers. In particular, we focus on important issues for library research and development, making recommendations for how to effectively develop libraries for future scalable computer systems

  3. 3081//sub E/ processor

    International Nuclear Information System (INIS)

    Kunz, P.F.; Gravina, M.; Oxoby, G.; Trang, Q.; Fucci, A.; Jacobs, D.; Martin, B.; Storr, K.

    1983-03-01

    Since the introduction of the 168//sub E/, emulating processors have been successful over an amazingly wide range of applications. This paper will describe a second generation processor, the 3081//sub E/. This new processor, which is being developed as a collaboration between SLAC and CERN, goes beyond just fixing the obvious faults of the 168//sub E/. Not only will the 3081//sub E/ have much more memory space, incorporate many more IBM instructions, and have much more memory space, incorporate many more IBM instructions, and have full double precision floating point arithmetic, but it will also have faster execution times and be much simpler to build, debug, and maintain. The simple interface and reasonable cost of the 168//sub E/ will be maintained for the 3081//sub E/

  4. Photonic-Networks-on-Chip for High Performance Radiation Survivable Multi-Core Processor Systems

    Science.gov (United States)

    2013-12-01

    Loss Spectra” Proceedings of SPIE 8255, (2012) and in a journal publication: M. T. Crowley, D. Murrell, N. Patel, M. Breivik , C.-Y. Lin, Y. Li, B.-O...Crowley, D. Murrell, N. Patel, M. Breivik , C.-Y. Lin, Y. Li, B.-O. Fimland and L. F. Lester, "Analytical Modeling of the Temperature Performance of

  5. Monte Carlo simulation of magnetic multi-core nanoparticles

    International Nuclear Information System (INIS)

    Schaller, Vincent; Wahnstroem, Goeran; Sanz-Velasco, Anke; Enoksson, Peter; Johansson, Christer

    2009-01-01

    In this paper, a Monte Carlo simulation is carried out to evaluate the equilibrium magnetization of magnetic multi-core nanoparticles in a liquid and subjected to a static magnetic field. The particles contain a magnetic multi-core consisting of a cluster of magnetic single-domains of magnetite. We show that the magnetization of multi-core nanoparticles cannot be fully described by a Langevin model. Inter-domain dipolar interactions and domain magnetic anisotropy contribute to decrease the magnetization of the particles, whereas the single-domain size distribution yields an increase in magnetization. Also, we show that the interactions affect the effective magnetic moment of the multi-core nanoparticles.

  6. Programmable gain equalizer for multi-core fiber amplifiers

    NARCIS (Netherlands)

    Fontaine, N.K.; Guan, B.; Ryf, R.; Chen, H.; Koonen, A.M.J.; Ben Yoo, S.J.; Abedin, K.; Fini, J.; Taunay, T.F.; Neilson, D.T.

    2014-01-01

    We demonstrate a programmable gain equalizer for 7-core fiber that can independently equalize spectra or block wavelengths in each core across the C-band. It is spliced directly to a side-pumped multi-core amplifying fiber.

  7. MPC Related Computational Capabilities of ARMv7A Processors

    DEFF Research Database (Denmark)

    Frison, Gianluca; Jørgensen, John Bagterp

    2015-01-01

    In recent years, the mass market of mobile devices has pushed the demand for increasingly fast but cheap processors. ARM, the world leader in this sector, has developed the Cortex-A series of processors with focus on computationally intensive applications. If properly programmed, these processors...... are powerful enough to solve the complex optimization problems arising in MPC in real-time, while keeping the traditional low-cost and low-power consumption. This makes these processors ideal candidates for use in embedded MPC. In this paper, we investigate the floating-point capabilities of Cortex A7, A9...... and A15 and show how to exploit the unique features of each processor to obtain the best performance, in the context of a novel implementation method for the linear-algebra routines used in MPC solvers. This method adapts high-performance computing techniques to the needs of embedded MPC. In particular...

  8. Dynamic Voltage-Frequency and Workload Joint Scaling Power Management for Energy Harvesting Multi-Core WSN Node SoC

    Directory of Open Access Journals (Sweden)

    Xiangyu Li

    2017-02-01

    Full Text Available This paper proposes a scheduling and power management solution for energy harvesting heterogeneous multi-core WSN node SoC such that the system continues to operate perennially and uses the harvested energy efficiently. The solution consists of a heterogeneous multi-core system oriented task scheduling algorithm and a low-complexity dynamic workload scaling and configuration optimization algorithm suitable for light-weight platforms. Moreover, considering the power consumption of most WSN applications have the characteristic of data dependent behavior, we introduce branches handling mechanism into the solution as well. The experimental result shows that the proposed algorithm can operate in real-time on a lightweight embedded processor (MSP430, and that it can make a system do more valuable works and make more than 99.9% use of the power budget.

  9. Automatic Functionality Assignment to AUTOSAR Multicore Distributed Architectures

    DEFF Research Database (Denmark)

    Maticu, Florin; Pop, Paul; Axbrink, Christian

    2016-01-01

    The automotive electronic architectures have moved from federated architectures, where one function is implemented in one ECU (Electronic Control Unit), to distributed architectures, where several functions may share resources on an ECU. In addition, multicore ECUs are being adopted because...... of better performance, cost, size, fault-tolerance and power consumption. In this paper we present an approach for the automatic software functionality assignment to multicore distributed architectures. We consider that the systems use the AUTomotive Open System ARchitecture (AUTOSAR). The functionality...

  10. Integrated Optoelectronic Networks for Application-Driven Multicore Computing

    Science.gov (United States)

    2017-05-08

    AFRL-AFOSR-VA-TR-2017-0102 Integrated Optoelectronic Networks for Application- Driven Multicore Computing Sudeep Pasricha COLORADO STATE UNIVERSITY...AND SUBTITLE Integrated Optoelectronic Networks for Application-Driven Multicore Computing 5a.  CONTRACT NUMBER 5b.  GRANT NUMBER FA9550-13-1-0110 5c...and supportive materials with innovative architectural designs that integrate these components according to system-wide application needs. 15

  11. The Central Trigger Processor (CTP)

    CERN Multimedia

    Franchini, Matteo

    2016-01-01

    The Central Trigger Processor (CTP) receives trigger information from the calorimeter and muon trigger processors, as well as from other sources of trigger. It makes the Level-1 decision (L1A) based on a trigger menu.

  12. Very Long Instruction Word Processors

    Indian Academy of Sciences (India)

    Pentium Processor have modified the processor architecture to exploit parallelism in a program. .... The type of operation itself is encoded using 14 bits. .... text of designing simple architectures with low power consump- tion and execute x86 ...

  13. Data Rate Estimation for Wireless Core-to-Cache Communication in Multicore CPUs

    Directory of Open Access Journals (Sweden)

    M. Komar

    2015-01-01

    Full Text Available In this paper, a principal architecture of common purpose CPU and its main components are discussed, CPUs evolution is considered and drawbacks that prevent future CPU development are mentioned. Further, solutions proposed so far are addressed and a new CPU architecture is introduced. The proposed architecture is based on wireless cache access that enables a reliable interaction between cores in multicore CPUs using terahertz band, 0.1-10THz. The presented architecture addresses the scalability problem of existing processors and may potentially allow to scale them to tens of cores. As in-depth analysis of the applicability of the suggested architecture requires accurate prediction of traffic in current and next generations of processors, we consider a set of approaches for traffic estimation in modern CPUs discussing their benefits and drawbacks. The authors identify traffic measurements by using existing software tools as the most promising approach for traffic estimation, and they use Intel Performance Counter Monitor for this purpose. Three types of CPU loads are considered including two artificial tests and background system load. For each load type the amount of data transmitted through the L2-L3 interface is reported for various input parameters including the number of active cores and their dependences on the number of cores and operational frequency.

  14. Getting More From Your Multicore: Exploiting OpenMP for Astronomy

    Science.gov (United States)

    Noble, M. S.

    2008-08-01

    Motivated by the emergence of multicore architectures, and the reality that parallelism is rarely used for analysis in observational astronomy, we demonstrate how general users may employ tightly-coupled multiprocessors in scriptable research calculations while requiring no special knowledge of parallel programming. Our method rests on the observation that much of the appeal of high-level vectorized languages like IDL or MatLab stems from relatively simple internal loops over regular array structures, and that these loops are highly amenable to automatic parallelization with OpenMP. We discuss how ISIS, an open-source astrophysical analysis system embedding the slang numerical language, was easily adapted to exploit this pattern. Drawing from a common astrophysical problem, model fitting, we present beneficial speedups for several machine and compiler configurations. These results complement our previous efforts with PVM, and together lead us to believe that ISIS is the only general purpose spectroscopy system in which such a range of parallelism - from single processors on multiple machines to multiple processors on single machines - has been demonstrated.

  15. The Molen Polymorphic Media Processor

    NARCIS (Netherlands)

    Kuzmanov, G.K.

    2004-01-01

    In this dissertation, we address high performance media processing based on a tightly coupled co-processor architectural paradigm. More specifically, we introduce a reconfigurable media augmentation of a general purpose processor and implement it into a fully operational processor prototype. The

  16. Dual-core Itanium Processor

    CERN Multimedia

    2006-01-01

    Intel’s first dual-core Itanium processor, code-named "Montecito" is a major release of Intel's Itanium 2 Processor Family, which implements the Intel Itanium architecture on a dual-core processor with two cores per die (integrated circuit). Itanium 2 is much more powerful than its predecessor. It has lower power consumption and thermal dissipation.

  17. ATCA/AXIe compatible board for fast control and data acquisition in nuclear fusion experiments

    International Nuclear Information System (INIS)

    Batista, A.J.N.; Leong, C.; Bexiga, V.; Rodrigues, A.P.; Combo, A.; Carvalho, B.B.; Fortunato, J.; Correia, M.; Teixeira, J.P.; Teixeira, I.C.; Sousa, J.; Gonçalves, B.; Varandas, C.A.F.

    2012-01-01

    Highlights: ► High performance board for fast control and data acquisition. ► Large IO channel number per board with galvanic isolation. ► Optimized for high reliability and availability. ► Targeted for nuclear fusion experiments with long duration discharges. ► To be used on the ITER Fast Plant System Controller prototype. - Abstract: An in-house development of an Advanced Telecommunications Computing Architecture (ATCA) board for fast control and data acquisition, with Input/Output (IO) processing capability, is presented. The architecture, compatible with the ATCA (PICMG 3.4) and ATCA eXtensions for Instrumentation (AXIe) specifications, comprises a passive Rear Transition Module (RTM) for IO connectivity to ease hot-swap maintenance and simultaneously to increase cabling life cycle. The board complies with ITER Fast Plant System Controller (FPSC) guidelines for rear IO connectivity and redundancy, in order to provide high levels of reliability and availability to the control and data acquisition systems of nuclear fusion devices with long duration plasma discharges. Simultaneously digitized data from all Analog to Digital Converters (ADC) of the board can be filtered/decimated in a Field Programmable Gate Array (FPGA), decreasing data throughput, increasing resolution, and sent through Peripheral Component Interconnect (PCI) Express to multi-core processors in the ATCA shelf hub slots. Concurrently the multi-core processors can update the board Digital to Analog Converters (DAC) in real-time. Full-duplex point-to-point communication links between all FPGAs, of peer boards inside the shelf, allow the implementation of distributed algorithms and Multi-Input Multi-Output (MIMO) systems. Support for several timing and synchronization solutions is also provided. Some key features are onboard ADC or DAC modules with galvanic isolation, Xilinx Virtex 6 FPGA, standard Dual Data Rate (DDR) 3 SODIMM memory, standard CompactFLASH memory card, Intelligent

  18. Multimode power processor

    Science.gov (United States)

    O'Sullivan, George A.; O'Sullivan, Joseph A.

    1999-01-01

    In one embodiment, a power processor which operates in three modes: an inverter mode wherein power is delivered from a battery to an AC power grid or load; a battery charger mode wherein the battery is charged by a generator; and a parallel mode wherein the generator supplies power to the AC power grid or load in parallel with the battery. In the parallel mode, the system adapts to arbitrary non-linear loads. The power processor may operate on a per-phase basis wherein the load may be synthetically transferred from one phase to another by way of a bumpless transfer which causes no interruption of power to the load when transferring energy sources. Voltage transients and frequency transients delivered to the load when switching between the generator and battery sources are minimized, thereby providing an uninterruptible power supply. The power processor may be used as part of a hybrid electrical power source system which may contain, in one embodiment, a photovoltaic array, diesel engine, and battery power sources.

  19. Multi-core processing and scheduling performance in CMS

    International Nuclear Information System (INIS)

    Hernández, J M; Evans, D; Foulkes, S

    2012-01-01

    Commodity hardware is going many-core. We might soon not be able to satisfy the job memory needs per core in the current single-core processing model in High Energy Physics. In addition, an ever increasing number of independent and incoherent jobs running on the same physical hardware not sharing resources might significantly affect processing performance. It will be essential to effectively utilize the multi-core architecture. CMS has incorporated support for multi-core processing in the event processing framework and the workload management system. Multi-core processing jobs share common data in memory, such us the code libraries, detector geometry and conditions data, resulting in a much lower memory usage than standard single-core independent jobs. Exploiting this new processing model requires a new model in computing resource allocation, departing from the standard single-core allocation for a job. The experiment job management system needs to have control over a larger quantum of resource since multi-core aware jobs require the scheduling of multiples cores simultaneously. CMS is exploring the approach of using whole nodes as unit in the workload management system where all cores of a node are allocated to a multi-core job. Whole-node scheduling allows for optimization of the data/workflow management (e.g. I/O caching, local merging) but efficient utilization of all scheduled cores is challenging. Dedicated whole-node queues have been setup at all Tier-1 centers for exploring multi-core processing workflows in CMS. We present the evaluation of the performance scheduling and executing multi-core workflows in whole-node queues compared to the standard single-core processing workflows.

  20. muBLASTP: database-indexed protein sequence search on multicore CPUs.

    Science.gov (United States)

    Zhang, Jing; Misra, Sanchit; Wang, Hao; Feng, Wu-Chun

    2016-11-04

    The Basic Local Alignment Search Tool (BLAST) is a fundamental program in the life sciences that searches databases for sequences that are most similar to a query sequence. Currently, the BLAST algorithm utilizes a query-indexed approach. Although many approaches suggest that sequence search with a database index can achieve much higher throughput (e.g., BLAT, SSAHA, and CAFE), they cannot deliver the same level of sensitivity as the query-indexed BLAST, i.e., NCBI BLAST, or they can only support nucleotide sequence search, e.g., MegaBLAST. Due to different challenges and characteristics between query indexing and database indexing, the existing techniques for query-indexed search cannot be used into database indexed search. muBLASTP, a novel database-indexed BLAST for protein sequence search, delivers identical hits returned to NCBI BLAST. On Intel Haswell multicore CPUs, for a single query, the single-threaded muBLASTP achieves up to a 4.41-fold speedup for alignment stages, and up to a 1.75-fold end-to-end speedup over single-threaded NCBI BLAST. For a batch of queries, the multithreaded muBLASTP achieves up to a 5.7-fold speedups for alignment stages, and up to a 4.56-fold end-to-end speedup over multithreaded NCBI BLAST. With a newly designed index structure for protein database and associated optimizations in BLASTP algorithm, we re-factored BLASTP algorithm for modern multicore processors that achieves much higher throughput with acceptable memory footprint for the database index.

  1. Bulk-memory processor for data acquisition

    International Nuclear Information System (INIS)

    Nelson, R.O.; McMillan, D.E.; Sunier, J.W.; Meier, M.; Poore, R.V.

    1981-01-01

    To meet the diverse needs and data rate requirements at the Van de Graaff and Weapons Neutron Research (WNR) facilities, a bulk memory system has been implemented which includes a fast and flexible processor. This bulk memory processor (BMP) utilizes bit slice and microcode techniques and features a 24 bit wide internal architecture allowing direct addressing of up to 16 megawords of memory and histogramming up to 16 million counts per channel without overflow. The BMP is interfaced to the MOSTEK MK 8000 bulk memory system and to the standard MODCOMP computer I/O bus. Coding for the BMP both at the microcode level and with macro instructions is supported. The generalized data acquisition system has been extended to support the BMP in a manner transparent to the user

  2. High-spatial-multiplicity multi-core fibres for future dense space-division-multiplexing system

    DEFF Research Database (Denmark)

    Matsuo, Shoichiro; Takenaga, Katsuhiro; Saitoh, Kunimasa

    2015-01-01

    Design and fabrication results of high-spatial-multiplicity multi-core fibres are presented. A 30-core single-mode multi-core fibre and a 36-spatial-channels multi-core fibre with low differential mode delay have been realized with low-crosstalk characteristics through optimisation of core struct...

  3. High core count single-mode multicore fiber for dense space division multiplexing

    DEFF Research Database (Denmark)

    Aikawa, K.; Sasaki, Y.; Amma, Y.

    2016-01-01

    Multicore fibers and few-mode fibers have the potential to realize dense-space-division multiplexing systems. Several dense-space-division multiplexing system transmission experiments over multicore fibers and few-mode fibers have been demonstrated so far. Multicore fibers, including recent resul...

  4. Video frame processor

    International Nuclear Information System (INIS)

    Joshi, V.M.; Agashe, Alok; Bairi, B.R.

    1993-01-01

    This report provides technical description regarding the Video Frame Processor (VFP) developed at Bhabha Atomic Research Centre. The instrument provides capture of video images available in CCIR format. Two memory planes each with a capacity of 512 x 512 x 8 bit data enable storage of two video image frames. The stored image can be processed on-line and on-line image subtraction can also be carried out for image comparisons. The VFP is a PC Add-on board and is I/O mapped within the host IBM PC/AT compatible computer. (author). 9 refs., 4 figs., 19 photographs

  5. Optical Finite Element Processor

    Science.gov (United States)

    Casasent, David; Taylor, Bradley K.

    1986-01-01

    A new high-accuracy optical linear algebra processor (OLAP) with many advantageous features is described. It achieves floating point accuracy, handles bipolar data by sign-magnitude representation, performs LU decomposition using only one channel, easily partitions and considers data flow. A new application (finite element (FE) structural analysis) for OLAPs is introduced and the results of a case study presented. Error sources in encoded OLAPs are addressed for the first time. Their modeling and simulation are discussed and quantitative data are presented. Dominant error sources and the effects of composite error sources are analyzed.

  6. Towards Fast Reverse Time Migration Kernels using Multi-threaded Wavefront Diamond Tiling

    KAUST Repository

    Malas, T.; Hager, G.; Ltaief, Hatem; Keyes, David E.

    2015-01-01

    Today’s high-end multicore systems are characterized by a deep memory hierarchy, i.e., several levels of local and shared caches, with limited size and bandwidth per core. The ever-increasing gap between the processor and memory speed will further

  7. Effective particle magnetic moment of multi-core particles

    International Nuclear Information System (INIS)

    Ahrentorp, Fredrik; Astalan, Andrea; Blomgren, Jakob; Jonasson, Christian; Wetterskog, Erik; Svedlindh, Peter; Lak, Aidin; Ludwig, Frank; IJzendoorn, Leo J. van; Westphal, Fritz; Grüttner, Cordula; Gehrke, Nicole; Gustafsson, Stefan; Olsson, Eva; Johansson, Christer

    2015-01-01

    In this study we investigate the magnetic behavior of magnetic multi-core particles and the differences in the magnetic properties of multi-core and single-core nanoparticles and correlate the results with the nanostructure of the different particles as determined from transmission electron microscopy (TEM). We also investigate how the effective particle magnetic moment is coupled to the individual moments of the single-domain nanocrystals by using different measurement techniques: DC magnetometry, AC susceptometry, dynamic light scattering and TEM. We have studied two magnetic multi-core particle systems – BNF Starch from Micromod with a median particle diameter of 100 nm and FeraSpin R from nanoPET with a median particle diameter of 70 nm – and one single-core particle system – SHP25 from Ocean NanoTech with a median particle core diameter of 25 nm

  8. Effective particle magnetic moment of multi-core particles

    Science.gov (United States)

    Ahrentorp, Fredrik; Astalan, Andrea; Blomgren, Jakob; Jonasson, Christian; Wetterskog, Erik; Svedlindh, Peter; Lak, Aidin; Ludwig, Frank; van IJzendoorn, Leo J.; Westphal, Fritz; Grüttner, Cordula; Gehrke, Nicole; Gustafsson, Stefan; Olsson, Eva; Johansson, Christer

    2015-04-01

    In this study we investigate the magnetic behavior of magnetic multi-core particles and the differences in the magnetic properties of multi-core and single-core nanoparticles and correlate the results with the nanostructure of the different particles as determined from transmission electron microscopy (TEM). We also investigate how the effective particle magnetic moment is coupled to the individual moments of the single-domain nanocrystals by using different measurement techniques: DC magnetometry, AC susceptometry, dynamic light scattering and TEM. We have studied two magnetic multi-core particle systems - BNF Starch from Micromod with a median particle diameter of 100 nm and FeraSpin R from nanoPET with a median particle diameter of 70 nm - and one single-core particle system - SHP25 from Ocean NanoTech with a median particle core diameter of 25 nm.

  9. Effective particle magnetic moment of multi-core particles

    Energy Technology Data Exchange (ETDEWEB)

    Ahrentorp, Fredrik; Astalan, Andrea; Blomgren, Jakob; Jonasson, Christian [Acreo Swedish ICT AB, Arvid Hedvalls backe 4, SE-411 33 Göteborg (Sweden); Wetterskog, Erik; Svedlindh, Peter [Department of Engineering Sciences, Uppsala University, Box 534, SE-751 21 Uppsala (Sweden); Lak, Aidin; Ludwig, Frank [Institute of Electrical Measurement and Fundamental Electrical Engineering, TU Braunschweig, D‐38106 Braunschweig Germany (Germany); IJzendoorn, Leo J. van [Department of Applied Physics, Eindhoven University of Technology, 5600 MB Eindhoven (Netherlands); Westphal, Fritz; Grüttner, Cordula [Micromod Partikeltechnologie GmbH, D ‐18119 Rostock (Germany); Gehrke, Nicole [nanoPET Pharma GmbH, D ‐10115 Berlin Germany (Germany); Gustafsson, Stefan; Olsson, Eva [Department of Applied Physics, Chalmers University of Technology, SE-412 96 Göteborg (Sweden); Johansson, Christer, E-mail: christer.johansson@acreo.se [Acreo Swedish ICT AB, Arvid Hedvalls backe 4, SE-411 33 Göteborg (Sweden)

    2015-04-15

    In this study we investigate the magnetic behavior of magnetic multi-core particles and the differences in the magnetic properties of multi-core and single-core nanoparticles and correlate the results with the nanostructure of the different particles as determined from transmission electron microscopy (TEM). We also investigate how the effective particle magnetic moment is coupled to the individual moments of the single-domain nanocrystals by using different measurement techniques: DC magnetometry, AC susceptometry, dynamic light scattering and TEM. We have studied two magnetic multi-core particle systems – BNF Starch from Micromod with a median particle diameter of 100 nm and FeraSpin R from nanoPET with a median particle diameter of 70 nm – and one single-core particle system – SHP25 from Ocean NanoTech with a median particle core diameter of 25 nm.

  10. Distributed fingerprint enhancement on a multicore cluster

    CSIR Research Space (South Africa)

    Khanyile, NP

    2012-07-01

    Full Text Available stream_source_info Khanyile1_2012.pdf.txt stream_content_type text/plain stream_size 36623 Content-Encoding ISO-8859-1 stream_name Khanyile1_2012.pdf.txt Content-Type text/plain; charset=ISO-8859-1 Distributed... access patterns, making file locking very inefficient. Locking a file region while using this type of partition ultimately locks the entire file, resulting in completely serialized I/O operations which renders MPI parallel I/O useless. Idle processors...

  11. AMD's 64-bit Opteron processor

    CERN Multimedia

    CERN. Geneva

    2003-01-01

    This talk concentrates on issues that relate to obtaining peak performance from the Opteron processor. Compiler options, memory layout, MPI issues in multi-processor configurations and the use of a NUMA kernel will be covered. A discussion of recent benchmarking projects and results will also be included.BiographiesDavid RichDavid directs AMD's efforts in high performance computing and also in the use of Opteron processors...

  12. NeuroFlow: A General Purpose Spiking Neural Network Simulation Platform using Customizable Processors.

    Science.gov (United States)

    Cheung, Kit; Schultz, Simon R; Luk, Wayne

    2015-01-01

    NeuroFlow is a scalable spiking neural network simulation platform for off-the-shelf high performance computing systems using customizable hardware processors such as Field-Programmable Gate Arrays (FPGAs). Unlike multi-core processors and application-specific integrated circuits, the processor architecture of NeuroFlow can be redesigned and reconfigured to suit a particular simulation to deliver optimized performance, such as the degree of parallelism to employ. The compilation process supports using PyNN, a simulator-independent neural network description language, to configure the processor. NeuroFlow supports a number of commonly used current or conductance based neuronal models such as integrate-and-fire and Izhikevich models, and the spike-timing-dependent plasticity (STDP) rule for learning. A 6-FPGA system can simulate a network of up to ~600,000 neurons and can achieve a real-time performance of 400,000 neurons. Using one FPGA, NeuroFlow delivers a speedup of up to 33.6 times the speed of an 8-core processor, or 2.83 times the speed of GPU-based platforms. With high flexibility and throughput, NeuroFlow provides a viable environment for large-scale neural network simulation.

  13. Heterogeneous Multicore Parallel Programming for Graphics Processing Units

    Directory of Open Access Journals (Sweden)

    Francois Bodin

    2009-01-01

    Full Text Available Hybrid parallel multicore architectures based on graphics processing units (GPUs can provide tremendous computing power. Current NVIDIA and AMD Graphics Product Group hardware display a peak performance of hundreds of gigaflops. However, exploiting GPUs from existing applications is a difficult task that requires non-portable rewriting of the code. In this paper, we present HMPP, a Heterogeneous Multicore Parallel Programming workbench with compilers, developed by CAPS entreprise, that allows the integration of heterogeneous hardware accelerators in a unintrusive manner while preserving the legacy code.

  14. GoCxx: a tool to easily leverage C++ legacy code for multicore-friendly Go libraries and frameworks

    International Nuclear Information System (INIS)

    Binet, Sébastien

    2012-01-01

    Current HENP libraries and frameworks were written before multicore systems became widely deployed and used. From this environment, a ‘single-thread’ processing model naturally emerged but the implicit assumptions it encouraged are greatly impairing our abilities to scale in a multicore/manycore world. Writing scalable code in C++ for multicore architectures, while doable, is no panacea. Sure, C++11 will improve on the current situation (by standardizing on std::thread, introducing lambda functions and defining a memory model) but it will do so at the price of complicating further an already quite sophisticated language. This level of sophistication has probably already strongly motivated analysis groups to migrate to CPython, hoping for its current limitations with respect to multicore scalability to be either lifted (Grand Interpreter Lock removal) or for the advent of a new Python VM better tailored for this kind of environment (PyPy, Jython, …) Could HENP migrate to a language with none of the deficiencies of C++ (build time, deployment, low level tools for concurrency) and with the fast turn-around time, simplicity and ease of coding of Python? This paper will try to make the case for Go - a young open source language with built-in facilities to easily express and expose concurrency - being such a language. We introduce GoCxx, a tool leveraging gcc-xml's output to automatize the tedious work of creating Go wrappers for foreign languages, a critical task for any language wishing to leverage legacy and field-tested code. We will conclude with the first results of applying GoCxx to real C++ code.

  15. The design of a graphics processor

    International Nuclear Information System (INIS)

    Holmes, M.; Thorne, A.R.

    1975-12-01

    The design of a graphics processor is described which takes into account known and anticipated user requirements, the availability of cheap minicomputers, the state of integrated circuit technology, and the overall need to minimise cost for a given performance. The main user needs are the ability to display large high resolution pictures, and to dynamically change the user's view in real time by means of fast coordinate processing hardware. The transformations that can be applied to 2D or 3D coordinates either singly or in combination are: translation, scaling, mirror imaging, rotation, and the ability to map the transformation origin on to any point on the screen. (author)

  16. Treecode with a Special-Purpose Processor

    Science.gov (United States)

    Makino, Junichiro

    1991-08-01

    We describe an implementation of the modified Barnes-Hut tree algorithm for a gravitational N-body calculation on a GRAPE (GRAvity PipE) backend processor. GRAPE is a special-purpose computer for N-body calculations. It receives the positions and masses of particles from a host computer and then calculates the gravitational force at each coordinate specified by the host. To use this GRAPE processor with the hierarchical tree algorithm, the host computer must maintain a list of all nodes that exert force on a particle. If we create this list for each particle of the system at each timestep, the number of floating-point operations on the host and that on GRAPE would become comparable, and the increased speed obtained by using GRAPE would be small. In our modified algorithm, we create a list of nodes for many particles. Thus, the amount of the work required of the host is significantly reduced. This algorithm was originally developed by Barnes in order to vectorize the force calculation on a Cyber 205. With this algorithm, the computing time of the force calculation becomes comparable to that of the tree construction, if the GRAPE backend processor is sufficiently fast. The obtained speed-up factor is 30 to 50 for a RISC-based host computer and GRAPE-1A with a peak speed of 240 Mflops.

  17. Parallel processing architecture for H.264 deblocking filter on multi-core platforms

    Science.gov (United States)

    Prasad, Durga P.; Sonachalam, Sekar; Kunchamwar, Mangesh K.; Gunupudi, Nageswara Rao

    2012-03-01

    Massively parallel computing (multi-core) chips offer outstanding new solutions that satisfy the increasing demand for high resolution and high quality video compression technologies such as H.264. Such solutions not only provide exceptional quality but also efficiency, low power, and low latency, previously unattainable in software based designs. While custom hardware and Application Specific Integrated Circuit (ASIC) technologies may achieve lowlatency, low power, and real-time performance in some consumer devices, many applications require a flexible and scalable software-defined solution. The deblocking filter in H.264 encoder/decoder poses difficult implementation challenges because of heavy data dependencies and the conditional nature of the computations. Deblocking filter implementations tend to be fixed and difficult to reconfigure for different needs. The ability to scale up for higher quality requirements such as 10-bit pixel depth or a 4:2:2 chroma format often reduces the throughput of a parallel architecture designed for lower feature set. A scalable architecture for deblocking filtering, created with a massively parallel processor based solution, means that the same encoder or decoder will be deployed in a variety of applications, at different video resolutions, for different power requirements, and at higher bit-depths and better color sub sampling patterns like YUV, 4:2:2, or 4:4:4 formats. Low power, software-defined encoders/decoders may be implemented using a massively parallel processor array, like that found in HyperX technology, with 100 or more cores and distributed memory. The large number of processor elements allows the silicon device to operate more efficiently than conventional DSP or CPU technology. This software programing model for massively parallel processors offers a flexible implementation and a power efficiency close to that of ASIC solutions. This work describes a scalable parallel architecture for an H.264 compliant deblocking

  18. Communication Driven Mapping of Applications on Multicore Platforms

    NARCIS (Netherlands)

    Ashraf, I.

    2016-01-01

    Though transistor scaling yields more transistors per chip, however, the consistent performance gain due to frequency scaling is no more feasible due to physical limits. These trends shifted the computational paradigm towards integration of more and more processing cores. Multicore computing is

  19. Comparing and Optimising Parallel Haskell Implementations for Multicore Machines

    DEFF Research Database (Denmark)

    Berthold, Jost; Marlow, Simon; Hammond, Kevin

    2009-01-01

    In this paper, we investigate the differences and tradeoffs imposed by two parallel Haskell dialects running on multicore machines. GpH and Eden are both constructed using the highly-optimising sequential GHC compiler, and share thread scheduling, and other elements, from a common code base. The ...

  20. Hardware/software virtualization for the reconfigurable multicore platform.

    NARCIS (Netherlands)

    Ferger, M.; Al Kadi, M.; Hübner, M.; Koedam, M.L.P.J.; Sinha, S.S.; Goossens, K.G.W.; Marchesan Almeida, Gabriel; Rodrigo Azambuja, J.; Becker, Juergen

    2012-01-01

    This paper presents the Flex Tiles approach for the virtualization of hardware and software for a reconfigurable multicore architecture. The approach enables the virtualization of a dynamic tile-based hardware architecture consisting of processing tiles connected via a network-on-chip and a

  1. Multicore fibers for high-capacity submarine transmission systems

    DEFF Research Database (Denmark)

    Nooruzzaman, Md.; Morioka, Toshio

    2018-01-01

    Applications of multicore fibers (MCFs) in undersea transmission systems are investigated, and various potential architectures of branching units for MCF-based undersea transmission systems are presented. Some MCF-based submarine network architectures based on the amount of data traffic are also...

  2. Automatic generation of application specific FPGA multicore accelerators

    DEFF Research Database (Denmark)

    Hindborg, Andreas Erik; Schleuniger, Pascal; Jensen, Nicklas Bo

    2014-01-01

    High performance computing systems make increasing use of hardware accelerators to improve performance and power properties. For large high-performance FPGAs to be successfully integrated in such computing systems, methods to raise the abstraction level of FPGA programming are required...... to identify optimal performance energy trade-offs points for a multicore based FPGA accelerator....

  3. Single-mode multicore fiber for dense space division multiplexing

    DEFF Research Database (Denmark)

    Sasaki, Yusuke; Amma, Yoshimichi; Takenaga, Katsuhiro

    2016-01-01

    Single-mode multicore fiber (SM-MCF) is attractive for high-capacity transmission. Our fabricated SM-MCFs achieve high core count and low crosstalk with a cladding diameter of 230 µm. Characteristics of fan-in/fan-out for the SM-MCFs are also investigated....

  4. Adaptive query parallelization in multi-core column stores

    NARCIS (Netherlands)

    M.M. Gawade (Mrunal); M.L. Kersten (Martin); M.M. Gawade (Mrunal); M.L. Kersten (Martin)

    2016-01-01

    htmlabstractWith the rise of multi-core CPU platforms, their optimal utilization for in-memory OLAP workloads using column store databases has become one of the biggest challenges. Some of the inherent limi- tations in the achievable query parallelism are due to the degree of parallelism

  5. Composable processor virtualization for embedded systems

    NARCIS (Netherlands)

    Molnos, A.M.; Milutinovic, A.; She, D.; Goossens, K.G.W.

    2010-01-01

    Processor virtualization divides a physical processor's time among a set of virual machines, enabling efficient hardware utilization, application security and allowing co-existence of different operating systems on the same processor. Through initially intended for the server domain, virtualization

  6. A programmable systolic trigger processor for FERA bus data

    International Nuclear Information System (INIS)

    Appelquist, G.; Hovander, B.; Sellden, B.; Bohm, C.

    1992-09-01

    A generic CAMAC based trigger processor module for fast processing of large amounts of ADC data, has been designed. This module has been realised using complex programmable gate arrays (LCAs from XILINX). The gate arrays have been connected to memories and multipliers in such a way that different gate array configurations can cover a wide range of module applications. Using this module, it is possible to construct complex trigger processors. The module uses both the fast ECL FERA bus and the CAMAC bus for inputs and outputs. The latter, however, is primarily used for set-up and control but may also be used for data output. Large numbers of ADCs can be served by a hierarchical arrangement of trigger processor modules, processing ADC data with pipe-line arithmetics producing the final result at the apex of the pyramid. The trigger decision will be transmitted to the data acquisition system via a logic signal while numeric results may be extracted by the CAMAC controller. The trigger processor was originally developed for the proposed neutral particle search experiment at CERN, NUMASS. There it was designed to serve as a second level trigger processor. It was required to correct all ADC raw data for efficiency and pedestal, calculate the total calorimeter energy, obtain the optimal time of flight data and calculate the particle mass. A suitable mass cut would then deliver the trigger decision. More complex triggers were also considered. (au)

  7. Efficient Support for Matrix Computations on Heterogeneous Multi-core and Multi-GPU Architectures

    Energy Technology Data Exchange (ETDEWEB)

    Dong, Fengguang [Univ. of Tennessee, Knoxville, TN (United States); Tomov, Stanimire [Univ. of Tennessee, Knoxville, TN (United States); Dongarra, Jack [Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)

    2011-06-01

    We present a new methodology for utilizing all CPU cores and all GPUs on a heterogeneous multicore and multi-GPU system to support matrix computations e ciently. Our approach is able to achieve the objectives of a high degree of parallelism, minimized synchronization, minimized communication, and load balancing. Our main idea is to treat the heterogeneous system as a distributed-memory machine, and to use a heterogeneous 1-D block cyclic distribution to allocate data to the host system and GPUs to minimize communication. We have designed heterogeneous algorithms with two di erent tile sizes (one for CPU cores and the other for GPUs) to cope with processor heterogeneity. We propose an auto-tuning method to determine the best tile sizes to attain both high performance and load balancing. We have also implemented a new runtime system and applied it to the Cholesky and QR factorizations. Our experiments on a compute node with two Intel Westmere hexa-core CPUs and three Nvidia Fermi GPUs demonstrate good weak scalability, strong scalability, load balance, and e ciency of our approach.

  8. MPIGeneNet: Parallel Calculation of Gene Co-Expression Networks on Multicore Clusters.

    Science.gov (United States)

    Gonzalez-Dominguez, Jorge; Martin, Maria J

    2017-10-10

    In this work we present MPIGeneNet, a parallel tool that applies Pearson's correlation and Random Matrix Theory to construct gene co-expression networks. It is based on the state-of-the-art sequential tool RMTGeneNet, which provides networks with high robustness and sensitivity at the expenses of relatively long runtimes for large scale input datasets. MPIGeneNet returns the same results as RMTGeneNet but improves the memory management, reduces the I/O cost, and accelerates the two most computationally demanding steps of co-expression network construction by exploiting the compute capabilities of common multicore CPU clusters. Our performance evaluation on two different systems using three typical input datasets shows that MPIGeneNet is significantly faster than RMTGeneNet. As an example, our tool is up to 175.41 times faster on a cluster with eight nodes, each one containing two 12-core Intel Haswell processors. Source code of MPIGeneNet, as well as a reference manual, are available at https://sourceforge.net/projects/mpigenenet/.

  9. Distributed processor systems

    International Nuclear Information System (INIS)

    Zacharov, B.

    1976-01-01

    In recent years, there has been a growing tendency in high-energy physics and in other fields to solve computational problems by distributing tasks among the resources of inter-coupled processing devices and associated system elements. This trend has gained further momentum more recently with the increased availability of low-cost processors and with the development of the means of data distribution. In two lectures, the broad question of distributed computing systems is examined and the historical development of such systems reviewed. An attempt is made to examine the reasons for the existence of these systems and to discern the main trends for the future. The components of distributed systems are discussed in some detail and particular emphasis is placed on the importance of standards and conventions in certain key system components. The ideas and principles of distributed systems are discussed in general terms, but these are illustrated by a number of concrete examples drawn from the context of the high-energy physics environment. (Auth.)

  10. Green Secure Processors: Towards Power-Efficient Secure Processor Design

    Science.gov (United States)

    Chhabra, Siddhartha; Solihin, Yan

    With the increasing wealth of digital information stored on computer systems today, security issues have become increasingly important. In addition to attacks targeting the software stack of a system, hardware attacks have become equally likely. Researchers have proposed Secure Processor Architectures which utilize hardware mechanisms for memory encryption and integrity verification to protect the confidentiality and integrity of data and computation, even from sophisticated hardware attacks. While there have been many works addressing performance and other system level issues in secure processor design, power issues have largely been ignored. In this paper, we first analyze the sources of power (energy) increase in different secure processor architectures. We then present a power analysis of various secure processor architectures in terms of their increase in power consumption over a base system with no protection and then provide recommendations for designs that offer the best balance between performance and power without compromising security. We extend our study to the embedded domain as well. We also outline the design of a novel hybrid cryptographic engine that can be used to minimize the power consumption for a secure processor. We believe that if secure processors are to be adopted in future systems (general purpose or embedded), it is critically important that power issues are considered in addition to performance and other system level issues. To the best of our knowledge, this is the first work to examine the power implications of providing hardware mechanisms for security.

  11. Ring-array processor distribution topology for optical interconnects

    Science.gov (United States)

    Li, Yao; Ha, Berlin; Wang, Ting; Wang, Sunyu; Katz, A.; Lu, X. J.; Kanterakis, E.

    1992-01-01

    The existing linear and rectangular processor distribution topologies for optical interconnects, although promising in many respects, cannot solve problems such as clock skews, the lack of supporting elements for efficient optical implementation, etc. The use of a ring-array processor distribution topology, however, can overcome these problems. Here, a study of the ring-array topology is conducted with an aim of implementing various fast clock rate, high-performance, compact optical networks for digital electronic multiprocessor computers. Practical design issues are addressed. Some proof-of-principle experimental results are included.

  12. The associative memory system for the FTK processor at ATLAS

    CERN Document Server

    Magalotti, D; The ATLAS collaboration; Donati, S; Luciano, P; Piendibene, M; Giannetti, P; Lanza, A; Verzellesi, G; Sakellariou, Andreas; Billereau, W; Combe, J M

    2014-01-01

    In high energy physics experiments, the most interesting processes are very rare and hidden in an extremely large level of background. As the experiment complexity, accelerator backgrounds, and instantaneous luminosity increase, more effective and accurate data selection techniques are needed. The Fast TracKer processor (FTK) is a real time tracking processor designed for the ATLAS trigger upgrade. The FTK core is the Associative Memory system. It provides massive computing power to minimize the processing time of complex tracking algorithms executed online. This paper reports on the results and performance of a new prototype of Associative Memory system.

  13. Graphics processor efficiency for realization of rapid tabular computations

    International Nuclear Information System (INIS)

    Dudnik, V.A.; Kudryavtsev, V.I.; Us, S.A.; Shestakov, M.V.

    2016-01-01

    Capabilities of graphics processing units (GPU) and central processing units (CPU) have been investigated for realization of fast-calculation algorithms with the use of tabulated functions. The realization of tabulated functions is exemplified by the GPU/CPU architecture-based processors. Comparison is made between the operating efficiencies of GPU and CPU, employed for tabular calculations at different conditions of use. Recommendations are formulated for the use of graphical and central processors to speed up scientific and engineering computations through the use of tabulated functions

  14. Processors and systems (picture processing)

    Energy Technology Data Exchange (ETDEWEB)

    Gemmar, P

    1983-01-01

    Automatic picture processing requires high performance computers and high transmission capacities in the processor units. The author examines the possibilities of operating processors in parallel in order to accelerate the processing of pictures. He therefore discusses a number of available processors and systems for picture processing and illustrates their capacities for special types of picture processing. He stresses the fact that the amount of storage required for picture processing is exceptionally high. The author concludes that it is as yet difficult to decide whether very large groups of simple processors or highly complex multiprocessor systems will provide the best solution. Both methods will be aided by the development of VLSI. New solutions have already been offered (systolic arrays and 3-d processing structures) but they also are subject to losses caused by inherently parallel algorithms. Greater efforts must be made to produce suitable software for multiprocessor systems. Some possibilities for future picture processing systems are discussed. 33 references.

  15. Seismometer array station processors

    International Nuclear Information System (INIS)

    Key, F.A.; Lea, T.G.; Douglas, A.

    1977-01-01

    A description is given of the design, construction and initial testing of two types of Seismometer Array Station Processor (SASP), one to work with data stored on magnetic tape in analogue form, the other with data in digital form. The purpose of a SASP is to detect the short period P waves recorded by a UK-type array of 20 seismometers and to edit these on to a a digital library tape or disc. The edited data are then processed to obtain a rough location for the source and to produce seismograms (after optimum processing) for analysis by a seismologist. SASPs are an important component in the scheme for monitoring underground explosions advocated by the UK in the Conference of the Committee on Disarmament. With digital input a SASP can operate at 30 times real time using a linear detection process and at 20 times real time using the log detector of Weichert. Although the log detector is slower, it has the advantage over the linear detector that signals with lower signal-to-noise ratio can be detected and spurious large amplitudes are less likely to produce a detection. It is recommended, therefore, that where possible array data should be recorded in digital form for input to a SASP and that the log detector of Weichert be used. Trial runs show that a SASP is capable of detecting signals down to signal-to-noise ratios of about two with very few false detections, and at mid-continental array sites it should be capable of detecting most, if not all, the signals with magnitude above msub(b) 4.5; the UK argues that, given a suitable network, it is realistic to hope that sources of this magnitude and above can be detected and identified by seismological means alone. (author)

  16. Design of Networks-on-Chip for Real-Time Multi-Processor Systems-on-Chip

    DEFF Research Database (Denmark)

    Sparsø, Jens

    2012-01-01

    This paper addresses the design of networks-on-chips for use in multi-processor systems-on-chips - the hardware platforms used in embedded systems. These platforms typically have to guarantee real-time properties, and as the network is a shared resource, it has to provide service guarantees...... (bandwidth and/or latency) to different communication flows. The paper reviews some past work in this field and the lessons learned, and the paper discusses ongoing research conducted as part of the project "Time-predictable Multi-Core Architecture for Embedded Systems" (T-CREST), supported by the European...

  17. Accelerating Atmospheric Modeling Through Emerging Multi-core Technologies

    OpenAIRE

    Linford, John Christian

    2010-01-01

    The new generations of multi-core chipset architectures achieve unprecedented levels of computational power while respecting physical and economical constraints. The cost of this power is bewildering program complexity. Atmospheric modeling is a grand-challenge problem that could make good use of these architectures if they were more accessible to the average programmer. To that end, software tools and programming methodologies that greatly simplify the acceleration of atmospheric modeling...

  18. Multicore optical fiber grating array fabrication for medical sensing applications

    Science.gov (United States)

    Westbrook, Paul S.; Feder, K. S.; Kremp, T.; Taunay, T. F.; Monberg, E.; Puc, G.; Ortiz, R.

    2015-03-01

    In this work we report on a fiber grating fabrication platform suitable for parallel fabrication of Bragg grating arrays over arbitrary lengths of multicore optical fiber. Our system exploits UV transparent coatings and has precision fiber translation that allows for quasi-continuous grating fabrication. Our system is capable of both uniform and chirped fiber grating array spectra that can meet the demands of medical sensors including high speed, accuracy, robustness and small form factor.

  19. PERI - auto-tuning memory-intensive kernels for multicore

    International Nuclear Information System (INIS)

    Williams, S; Carter, J; Oliker, L; Shalf, J; Yelick, K; Bailey, D; Datta, K

    2008-01-01

    We present an auto-tuning approach to optimize application performance on emerging multicore architectures. The methodology extends the idea of search-based performance optimizations, popular in linear algebra and FFT libraries, to application-specific computational kernels. Our work applies this strategy to sparse matrix vector multiplication (SpMV), the explicit heat equation PDE on a regular grid (Stencil), and a lattice Boltzmann application (LBMHD). We explore one of the broadest sets of multicore architectures in the high-performance computing literature, including the Intel Xeon Clovertown, AMD Opteron Barcelona, Sun Victoria Falls, and the Sony-Toshiba-IBM (STI) Cell. Rather than hand-tuning each kernel for each system, we develop a code generator for each kernel that allows us identify a highly optimized version for each platform, while amortizing the human programming effort. Results show that our auto-tuned kernel applications often achieve a better than 4x improvement compared with the original code. Additionally, we analyze a Roofline performance model for each platform to reveal hardware bottlenecks and software challenges for future multicore systems and applications

  20. Polytopol computing for multi-core and distributed systems

    Science.gov (United States)

    Spaanenburg, Henk; Spaanenburg, Lambert; Ranefors, Johan

    2009-05-01

    Multi-core computing provides new challenges to software engineering. The paper addresses such issues in the general setting of polytopol computing, that takes multi-core problems in such widely differing areas as ambient intelligence sensor networks and cloud computing into account. It argues that the essence lies in a suitable allocation of free moving tasks. Where hardware is ubiquitous and pervasive, the network is virtualized into a connection of software snippets judiciously injected to such hardware that a system function looks as one again. The concept of polytopol computing provides a further formalization in terms of the partitioning of labor between collector and sensor nodes. Collectors provide functions such as a knowledge integrator, awareness collector, situation displayer/reporter, communicator of clues and an inquiry-interface provider. Sensors provide functions such as anomaly detection (only communicating singularities, not continuous observation), they are generally powered or self-powered, amorphous (not on a grid) with generation-and-attrition, field re-programmable, and sensor plug-and-play-able. Together the collector and the sensor are part of the skeleton injector mechanism, added to every node, and give the network the ability to organize itself into some of many topologies. Finally we will discuss a number of applications and indicate how a multi-core architecture supports the security aspects of the skeleton injector.

  1. Multi-core processing and scheduling performance in CMS

    CERN Multimedia

    CERN. Geneva

    2012-01-01

    Commodity hardware is going many-core. We might soon not be able to satisfy the job memory needs per core in the current single-core processing model in High Energy Physics. In addition, an ever increasing number of independent and incoherent jobs running on the same physical hardware not sharing resources might significantly affect processing performance. It will be essential to effectively utilize the multi-core architecture. CMS has incorporated support for multi-core processing in the event processing framework and the workload management system. Multi-core processing jobs share common data in memory, such us the code libraries, detector geometry and conditions data, resulting in a much lower memory usage than standard single-core independent jobs. Exploiting this new processing model requires a new model in computing resource allocation, departing from the standard single-core allocation for a job. The experiment job management system needs to have control over a larger quantum of resource since multi-...

  2. PERI - Auto-tuning Memory Intensive Kernels for Multicore

    Energy Technology Data Exchange (ETDEWEB)

    Bailey, David H; Williams, Samuel; Datta, Kaushik; Carter, Jonathan; Oliker, Leonid; Shalf, John; Yelick, Katherine; Bailey, David H

    2008-06-24

    We present an auto-tuning approach to optimize application performance on emerging multicore architectures. The methodology extends the idea of search-based performance optimizations, popular in linear algebra and FFT libraries, to application-specific computational kernels. Our work applies this strategy to Sparse Matrix Vector Multiplication (SpMV), the explicit heat equation PDE on a regular grid (Stencil), and a lattice Boltzmann application (LBMHD). We explore one of the broadest sets of multicore architectures in the HPC literature, including the Intel Xeon Clovertown, AMD Opteron Barcelona, Sun Victoria Falls, and the Sony-Toshiba-IBM (STI) Cell. Rather than hand-tuning each kernel for each system, we develop a code generator for each kernel that allows us to identify a highly optimized version for each platform, while amortizing the human programming effort. Results show that our auto-tuned kernel applications often achieve a better than 4X improvement compared with the original code. Additionally, we analyze a Roofline performance model for each platform to reveal hardware bottlenecks and software challenges for future multicore systems and applications.

  3. Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms

    Energy Technology Data Exchange (ETDEWEB)

    Williams, Samuel; Carter, Jonathan; Oliker, Leonid; Shalf, John; Yelick, Katherine

    2008-02-01

    We present an auto-tuning approach to optimize application performance on emerging multicore architectures. The methodology extends the idea of search-based performance optimizations, popular in linear algebra and FFT libraries, to application-specific computational kernels. Our work applies this strategy to a lattice Boltzmann application (LBMHD) that historically has made poor use of scalar microprocessors due to its complex data structures and memory access patterns. We explore one of the broadest sets of multicore architectures in the HPC literature, including the Intel Clovertown, AMD Opteron X2, Sun Niagara2, STI Cell, as well as the single core Intel Itanium2. Rather than hand-tuning LBMHD for each system, we develop a code generator that allows us identify a highly optimized version for each platform, while amortizing the human programming effort. Results show that our auto-tuned LBMHD application achieves up to a 14x improvement compared with the original code. Additionally, we present detailed analysis of each optimization, which reveal surprising hardware bottlenecks and software challenges for future multicore systems and applications.

  4. Lattice Boltzmann simulation optimization on leading multicore platforms

    Energy Technology Data Exchange (ETDEWEB)

    Williams, S. [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Univ. of California, Berkeley, CA (United States); Carter, J. [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Oliker, L. [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Shalf, J. [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Yelick, K. [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Univ. of California, Berkeley, CA (United States)

    2008-01-01

    We present an auto-tuning approach to optimize application performance on emerging multicore architectures. The methodology extends the idea of searchbased performance optimizations, popular in linear algebra and FFT libraries, to application-specific computational kernels. Our work applies this strategy to a lattice Boltzmann application (LBMHD) that historically has made poor use of scalar microprocessors due to its complex data structures and memory access patterns. We explore one of the broadest sets of multicore architectures in the HPC literature, including the Intel Clovertown, AMD Opteron X2, Sun Niagara2, STI Cell, as well as the single core Intel Itanium2. Rather than hand-tuning LBMHD for each system, we develop a code generator that allows us identify a highly optimized version for each platform, while amortizing the human programming effort. Results show that our autotuned LBMHD application achieves up to a 14 improvement compared with the original code. Additionally, we present detailed analysis of each optimization, which reveal surprising hardware bottlenecks and software challenges for future multicore systems and applications.

  5. Making CSB + -Trees Processor Conscious

    DEFF Research Database (Denmark)

    Samuel, Michael; Pedersen, Anders Uhl; Bonnet, Philippe

    2005-01-01

    of the CSB+-tree. We argue that it is necessary to consider a larger group of parameters in order to adapt CSB+-tree to processor architectures as different as Pentium and Itanium. We identify this group of parameters and study how it impacts the performance of CSB+-tree on Itanium 2. Finally, we propose......Cache-conscious indexes, such as CSB+-tree, are sensitive to the underlying processor architecture. In this paper, we focus on how to adapt the CSB+-tree so that it performs well on a range of different processor architectures. Previous work has focused on the impact of node size on the performance...... a systematic method for adapting CSB+-tree to new platforms. This work is a first step towards integrating CSB+-tree in MySQL’s heap storage manager....

  6. Java Processor Optimized for RTSJ

    Directory of Open Access Journals (Sweden)

    Tu Shiliang

    2007-01-01

    Full Text Available Due to the preeminent work of the real-time specification for Java (RTSJ, Java is increasingly expected to become the leading programming language in real-time systems. To provide a Java platform suitable for real-time applications, a Java processor which can execute Java bytecode is directly proposed in this paper. It provides efficient support in hardware for some mechanisms specified in the RTSJ and offers a simpler programming model through ameliorating the scoped memory of the RTSJ. The worst case execution time (WCET of the bytecodes implemented in this processor is predictable by employing the optimization method proposed in our previous work, in which all the processing interfering predictability is handled before bytecode execution. Further advantage of this method is to make the implementation of the processor simpler and suited to a low-cost FPGA chip.

  7. OSCAR API for Real-Time Low-Power Multicores and Its Performance on Multicores and SMP Servers

    Science.gov (United States)

    Kimura, Keiji; Mase, Masayoshi; Mikami, Hiroki; Miyamoto, Takamichi; Shirako, Jun; Kasahara, Hironori

    OSCAR (Optimally Scheduled Advanced Multiprocessor) API has been designed for real-time embedded low-power multicores to generate parallel programs for various multicores from different vendors by using the OSCAR parallelizing compiler. The OSCAR API has been developed by Waseda University in collaboration with Fujitsu Laboratory, Hitachi, NEC, Panasonic, Renesas Technology, and Toshiba in an METI/NEDO project entitled "Multicore Technology for Realtime Consumer Electronics." By using the OSCAR API as an interface between the OSCAR compiler and backend compilers, the OSCAR compiler enables hierarchical multigrain parallel processing with memory optimization under capacity restriction for cache memory, local memory, distributed shared memory, and on-chip/off-chip shared memory; data transfer using a DMA controller; and power reduction control using DVFS (Dynamic Voltage and Frequency Scaling), clock gating, and power gating for various embedded multicores. In addition, a parallelized program automatically generated by the OSCAR compiler with OSCAR API can be compiled by the ordinary OpenMP compilers since the OSCAR API is designed on a subset of the OpenMP. This paper describes the OSCAR API and its compatibility with the OSCAR compiler by showing code examples. Performance evaluations of the OSCAR compiler and the OSCAR API are carried out using an IBM Power5+ workstation, an IBM Power6 high-end SMP server, and a newly developed consumer electronics multicore chip RP2 by Renesas, Hitachi and Waseda. From the results of scalability evaluation, it is found that on an average, the OSCAR compiler with the OSCAR API can exploit 5.8 times speedup over the sequential execution on the Power5+ workstation with eight cores and 2.9 times speedup on RP2 with four cores, respectively. In addition, the OSCAR compiler can accelerate an IBM XL Fortran compiler up to 3.3 times on the Power6 SMP server. Due to low-power optimization on RP2, the OSCAR compiler with the OSCAR API

  8. Optical Array Processor: Laboratory Results

    Science.gov (United States)

    Casasent, David; Jackson, James; Vaerewyck, Gerard

    1987-01-01

    A Space Integrating (SI) Optical Linear Algebra Processor (OLAP) is described and laboratory results on its performance in several practical engineering problems are presented. The applications include its use in the solution of a nonlinear matrix equation for optimal control and a parabolic Partial Differential Equation (PDE), the transient diffusion equation with two spatial variables. Frequency-multiplexed, analog and high accuracy non-base-two data encoding are used and discussed. A multi-processor OLAP architecture is described and partitioning and data flow issues are addressed.

  9. Towards industry strength mapping of AUTOSAR automotive functionality on multicore architectures

    DEFF Research Database (Denmark)

    Avasalcai, Cosmin Florin; Budhrani, Dhanesh; Pop, Paul

    2017-01-01

    The automotive electronic architectures have moved from federated architectures, where one function is implemented in one ECU (Electronic Control Unit), to distributed architectures, consisting of several multicore ECUs. In addition, multicore ECUs are being adopted because of better performance,...... engineer in the mapping task. We have successfully evaluated AUTOMAP on several realistic use cases from Volvo Trucks....

  10. Novel Crosstalk Measurement Method for Multi-Core Fiber Fan-In/Fan-Out Devices

    DEFF Research Database (Denmark)

    Ye, Feihong; Ono, Hirotaka; Abe, Yoshiteru

    2016-01-01

    We propose a new crosstalk measurement method for multi-core fiber fan-in/fan-out devices utilizing the Fresnel reflection. Compared with the traditional method using core-to-core coupling between a multi-core fiber and a single-mode fiber, the proposed method has the advantages of high reliability...

  11. An object-oriented bulk synchronous parallel library for multicore programming

    NARCIS (Netherlands)

    Yzelman, A.N.; Bisseling, R.H.

    2012-01-01

    We show that the bulk synchronous parallel (BSP) model, originally designed for distributed-memory systems, is also applicable for shared-memory multicore systems and, furthermore, that BSP libraries are useful in scientific computing on these systems. A proof-of-concept MulticoreBSP library has

  12. Parallelized Kalman-Filter-Based Reconstruction of Particle Tracks on Many-Core Processors and GPUs

    Science.gov (United States)

    Cerati, Giuseppe; Elmer, Peter; Krutelyov, Slava; Lantz, Steven; Lefebvre, Matthieu; Masciovecchio, Mario; McDermott, Kevin; Riley, Daniel; Tadel, Matevž; Wittich, Peter; Würthwein, Frank; Yagil, Avi

    2017-08-01

    For over a decade now, physical and energy constraints have limited clock speed improvements in commodity microprocessors. Instead, chipmakers have been pushed into producing lower-power, multi-core processors such as Graphical Processing Units (GPU), ARM CPUs, and Intel MICs. Broad-based efforts from manufacturers and developers have been devoted to making these processors user-friendly enough to perform general computations. However, extracting performance from a larger number of cores, as well as specialized vector or SIMD units, requires special care in algorithm design and code optimization. One of the most computationally challenging problems in high-energy particle experiments is finding and fitting the charged-particle tracks during event reconstruction. This is expected to become by far the dominant problem at the High-Luminosity Large Hadron Collider (HL-LHC), for example. Today the most common track finding methods are those based on the Kalman filter. Experience with Kalman techniques on real tracking detector systems has shown that they are robust and provide high physics performance. This is why they are currently in use at the LHC, both in the trigger and offine. Previously we reported on the significant parallel speedups that resulted from our investigations to adapt Kalman filters to track fitting and track building on Intel Xeon and Xeon Phi. Here, we discuss our progresses toward the understanding of these processors and the new developments to port the Kalman filter to NVIDIA GPUs.

  13. Parallelized Kalman-Filter-Based Reconstruction of Particle Tracks on Many-Core Processors and GPUs

    Directory of Open Access Journals (Sweden)

    Cerati Giuseppe

    2017-01-01

    Full Text Available For over a decade now, physical and energy constraints have limited clock speed improvements in commodity microprocessors. Instead, chipmakers have been pushed into producing lower-power, multi-core processors such as Graphical Processing Units (GPU, ARM CPUs, and Intel MICs. Broad-based efforts from manufacturers and developers have been devoted to making these processors user-friendly enough to perform general computations. However, extracting performance from a larger number of cores, as well as specialized vector or SIMD units, requires special care in algorithm design and code optimization. One of the most computationally challenging problems in high-energy particle experiments is finding and fitting the charged-particle tracks during event reconstruction. This is expected to become by far the dominant problem at the High-Luminosity Large Hadron Collider (HL-LHC, for example. Today the most common track finding methods are those based on the Kalman filter. Experience with Kalman techniques on real tracking detector systems has shown that they are robust and provide high physics performance. This is why they are currently in use at the LHC, both in the trigger and offine. Previously we reported on the significant parallel speedups that resulted from our investigations to adapt Kalman filters to track fitting and track building on Intel Xeon and Xeon Phi. Here, we discuss our progresses toward the understanding of these processors and the new developments to port the Kalman filter to NVIDIA GPUs.

  14. Parallelized Kalman-Filter-Based Reconstruction of Particle Tracks on Many-Core Processors and GPUs

    Energy Technology Data Exchange (ETDEWEB)

    Cerati, Giuseppe [Fermilab; Elmer, Peter [Princeton U.; Krutelyov, Slava [UC, San Diego; Lantz, Steven [Cornell U.; Lefebvre, Matthieu [Princeton U.; Masciovecchio, Mario [UC, San Diego; McDermott, Kevin [Cornell U.; Riley, Daniel [Cornell U., LNS; Tadel, Matevž [UC, San Diego; Wittich, Peter [Cornell U.; Würthwein, Frank [UC, San Diego; Yagil, Avi [UC, San Diego

    2017-01-01

    For over a decade now, physical and energy constraints have limited clock speed improvements in commodity microprocessors. Instead, chipmakers have been pushed into producing lower-power, multi-core processors such as Graphical Processing Units (GPU), ARM CPUs, and Intel MICs. Broad-based efforts from manufacturers and developers have been devoted to making these processors user-friendly enough to perform general computations. However, extracting performance from a larger number of cores, as well as specialized vector or SIMD units, requires special care in algorithm design and code optimization. One of the most computationally challenging problems in high-energy particle experiments is finding and fitting the charged-particle tracks during event reconstruction. This is expected to become by far the dominant problem at the High-Luminosity Large Hadron Collider (HL-LHC), for example. Today the most common track finding methods are those based on the Kalman filter. Experience with Kalman techniques on real tracking detector systems has shown that they are robust and provide high physics performance. This is why they are currently in use at the LHC, both in the trigger and offine. Previously we reported on the significant parallel speedups that resulted from our investigations to adapt Kalman filters to track fitting and track building on Intel Xeon and Xeon Phi. Here, we discuss our progresses toward the understanding of these processors and the new developments to port the Kalman filter to NVIDIA GPUs.

  15. Very Long Instruction Word Processors

    Indian Academy of Sciences (India)

    Explicitly Parallel Instruction Computing (EPIC) is an instruction processing paradigm that has been in the spot- light due to its adoption by the next generation of Intel. Processors starting with the IA-64. The EPIC processing paradigm is an evolution of the Very Long Instruction. Word (VLIW) paradigm. This article gives an ...

  16. A parallel and sensitive software tool for methylation analysis on multicore platforms.

    Science.gov (United States)

    Tárraga, Joaquín; Pérez, Mariano; Orduña, Juan M; Duato, José; Medina, Ignacio; Dopazo, Joaquín

    2015-10-01

    DNA methylation analysis suffers from very long processing time, as the advent of Next-Generation Sequencers has shifted the bottleneck of genomic studies from the sequencers that obtain the DNA samples to the software that performs the analysis of these samples. The existing software for methylation analysis does not seem to scale efficiently neither with the size of the dataset nor with the length of the reads to be analyzed. As it is expected that the sequencers will provide longer and longer reads in the near future, efficient and scalable methylation software should be developed. We present a new software tool, called HPG-Methyl, which efficiently maps bisulphite sequencing reads on DNA, analyzing DNA methylation. The strategy used by this software consists of leveraging the speed of the Burrows-Wheeler Transform to map a large number of DNA fragments (reads) rapidly, as well as the accuracy of the Smith-Waterman algorithm, which is exclusively employed to deal with the most ambiguous and shortest reads. Experimental results on platforms with Intel multicore processors show that HPG-Methyl significantly outperforms in both execution time and sensitivity state-of-the-art software such as Bismark, BS-Seeker or BSMAP, particularly for long bisulphite reads. Software in the form of C libraries and functions, together with instructions to compile and execute this software. Available by sftp to anonymous@clariano.uv.es (password 'anonymous'). juan.orduna@uv.es or jdopazo@cipf.es. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  17. Exploring Heterogeneous Multicore Architectures for Advanced Embedded Uncertainty Quantification.

    Energy Technology Data Exchange (ETDEWEB)

    Phipps, Eric T.; Edwards, Harold C.; Hu, Jonathan J.

    2014-09-01

    We explore rearrangements of classical uncertainty quantification methods with the aim of achieving higher aggregate performance for uncertainty quantification calculations on emerging multicore and manycore architectures. We show a rearrangement of the stochastic Galerkin method leads to improved performance and scalability on several computational architectures whereby un- certainty information is propagated at the lowest levels of the simulation code improving memory access patterns, exposing new dimensions of fine grained parallelism, and reducing communica- tion. We also develop a general framework for implementing such rearrangements for a diverse set of uncertainty quantification algorithms as well as computational simulation codes to which they are applied.

  18. Evaluation of SuperLU on multicore architectures

    International Nuclear Information System (INIS)

    Li, X S

    2008-01-01

    The Chip Multiprocessor (CMP) will be the basic building block for computer systems ranging from laptops to supercomputers. New software developments at all levels are needed to fully utilize these systems. In this work, we evaluate performance of different high-performance sparse LU factorization and triangular solution algorithms on several representative multicore machines. We included both Pthreads and MPI implementations in this study and found that the Pthreads implementation consistently delivers good performance and that a left-looking algorithm is usually superior

  19. Addressing the challenges of standalone multi-core simulations in molecular dynamics

    Science.gov (United States)

    Ocaya, R. O.; Terblans, J. J.

    2017-07-01

    Computational modelling in material science involves mathematical abstractions of force fields between particles with the aim to postulate, develop and understand materials by simulation. The aggregated pairwise interactions of the material's particles lead to a deduction of its macroscopic behaviours. For practically meaningful macroscopic scales, a large amount of data are generated, leading to vast execution times. Simulation times of hours, days or weeks for moderately sized problems are not uncommon. The reduction of simulation times, improved result accuracy and the associated software and hardware engineering challenges are the main motivations for many of the ongoing researches in the computational sciences. This contribution is concerned mainly with simulations that can be done on a "standalone" computer based on Message Passing Interfaces (MPI), parallel code running on hardware platforms with wide specifications, such as single/multi- processor, multi-core machines with minimal reconfiguration for upward scaling of computational power. The widely available, documented and standardized MPI library provides this functionality through the MPI_Comm_size (), MPI_Comm_rank () and MPI_Reduce () functions. A survey of the literature shows that relatively little is written with respect to the efficient extraction of the inherent computational power in a cluster. In this work, we discuss the main avenues available to tap into this extra power without compromising computational accuracy. We also present methods to overcome the high inertia encountered in single-node-based computational molecular dynamics. We begin by surveying the current state of the art and discuss what it takes to achieve parallelism, efficiency and enhanced computational accuracy through program threads and message passing interfaces. Several code illustrations are given. The pros and cons of writing raw code as opposed to using heuristic, third-party code are also discussed. The growing trend

  20. VON WISPR Family Processors: Volume 1

    National Research Council Canada - National Science Library

    Wagstaff, Ronald

    1997-01-01

    ...) and the background noise they are embedded in. Processors utilizing those fluctuations such as the von WISPR Family Processors discussed herein, are methods or algorithms that preferentially attenuate the fluctuating signals and noise...

  1. Design Principles for Synthesizable Processor Cores

    DEFF Research Database (Denmark)

    Schleuniger, Pascal; McKee, Sally A.; Karlsson, Sven

    2012-01-01

    As FPGAs get more competitive, synthesizable processor cores become an attractive choice for embedded computing. Currently popular commercial processor cores do not fully exploit current FPGA architectures. In this paper, we propose general design principles to increase instruction throughput...

  2. Debugging multi-core systems-on-chip

    NARCIS (Netherlands)

    Vermeulen, B.; Goossens, K.G.W.; Kornaros, G.

    2010-01-01

    In this chapter, we introduced three fundamental reasons why debugging a multi-processor SoC is intrinsically difficult; (1) limited internal observability, (2) asynchronicity, and (3) non-determinism.

  3. Deterministic chaos in the processor load

    International Nuclear Information System (INIS)

    Halbiniak, Zbigniew; Jozwiak, Ireneusz J.

    2007-01-01

    In this article we present the results of research whose purpose was to identify the phenomenon of deterministic chaos in the processor load. We analysed the time series of the processor load during efficiency tests of database software. Our research was done on a Sparc Alpha processor working on the UNIX Sun Solaris 5.7 operating system. The conducted analyses proved the presence of the deterministic chaos phenomenon in the processor load in this particular case

  4. Fast parallel ring recognition algorithm in the RICH detector of the CBM experiment at FAIR

    International Nuclear Information System (INIS)

    Lebedev, S.

    2011-01-01

    The Compressed Baryonic Matter (CBM)experiment at the future FAIR facility at Darmstadt will measure dileptons emitted from the hot and dense phase in heavy ion collisions. In case of an electron measurement, a high purity of identified electrons is required in order to suppress the background. Electron identification in CBM will be performed by a Ring Imaging Cherenkov (RICH) detector and Transition Radiation Detector (TRD). Very fast data reconstruction is extremely important for CBM because of the huge amount of data which has to be handled. In this contribution, a parallelized ring recognition algorithm is presented. Modern CPUs have two features, which enable parallel programming. First, the SSE technology allows using the SIMD execution model. Second, multicore CPUs enable the use of multithreading. Both features have been implemented in the ring reconstruction of the RICH detector. A considerable speedup factor from 357 to 2.5 ms/event has been achieved including preceding code optimization for Intel Xeon X5550 processors at 2.67 GHz

  5. A survey of Tumult, a real-time multi-processor system

    International Nuclear Information System (INIS)

    Jansen, P.G.

    1986-01-01

    Tumult (Twente University MULTi processor system) is the name of an ongoing project aiming at the design and implementation of a modular extendible multiprocessor system. All memory is distributed and processors communicate in parallel via a fast and reliable local switching network instead of a shared bus. A distributed real-time operating system is being designed and implemented, consisting of a multi-tasking subsystem per processor. Processes can communicate via a message passing mechanism. Communication links and processes are dynamically created and disposed by the application. In this article a brief description of the system is given; communication aspects are emphasized. (Auth.)

  6. JPP: A Java Pre-Processor

    OpenAIRE

    Kiniry, Joseph R.; Cheong, Elaine

    1998-01-01

    The Java Pre-Processor, or JPP for short, is a parsing pre-processor for the Java programming language. Unlike its namesake (the C/C++ Pre-Processor, cpp), JPP provides functionality above and beyond simple textual substitution. JPP's capabilities include code beautification, code standard conformance checking, class and interface specification and testing, and documentation generation.

  7. 3D-LIN: A Configurable Low-Latency Interconnect for Multi-Core Clusters with 3D Stacked L1 Memory

    OpenAIRE

    Beanato, Giulia; Loi, Igor; De Micheli, Giovanni; Leblebici, Yusuf; Benini, Luca

    2012-01-01

    Shared L1 memories are of interest for tightly- coupled processor clusters in programmable accelerators as they provide a convenient shared memory abstraction while avoiding cache coherence overheads. The performance of a shared-L1 memory critically depends on the architecture of the low-latency interconnect between processors and memory banks, which needs to provide ultra-fast access to the largest possible L1 working set. The advent of 3D technology provides new opportunities to improve the...

  8. Parallel Computation on Multicore Processors Using Explicit Form of the Finite Element Method and C++ Standard Libraries

    Directory of Open Access Journals (Sweden)

    Rek Václav

    2016-11-01

    Full Text Available In this paper, the form of modifications of the existing sequential code written in C or C++ programming language for the calculation of various kind of structures using the explicit form of the Finite Element Method (Dynamic Relaxation Method, Explicit Dynamics in the NEXX system is introduced. The NEXX system is the core of engineering software NEXIS, Scia Engineer, RFEM and RENEX. It has the possibilities of multithreaded running, which can now be supported at the level of native C++ programming language using standard libraries. Thanks to the high degree of abstraction that a contemporary C++ programming language provides, a respective library created in this way can be very generalized for other purposes of usage of parallelism in computational mechanics.

  9. Integrated development environment for multi-core systems

    Directory of Open Access Journals (Sweden)

    Krunić Momčilo V.

    2014-01-01

    Full Text Available Development of the software application that provides comfortable working environment of embedded software applications was always a difficult task to achieve. To reach this goal it was necessary to integrate all specific tools designed for that purpose. This paper describes Integrated Development Environment (IDE that was developed to meet all specific needs of a software development for the family of multi-core target platforms designed for a digital signal processing in Cirrus Logic Company. Eclipse platform and RCP (Rich Client Platform was used as a basis, because it provides an extensible plug-in system for customizing the development environment. CLIDE (Cirrus Logic Integrated Development Environment represent the epilog of that effort, reliable IDE used for development of embedded applications. Validation of the solution is accomplished thru 2641 J Unit tests that validate most of the CLIDE's functionalities. Developed IDE (CLIDE significantly increases a quality of a software development for multi-core systems and reduces time-to-market, thereby justifying development costs.

  10. Neural networks within multi-core optic fibers.

    Science.gov (United States)

    Cohen, Eyal; Malka, Dror; Shemer, Amir; Shahmoon, Asaf; Zalevsky, Zeev; London, Michael

    2016-07-07

    Hardware implementation of artificial neural networks facilitates real-time parallel processing of massive data sets. Optical neural networks offer low-volume 3D connectivity together with large bandwidth and minimal heat production in contrast to electronic implementation. Here, we present a conceptual design for in-fiber optical neural networks. Neurons and synapses are realized as individual silica cores in a multi-core fiber. Optical signals are transferred transversely between cores by means of optical coupling. Pump driven amplification in erbium-doped cores mimics synaptic interactions. We simulated three-layered feed-forward neural networks and explored their capabilities. Simulations suggest that networks can differentiate between given inputs depending on specific configurations of amplification; this implies classification and learning capabilities. Finally, we tested experimentally our basic neuronal elements using fibers, couplers, and amplifiers, and demonstrated that this configuration implements a neuron-like function. Therefore, devices similar to our proposed multi-core fiber could potentially serve as building blocks for future large-scale small-volume optical artificial neural networks.

  11. OpenCL Performance Evaluation on Modern Multicore CPUs

    Directory of Open Access Journals (Sweden)

    Joo Hwan Lee

    2015-01-01

    Full Text Available Utilizing heterogeneous platforms for computation has become a general trend, making the portability issue important. OpenCL (Open Computing Language serves this purpose by enabling portable execution on heterogeneous architectures. However, unpredictable performance variation on different platforms has become a burden for programmers who write OpenCL applications. This is especially true for conventional multicore CPUs, since the performance of general OpenCL applications on CPUs lags behind the performance of their counterparts written in the conventional parallel programming model for CPUs. In this paper, we evaluate the performance of OpenCL applications on out-of-order multicore CPUs from the architectural perspective. We evaluate OpenCL applications on various aspects, including API overhead, scheduling overhead, instruction-level parallelism, address space, data location, data locality, and vectorization, comparing OpenCL to conventional parallel programming models for CPUs. Our evaluation indicates unique performance characteristics of OpenCL applications and also provides insight into the optimization metrics for better performance on CPUs.

  12. Real-time trajectory optimization on parallel processors

    Science.gov (United States)

    Psiaki, Mark L.

    1993-01-01

    A parallel algorithm has been developed for rapidly solving trajectory optimization problems. The goal of the work has been to develop an algorithm that is suitable to do real-time, on-line optimal guidance through repeated solution of a trajectory optimization problem. The algorithm has been developed on an INTEL iPSC/860 message passing parallel processor. It uses a zero-order-hold discretization of a continuous-time problem and solves the resulting nonlinear programming problem using a custom-designed augmented Lagrangian nonlinear programming algorithm. The algorithm achieves parallelism of function, derivative, and search direction calculations through the principle of domain decomposition applied along the time axis. It has been encoded and tested on 3 example problems, the Goddard problem, the acceleration-limited, planar minimum-time to the origin problem, and a National Aerospace Plane minimum-fuel ascent guidance problem. Execution times as fast as 118 sec of wall clock time have been achieved for a 128-stage Goddard problem solved on 32 processors. A 32-stage minimum-time problem has been solved in 151 sec on 32 processors. A 32-stage National Aerospace Plane problem required 2 hours when solved on 32 processors. A speed-up factor of 7.2 has been achieved by using 32-nodes instead of 1-node to solve a 64-stage Goddard problem.

  13. Online Fastbus processor for LEP

    International Nuclear Information System (INIS)

    Mueller, H.

    1986-01-01

    The author describes the online computing aspects of Fastbus systems using a processor module which has been developed at CERN and is now available commercially. These General Purpose Master/Slaves (GPMS) are based on 68000/10 (or optionally 68020/68881) processors. Applications include use as event-filters (DELPHI), supervisory controllers, Fastbus stand-alone diagnostic tools, and multiprocessor array components. The direct mapping of single, 32-bit assembly instructions to execute Fastbus protocols makes the use of a GPM both simple and flexible. Loosely coupled processing in Fastbus networks is possible between GPM's as they support access semaphores and use a two port memory as I/O buffer for Fastbus. Both master and slave-ports support block transfers up to 20 Mbytes/s. The CERN standard Fastbus software and the MoniCa symbolic debugging monitor are available on the GPM with real time, multiprocessing support. (Auth.)

  14. Invasive tightly coupled processor arrays

    CERN Document Server

    LARI, VAHID

    2016-01-01

    This book introduces new massively parallel computer (MPSoC) architectures called invasive tightly coupled processor arrays. It proposes strategies, architecture designs, and programming interfaces for invasive TCPAs that allow invading and subsequently executing loop programs with strict requirements or guarantees of non-functional execution qualities such as performance, power consumption, and reliability. For the first time, such a configurable processor array architecture consisting of locally interconnected VLIW processing elements can be claimed by programs, either in full or in part, using the principle of invasive computing. Invasive TCPAs provide unprecedented energy efficiency for the parallel execution of nested loop programs by avoiding any global memory access such as GPUs and may even support loops with complex dependencies such as loop-carried dependencies that are not amenable to parallel execution on GPUs. For this purpose, the book proposes different invasion strategies for claiming a desire...

  15. Measuring Performance of Soft Real-Time Tasks on Multi-core Systems

    OpenAIRE

    Rafiq, Salman

    2011-01-01

    Multi-core platforms are well established, and they are slowly moving into the area of embedded and real-time systems. Nowadays to take advantage of multi-core systems in terms of throughput, soft real-time applications are run together with general purpose applications under an operating system such as Linux. But due to shared hardware resources in multi-core architectures, it is likely that these applications will interfere and compete with each other. This can cause slower response times f...

  16. On Designing Multicore-Aware Simulators for Systems Biology Endowed with OnLine Statistics

    Directory of Open Access Journals (Sweden)

    Marco Aldinucci

    2014-01-01

    Full Text Available The paper arguments are on enabling methodologies for the design of a fully parallel, online, interactive tool aiming to support the bioinformatics scientists .In particular, the features of these methodologies, supported by the FastFlow parallel programming framework, are shown on a simulation tool to perform the modeling, the tuning, and the sensitivity analysis of stochastic biological models. A stochastic simulation needs thousands of independent simulation trajectories turning into big data that should be analysed by statistic and data mining tools. In the considered approach the two stages are pipelined in such a way that the simulation stage streams out the partial results of all simulation trajectories to the analysis stage that immediately produces a partial result. The simulation-analysis workflow is validated for performance and effectiveness of the online analysis in capturing biological systems behavior on a multicore platform and representative proof-of-concept biological systems. The exploited methodologies include pattern-based parallel programming and data streaming that provide key features to the software designers such as performance portability and efficient in-memory (big data management and movement. Two paradigmatic classes of biological systems exhibiting multistable and oscillatory behavior are used as a testbed.

  17. On designing multicore-aware simulators for systems biology endowed with OnLine statistics.

    Science.gov (United States)

    Aldinucci, Marco; Calcagno, Cristina; Coppo, Mario; Damiani, Ferruccio; Drocco, Maurizio; Sciacca, Eva; Spinella, Salvatore; Torquati, Massimo; Troina, Angelo

    2014-01-01

    The paper arguments are on enabling methodologies for the design of a fully parallel, online, interactive tool aiming to support the bioinformatics scientists .In particular, the features of these methodologies, supported by the FastFlow parallel programming framework, are shown on a simulation tool to perform the modeling, the tuning, and the sensitivity analysis of stochastic biological models. A stochastic simulation needs thousands of independent simulation trajectories turning into big data that should be analysed by statistic and data mining tools. In the considered approach the two stages are pipelined in such a way that the simulation stage streams out the partial results of all simulation trajectories to the analysis stage that immediately produces a partial result. The simulation-analysis workflow is validated for performance and effectiveness of the online analysis in capturing biological systems behavior on a multicore platform and representative proof-of-concept biological systems. The exploited methodologies include pattern-based parallel programming and data streaming that provide key features to the software designers such as performance portability and efficient in-memory (big) data management and movement. Two paradigmatic classes of biological systems exhibiting multistable and oscillatory behavior are used as a testbed.

  18. A VLSI image processor via pseudo-mersenne transforms

    International Nuclear Information System (INIS)

    Sei, W.J.; Jagadeesh, J.M.

    1986-01-01

    The computational burden on image processing in medical fields where a large amount of information must be processed quickly and accurately has led to consideration of special-purpose image processor chip design for some time. The very large scale integration (VLSI) resolution has made it cost-effective and feasible to consider the design of special purpose chips for medical imaging fields. This paper describes a VLSI CMOS chip suitable for parallel implementation of image processing algorithms and cyclic convolutions by using Pseudo-Mersenne Number Transform (PMNT). The main advantages of the PMNT over the Fast Fourier Transform (FFT) are: (1) no multiplications are required; (2) integer arithmetic is used. The design and development of this processor, which operates on 32-point convolution or 5 x 5 window image, are described

  19. Digital implementation of the preloaded filter pulse processor

    International Nuclear Information System (INIS)

    Westphal, G.P.; Cadek, G.R.; Keroe, N.; Sauter, TH.; Thorwartl, P.C.

    1995-01-01

    Adapting it's processing time to the respective pulse intervals, the Preloaded Filter (PLF) pulse processor offers optimum resolution together with highest possible throughput rates. The PLF algorithm could be formulated in a recursive manner which made possible it's implementation by means of a large field-programmable gate array, as a fast, pipe-lined digital processor with 10 MHz maximum throughput rate. While pre-filter digitization by an ADC with 12 bit resolution and 10M Hz sampling rate resulted in a poorer resolution than that of an analog filter, a digital PLF based on an ADC with 14 bit resolution and 10 MHz sampling rate, surpassed high-quality analog filters in resolution, throughput rate and long-term stability. (author) 6 refs.; 7 figs

  20. Single-shot polarimetry imaging of multicore fiber.

    Science.gov (United States)

    Sivankutty, Siddharth; Andresen, Esben Ravn; Bouwmans, Géraud; Brown, Thomas G; Alonso, Miguel A; Rigneault, Hervé

    2016-05-01

    We report an experimental test of single-shot polarimetry applied to the problem of real-time monitoring of the output polarization states in each core within a multicore fiber bundle. The technique uses a stress-engineered optical element, together with an analyzer, and provides a point spread function whose shape unambiguously reveals the polarization state of a point source. We implement this technique to monitor, simultaneously and in real time, the output polarization states of up to 180 single-mode fiber cores in both conventional and polarization-maintaining fiber bundles. We demonstrate also that the technique can be used to fully characterize the polarization properties of each individual fiber core, including eigen-polarization states, phase delay, and diattenuation.

  1. Extended depth of field imaging through multicore optical fibers.

    Science.gov (United States)

    Orth, Antony; Ploschner, Martin; Maksymov, Ivan S; Gibson, Brant C

    2018-03-05

    Compact microendoscopes use multicore optical fibers (MOFs) to visualize hard-to-reach regions of the body. These devices typically have a large numerical aperture (NA) and are fixed-focus, leading to blurry images from a shallow depth of field with little focus control. In this work, we demonstrate a method to digitally adjust the collection aperture and therefore extend the depth of field of lensless MOF imaging probes. We show that the depth of field can be more than doubled for certain spatial frequencies, and observe a resolution enhancement of up to 78% at a distance of 50μm from the MOF facet. Our technique enables imaging of complex 3D objects at a comparable working distance to lensed MOFs, but without the requirement of lenses, scan units or transmission matrix calibration. Our approach is implemented in post processing and may be used to improve contrast in any microendoscopic probe utilizing a MOF and incoherent light.

  2. Sparse Matrix-Vector Multiplication on Multicore and Accelerators

    Energy Technology Data Exchange (ETDEWEB)

    Williams, Samuel W. [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Bell, Nathan [NVIDIA Research, Santa Clara, CA (United States); Choi, Jee Whan [Georgia Inst. of Technology, Atlanta, GA (United States); Garland, Michael [NVIDIA Research, Santa Clara, CA (United States); Oliker, Leonid [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Vuduc, Richard [Georgia Inst. of Technology, Atlanta, GA (United States)

    2010-12-07

    This chapter consolidates recent work on the development of high performance multicore and accelerator-based implementations of sparse matrix-vector multiplication (SpMV). As an object of study, SpMV is an interesting computation for two key reasons. First, it appears widely in applications in scientific and engineering computing, financial and economic modeling, and information retrieval, among others, and is therefore of great practical interest. Secondly, it is both simple to describe but challenging to implement well, since its performance is limited by a variety of factors, including low computational intensity, potentially highly irregular memory access behavior, and a strong input dependence that be known only at run time. Thus, we believe SpMV is both practically important and provides important insights for understanding the algorithmic and implementation principles necessary to making effective use of state-of-the-art systems.

  3. Scalable Multi-core Architectures Design Methodologies and Tools

    CERN Document Server

    Jantsch, Axel

    2012-01-01

    As Moore’s law continues to unfold, two important trends have recently emerged. First, the growth of chip capacity is translated into a corresponding increase of number of cores. Second, the parallalization of the computation and 3D integration technologies lead to distributed memory architectures. This book provides a current snapshot of industrial and academic research, conducted as part of the European FP7 MOSART project, addressing urgent challenges in many-core architectures and application mapping.  It addresses the architectural design of many core chips, memory and data management, power management, design and programming methodologies. It also describes how new techniques have been applied in various industrial case studies. Describes trends towards distributed memory architectures and distributed power management; Integrates Network on Chip with distributed, shared memory architectures; Demonstrates novel design methodologies and frameworks for multi-core design space exploration; Shows how midll...

  4. Quantum key distribution over multicore fiber based on silicon photonics

    DEFF Research Database (Denmark)

    Ding, Yunhong; Bacco, Davide; Dalgaard, Kjeld

    on quantum physics. In order to exchange secure information between users, quantum key distribution (QKD), a branch of Quantum Communications (QCs), provides good prospects for ultimate security based on the laws of quantum mechanics [2–7]. Most of QKD systems are implemented in a point-to-point link using...... generations, to HD-entanglement distribution. Furthermore, MCFs are expected as a good candidate for overcoming the capacity limit of a current optical communication system, as example the record capacity of 661 Tbits/s was obtained last year with a 30-cores fiber [8]. Proof of concept experiment has already...... requirements in terms of key generation are needed. A solution may be represented by new technologies applied to quantum world. In particular multicore fiber (MCF) open a new scenario for quantum communications, from high-dimensional (HD) spatial entanglement generation, to HD QKD and multi-user key...

  5. Solving the generalized symmetric eigenvalue problem using tile algorithms on multicore architectures

    KAUST Repository

    Ltaief, Hatem; Luszczek, Piotr R.; Haidar, Azzam; Dongarra, Jack

    2012-01-01

    This paper proposes an efficient implementation of the generalized symmetric eigenvalue problem on multicore architecture. Based on a four-stage approach and tile algorithms, the original problem is first transformed into a standard symmetric

  6. Adaptive Optics Simulation for the World's Largest Telescope on Multicore Architectures with Multiple GPUs

    KAUST Repository

    Ltaief, Hatem; Gratadour, Damien; Charara, Ali; Gendron, Eric

    2016-01-01

    We present a high performance comprehensive implementation of a multi-object adaptive optics (MOAO) simulation on multicore architectures with hardware accelerators in the context of computational astronomy. This implementation will be used

  7. Photonic bandgap fiber lasers and multicore fiber lasers for next generation high power lasers

    DEFF Research Database (Denmark)

    Shirakawa, A.; Chen, M.; Suzuki, Y.

    2014-01-01

    Photonic bandgap fiber lasers are realizing new laser spectra and nonlinearity mitigation that a conventional fiber laser cannot. Multicore fiber lasers are a promising tool for power scaling by coherent beam combination. © 2014 OSA....

  8. Automated Parallel Computing Tools for Multicore Machines and Clusters, Phase I

    Data.gov (United States)

    National Aeronautics and Space Administration — We propose to improve productivity of high performance computing for applications on multicore computers and clusters. These machines built from one or more chips...

  9. Enhancing parallelism of tile bidiagonal transformation on multicore architectures using tree reduction

    KAUST Repository

    Ltaief, Hatem; Luszczek, Piotr R.; Dongarra, Jack

    2012-01-01

    The objective of this paper is to enhance the parallelism of the tile bidiagonal transformation using tree reduction on multicore architectures. First introduced by Ltaief et. al [LAPACK Working Note #247, 2011], the bidiagonal transformation using

  10. Wavelength-dependent Crosstalk in Trench-Assisted Multi-Core Fibers

    DEFF Research Database (Denmark)

    Ye, Feihong; Tu, Jiajing; Saitoh, Kunimasa

    2014-01-01

    Analytical expressions for wavelength-dependent crosstalk in homogeneous trench-assisted multi-core fibers are derived. The calculated results from the expressions agree well with the numerical simulation results based on finite element method.......Analytical expressions for wavelength-dependent crosstalk in homogeneous trench-assisted multi-core fibers are derived. The calculated results from the expressions agree well with the numerical simulation results based on finite element method....

  11. Design Tools for Accelerating Development and Usage of Multi-Core Computing Platforms

    Science.gov (United States)

    2014-04-01

    Government formulated or supplied the drawings, specifications, or other data does not license the holder or any other person or corporation ; or convey...multicore PDSP platforms. The GPU- based capabilities of TDIF are currently oriented towards NVIDIA GPUs, based on the Compute Unified Device Architecture...CUDA) programming language [ NVIDIA 2007], which can be viewed as an extension of C. The multicore PDSP capabilities currently in TDIF are oriented

  12. Fast ℓ1-SPIRiT Compressed Sensing Parallel Imaging MRI: Scalable Parallel Implementation and Clinically Feasible Runtime

    Science.gov (United States)

    Murphy, Mark; Alley, Marcus; Demmel, James; Keutzer, Kurt; Vasanawala, Shreyas; Lustig, Michael

    2012-01-01

    We present ℓ1-SPIRiT, a simple algorithm for auto calibrating parallel imaging (acPI) and compressed sensing (CS) that permits an efficient implementation with clinically-feasible runtimes. We propose a CS objective function that minimizes cross-channel joint sparsity in the Wavelet domain. Our reconstruction minimizes this objective via iterative soft-thresholding, and integrates naturally with iterative Self-Consistent Parallel Imaging (SPIRiT). Like many iterative MRI reconstructions, ℓ1-SPIRiT’s image quality comes at a high computational cost. Excessively long runtimes are a barrier to the clinical use of any reconstruction approach, and thus we discuss our approach to efficiently parallelizing ℓ1-SPIRiT and to achieving clinically-feasible runtimes. We present parallelizations of ℓ1-SPIRiT for both multi-GPU systems and multi-core CPUs, and discuss the software optimization and parallelization decisions made in our implementation. The performance of these alternatives depends on the processor architecture, the size of the image matrix, and the number of parallel imaging channels. Fundamentally, achieving fast runtime requires the correct trade-off between cache usage and parallelization overheads. We demonstrate image quality via a case from our clinical experimentation, using a custom 3DFT Spoiled Gradient Echo (SPGR) sequence with up to 8× acceleration via poisson-disc undersampling in the two phase-encoded directions. PMID:22345529

  13. Accuracies Of Optical Processors For Adaptive Optics

    Science.gov (United States)

    Downie, John D.; Goodman, Joseph W.

    1992-01-01

    Paper presents analysis of accuracies and requirements concerning accuracies of optical linear-algebra processors (OLAP's) in adaptive-optics imaging systems. Much faster than digital electronic processor and eliminate some residual distortion. Question whether errors introduced by analog processing of OLAP overcome advantage of greater speed. Paper addresses issue by presenting estimate of accuracy required in general OLAP that yields smaller average residual aberration of wave front than digital electronic processor computing at given speed.

  14. Functional Verification of Enhanced RISC Processor

    OpenAIRE

    SHANKER NILANGI; SOWMYA L

    2013-01-01

    This paper presents design and verification of a 32-bit enhanced RISC processor core having floating point computations integrated within the core, has been designed to reduce the cost and complexity. The designed 3 stage pipelined 32-bit RISC processor is based on the ARM7 processor architecture with single precision floating point multiplier, floating point adder/subtractor for floating point operations and 32 x 32 booths multiplier added to the integer core of ARM7. The binary representati...

  15. Alternative Water Processor Test Development

    Science.gov (United States)

    Pickering, Karen D.; Mitchell, Julie; Vega, Leticia; Adam, Niklas; Flynn, Michael; Wjee (er. Rau); Lunn, Griffin; Jackson, Andrew

    2012-01-01

    The Next Generation Life Support Project is developing an Alternative Water Processor (AWP) as a candidate water recovery system for long duration exploration missions. The AWP consists of biological water processor (BWP) integrated with a forward osmosis secondary treatment system (FOST). The basis of the BWP is a membrane aerated biological reactor (MABR), developed in concert with Texas Tech University. Bacteria located within the MABR metabolize organic material in wastewater, converting approximately 90% of the total organic carbon to carbon dioxide. In addition, bacteria convert a portion of the ammonia-nitrogen present in the wastewater to nitrogen gas, through a combination of nitrogen and denitrification. The effluent from the BWP system is low in organic contaminants, but high in total dissolved solids. The FOST system, integrated downstream of the BWP, removes dissolved solids through a combination of concentration-driven forward osmosis and pressure driven reverse osmosis. The integrated system is expected to produce water with a total organic carbon less than 50 mg/l and dissolved solids that meet potable water requirements for spaceflight. This paper describes the test definition, the design of the BWP and FOST subsystems, and plans for integrated testing.

  16. Data register and processor for multiwire chambers

    International Nuclear Information System (INIS)

    Karpukhin, V.V.

    1985-01-01

    A data register and a processor for data receiving and processing from drift chambers of a device for investigating relativistic positroniums are described. The data are delivered to the register input in the form of the Grey 8 bit code, memorized and transformed to a position code. The register information is delivered to the KAMAK trunk and to the front panel plug. The processor selects particle tracks in a horizontal plane of the facility. ΔY maximum coordinate divergence and minimum point quantity on the track are set from the processor front panel. Processor solution time is 16 μs maximum quantity of simultaneously analyzed coordinates is 16

  17. Many - body simulations using an array processor

    International Nuclear Information System (INIS)

    Rapaport, D.C.

    1985-01-01

    Simulations of microscopic models of water and polypeptides using molecular dynamics and Monte Carlo techniques have been carried out with the aid of an FPS array processor. The computational techniques are discussed, with emphasis on the development and optimization of the software to take account of the special features of the processor. The computing requirements of these simulations exceed what could be reasonably carried out on a normal 'scientific' computer. While the FPS processor is highly suited to the kinds of models described, several other computationally intensive problems in statistical mechanics are outlined for which alternative processor architectures are more appropriate

  18. Sensitometric control of roentgen film processors

    International Nuclear Information System (INIS)

    Forsberg, H.; Karolinska Sjukhuset, Stockholm

    1987-01-01

    Monitoring of film processors performance is essential since image quality, patient dose and costs are influenced by the performance. A system for sensitometric constancy control of film processors and their associated components is described. Experience with the system for 3 years is given when implemented on 17 film processors. Modern high quality film processors have a stability that makes a test frequency of once a week sufficient to maintain adequate image quality. The test system is so sensitive that corrective actions almost invariably have been taken before any technical problem degraded the image quality to a visible degree. (orig.)

  19. Compilation Techniques Specific for a Hardware Cryptography-Embedded Multimedia Mobile Processor

    Directory of Open Access Journals (Sweden)

    Masa-aki FUKASE

    2007-12-01

    Full Text Available The development of single chip VLSI processors is the key technology of ever growing pervasive computing to answer overall demands for usability, mobility, speed, security, etc. We have so far developed a hardware cryptography-embedded multimedia mobile processor architecture, HCgorilla. Since HCgorilla integrates a wide range of techniques from architectures to applications and languages, one-sided design approach is not always useful. HCgorilla needs more complicated strategy, that is, hardware/software (H/S codesign. Thus, we exploit the software support of HCgorilla composed of a Java interface and parallelizing compilers. They are assumed to be installed in servers in order to reduce the load and increase the performance of HCgorilla-embedded clients. Since compilers are the essence of software's responsibility, we focus in this article on our recent results about the design, specifications, and prototyping of parallelizing compilers for HCgorilla. The parallelizing compilers are composed of a multicore compiler and a LIW compiler. They are specified to abstract parallelism from executable serial codes or the Java interface output and output the codes executable in parallel by HCgorilla. The prototyping compilers are written in Java. The evaluation by using an arithmetic test program shows the reasonability of the prototyping compilers compared with hand compilers.

  20. Tuning iteration space slicing based tiled multi-core code implementing Nussinov's RNA folding.

    Science.gov (United States)

    Palkowski, Marek; Bielecki, Wlodzimierz

    2018-01-15

    parallel tiled code implementing Nussinov's RNA folding. Experimental results, received on modern Intel multi-core processors, demonstrate that this code outperforms known closely related implementations when the length of RNA strands is bigger than 2500.

  1. Fast semivariogram computation using FPGA architectures

    Science.gov (United States)

    Lagadapati, Yamuna; Shirvaikar, Mukul; Dong, Xuanliang

    2015-02-01

    . Computational speedup is measured with respect to Matlab implementation on a personal computer with an Intel i7 multi-core processor. Preliminary simulation results indicate that a significant advantage in speed can be attained by the architectures, making the algorithm viable for implementation in medical devices

  2. Producing chopped firewood with firewood processors

    International Nuclear Information System (INIS)

    Kaerhae, K.; Jouhiaho, A.

    2009-01-01

    The TTS Institute's research and development project studied both the productivity of new, chopped firewood processors (cross-cutting and splitting machines) suitable for professional and independent small-scale production, and the costs of the chopped firewood produced. Seven chopped firewood processors were tested in the research, six of which were sawing processors and one shearing processor. The chopping work was carried out using wood feeding racks and a wood lifter. The work was also carried out without any feeding appliances. Altogether 132.5 solid m 3 of wood were chopped in the time studies. The firewood processor used had the most significant impact on chopping work productivity. In addition to the firewood processor, the stem mid-diameter, the length of the raw material, and of the firewood were also found to affect productivity. The wood feeding systems also affected productivity. If there is a feeding rack and hydraulic grapple loader available for use in chopping firewood, then it is worth using the wood feeding rack. A wood lifter is only worth using with the largest stems (over 20 cm mid-diameter) if a feeding rack cannot be used. When producing chopped firewood from small-diameter wood, i.e. with a mid-diameter less than 10 cm, the costs of chopping work were over 10 EUR solid m -3 with sawing firewood processors. The shearing firewood processor with a guillotine blade achieved a cost level of 5 EUR solid m -3 when the mid-diameter of the chopped stem was 10 cm. In addition to the raw material, the cost-efficient chopping work also requires several hundred annual operating hours with a firewood processor, which is difficult for individual firewood entrepreneurs to achieve. The operating hours of firewood processors can be increased to the required level by the joint use of the processors by a number of firewood entrepreneurs. (author)

  3. Application of digital beam position processor Libera on tune measurement

    International Nuclear Information System (INIS)

    Zhang Chunhui; Sun Baogen; Cao Yong; Lu Ping; Li Jihao

    2006-01-01

    Digital signal processing (DSP) is widely used in the field of beam diagnostics. Especially, DSP achieves very good performance in beam position signal analysis and betatron tune measurement. In Hefei light source, when beam was excited by narrow-band Gaussian white nose, Libera, a digital beam position processor, was used to process the signals from beam position monitor (BPM), which contained betatron oscillation. Fast Fourier transform (FFT) was applied to finding out betatron resonance frequency, from which the decimal part of betatron oscillation tune was calculated. By this means, the measure of horizontal tune was 3.5352 and the measure of vertical tune is 2.6299. (authors)

  4. The Associative Memory Boards for the FTK Processor at ATLAS

    CERN Document Server

    Calabro, D; The ATLAS collaboration; Citraro, S; Donati, S; Giannetti, P; Lanza, A; Luciano, P; Magalotti, D; Piendibene, M

    2013-01-01

    The Associative Memory (AM) system, the main part of the FastTracker (FTK) processor, is designed to perform pattern matching using the information of the silicon tracking detectors. It finds track candidates at low resolution that are seeds for the following step performing precise track fitting. The system has to support challenging data traffic, handled by a group of modern low cost FPGAs, the Xilinx Spartan6 chips, which have Low-Power Gigabit Transceivers (GTP). Each GTP transceiver is a combined transmitter and receiver capable of operating at data rates up to 3.2 Gb/s. \

  5. Micro processors for plant protection

    International Nuclear Information System (INIS)

    McAffer, N.T.C.

    1976-01-01

    Micro computers can be used satisfactorily in general protection duties with economic advantages over hardwired systems. The reliability of such protection functions can be enhanced by keeping the task performed by each protection micro processor simple and by avoiding such a task being dependent on others in any substantial way. This implies that vital work done for any task is kept within it and that any communications from it to outside or to it from outside are restricted to those for controlling data transfer. Also that the amount of this data should be the minimum consistent with satisfactory task execution. Technology is changing rapidly and devices may become obsolete and be supplanted by new ones before their theoretical reliability can be confirmed or otherwise by field service. This emphasises the need for users to pool device performance data so that effective reliability judgements can be made within the lifetime of the devices. (orig.) [de

  6. Development and validation of a two-dimensional fast-response flood estimation model

    Energy Technology Data Exchange (ETDEWEB)

    Judi, David R [Los Alamos National Laboratory; Mcpherson, Timothy N [Los Alamos National Laboratory; Burian, Steven J [UNIV OF UTAK

    2009-01-01

    A finite difference formulation of the shallow water equations using an upwind differencing method was developed maintaining computational efficiency and accuracy such that it can be used as a fast-response flood estimation tool. The model was validated using both laboratory controlled experiments and an actual dam breach. Through the laboratory experiments, the model was shown to give good estimations of depth and velocity when compared to the measured data, as well as when compared to a more complex two-dimensional model. Additionally, the model was compared to high water mark data obtained from the failure of the Taum Sauk dam. The simulated inundation extent agreed well with the observed extent, with the most notable differences resulting from the inability to model sediment transport. The results of these validation studies complex two-dimensional model. Additionally, the model was compared to high water mark data obtained from the failure of the Taum Sauk dam. The simulated inundation extent agreed well with the observed extent, with the most notable differences resulting from the inability to model sediment transport. The results of these validation studies show that a relatively numerical scheme used to solve the complete shallow water equations can be used to accurately estimate flood inundation. Future work will focus on further reducing the computation time needed to provide flood inundation estimates for fast-response analyses. This will be accomplished through the efficient use of multi-core, multi-processor computers coupled with an efficient domain-tracking algorithm, as well as an understanding of the impacts of grid resolution on model results.

  7. Towards a Process Algebra for Shared Processors

    DEFF Research Database (Denmark)

    Buchholtz, Mikael; Andersen, Jacob; Løvengreen, Hans Henrik

    2002-01-01

    We present initial work on a timed process algebra that models sharing of processor resources allowing preemption at arbitrary points in time. This enables us to model both the functional and the timely behaviour of concurrent processes executed on a single processor. We give a refinement relation...

  8. Vector and parallel processors in computational science

    International Nuclear Information System (INIS)

    Duff, I.S.; Reid, J.K.

    1985-01-01

    These proceedings contain the articles presented at the named conference. These concern hardware and software for vector and parallel processors, numerical methods and algorithms for the computation on such processors, as well as applications of such methods to different fields of physics and related sciences. See hints under the relevant topics. (HSI)

  9. The communication processor of TUMULT-64

    NARCIS (Netherlands)

    Smit, Gerardus Johannes Maria; Jansen, P.G.

    1988-01-01

    Tumult (Twente University MULTi-processor system) is a modular extendible multi-processor system designed and implemented at the Twente University of Technology in co-operation with Oce Nederland B.V. and the Dr. Neher Laboratories (Dutch PTT). Characteristics of the hardware are: MIMD type,

  10. An interactive parallel processor for data analysis

    International Nuclear Information System (INIS)

    Mong, J.; Logan, D.; Maples, C.; Rathbun, W.; Weaver, D.

    1984-01-01

    A parallel array of eight minicomputers has been assembled in an attempt to deal with kiloparameter data events. By exporting computer system functions to a separate processor, the authors have been able to achieve computer amplification linearly proportional to the number of executing processors

  11. A FPGA-Based, Granularity-Variable Neuromorphic Processor and Its Application in a MIMO Real-Time Control System.

    Science.gov (United States)

    Zhang, Zhen; Ma, Cheng; Zhu, Rong

    2017-08-23

    Artificial Neural Networks (ANNs), including Deep Neural Networks (DNNs), have become the state-of-the-art methods in machine learning and achieved amazing success in speech recognition, visual object recognition, and many other domains. There are several hardware platforms for developing accelerated implementation of ANN models. Since Field Programmable Gate Array (FPGA) architectures are flexible and can provide high performance per watt of power consumption, they have drawn a number of applications from scientists. In this paper, we propose a FPGA-based, granularity-variable neuromorphic processor (FBGVNP). The traits of FBGVNP can be summarized as granularity variability, scalability, integrated computing, and addressing ability: first, the number of neurons is variable rather than constant in one core; second, the multi-core network scale can be extended in various forms; third, the neuron addressing and computing processes are executed simultaneously. These make the processor more flexible and better suited for different applications. Moreover, a neural network-based controller is mapped to FBGVNP and applied in a multi-input, multi-output, (MIMO) real-time, temperature-sensing and control system. Experiments validate the effectiveness of the neuromorphic processor. The FBGVNP provides a new scheme for building ANNs, which is flexible, highly energy-efficient, and can be applied in many areas.

  12. A FPGA-Based, Granularity-Variable Neuromorphic Processor and Its Application in a MIMO Real-Time Control System

    Directory of Open Access Journals (Sweden)

    Zhen Zhang

    2017-08-01

    Full Text Available Artificial Neural Networks (ANNs, including Deep Neural Networks (DNNs, have become the state-of-the-art methods in machine learning and achieved amazing success in speech recognition, visual object recognition, and many other domains. There are several hardware platforms for developing accelerated implementation of ANN models. Since Field Programmable Gate Array (FPGA architectures are flexible and can provide high performance per watt of power consumption, they have drawn a number of applications from scientists. In this paper, we propose a FPGA-based, granularity-variable neuromorphic processor (FBGVNP. The traits of FBGVNP can be summarized as granularity variability, scalability, integrated computing, and addressing ability: first, the number of neurons is variable rather than constant in one core; second, the multi-core network scale can be extended in various forms; third, the neuron addressing and computing processes are executed simultaneously. These make the processor more flexible and better suited for different applications. Moreover, a neural network-based controller is mapped to FBGVNP and applied in a multi-input, multi-output, (MIMO real-time, temperature-sensing and control system. Experiments validate the effectiveness of the neuromorphic processor. The FBGVNP provides a new scheme for building ANNs, which is flexible, highly energy-efficient, and can be applied in many areas.

  13. Comparison of Processor Performance of SPECint2006 Benchmarks of some Intel Xeon Processors

    OpenAIRE

    Abdul Kareem PARCHUR; Ram Asaray SINGH

    2012-01-01

    High performance is a critical requirement to all microprocessors manufacturers. The present paper describes the comparison of performance in two main Intel Xeon series processors (Type A: Intel Xeon X5260, X5460, E5450 and L5320 and Type B: Intel Xeon X5140, 5130, 5120 and E5310). The microarchitecture of these processors is implemented using the basis of a new family of processors from Intel starting with the Pentium 4 processor. These processors can provide a performance boost for many ke...

  14. Neurovision processor for designing intelligent sensors

    Science.gov (United States)

    Gupta, Madan M.; Knopf, George K.

    1992-03-01

    A programmable multi-task neuro-vision processor, called the Positive-Negative (PN) neural processor, is proposed as a plausible hardware mechanism for constructing robust multi-task vision sensors. The computational operations performed by the PN neural processor are loosely based on the neural activity fields exhibited by certain nervous tissue layers situated in the brain. The neuro-vision processor can be programmed to generate diverse dynamic behavior that may be used for spatio-temporal stabilization (STS), short-term visual memory (STVM), spatio-temporal filtering (STF) and pulse frequency modulation (PFM). A multi- functional vision sensor that performs a variety of information processing operations on time- varying two-dimensional sensory images can be constructed from a parallel and hierarchical structure of numerous individually programmed PN neural processors.

  15. Development of a highly reliable CRT processor

    International Nuclear Information System (INIS)

    Shimizu, Tomoya; Saiki, Akira; Hirai, Kenji; Jota, Masayoshi; Fujii, Mikiya

    1996-01-01

    Although CRT processors have been employed by the main control board to reduce the operator's workload during monitoring, the control systems are still operated by hardware switches. For further advancement, direct controller operation through a display device is expected. A CRT processor providing direct controller operation must be as reliable as the hardware switches are. The authors are developing a new type of highly reliable CRT processor that enables direct controller operations. In this paper, we discuss the design principles behind a highly reliable CRT processor. The principles are defined by studies of software reliability and of the functional reliability of the monitoring and operation systems. The functional configuration of an advanced CRT processor is also addressed. (author)

  16. Computer Generated Inputs for NMIS Processor Verification

    International Nuclear Information System (INIS)

    J. A. Mullens; J. E. Breeding; J. A. McEvers; R. W. Wysor; L. G. Chiang; J. R. Lenarduzzi; J. T. Mihalczo; J. K. Mattingly

    2001-01-01

    Proper operation of the Nuclear Identification Materials System (NMIS) processor can be verified using computer-generated inputs [BIST (Built-In-Self-Test)] at the digital inputs. Preselected sequences of input pulses to all channels with known correlation functions are compared to the output of the processor. These types of verifications have been utilized in NMIS type correlation processors at the Oak Ridge National Laboratory since 1984. The use of this test confirmed a malfunction in a NMIS processor at the All-Russian Scientific Research Institute of Experimental Physics (VNIIEF) in 1998. The NMIS processor boards were returned to the U.S. for repair and subsequently used in NMIS passive and active measurements with Pu at VNIIEF in 1999

  17. Use of Digital Signal Processors (DSP) in high energy physics experiments

    International Nuclear Information System (INIS)

    Crosetto, D.

    1988-01-01

    The FDDP - Fast Digital Data Processor - is a modular system for executing parallel digital processing algorithms to perform programmable trigger decisions or programmable on-line data reduction. Typical application involve zero suppression and pulse shape analysis. The characteristics of the system are: modularity, expandability and flexibility. (author). 4 refs, 5 figs

  18. Multicore fibre photonic lanterns for precision radial velocity Science

    Science.gov (United States)

    Gris-Sánchez, Itandehui; Haynes, Dionne M.; Ehrlich, Katjana; Haynes, Roger; Birks, Tim A.

    2018-04-01

    Incomplete fibre scrambling and fibre modal noise can degrade high-precision spectroscopic applications (typically high spectral resolution and high signal to noise). For example, it can be the dominating error source for exoplanet finding spectrographs, limiting the maximum measurement precision possible with such facilities. This limitation is exacerbated in the next generation of infra-red based systems, as the number of modes supported by the fibre scales inversely with the wavelength squared and more modes typically equates to better scrambling. Substantial effort has been made by major research groups in this area to improve the fibre link performance by employing non-circular fibres, double scramblers, fibre shakers, and fibre stretchers. We present an original design of a multicore fibre (MCF) terminated with multimode photonic lantern ports. It is designed to act as a relay fibre with the coupling efficiency of a multimode fibre (MMF), modal stability similar to a single-mode fibre and low loss in a wide range of wavelengths (380 nm to 860 nm). It provides phase and amplitude scrambling to achieve a stable near field and far-field output illumination pattern despite input coupling variations, and low modal noise for increased stability for high signal-to-noise applications such as precision radial velocity (PRV) science. Preliminary results are presented for a 511-core MCF and compared with current state of the art octagonal fibre.

  19. Analytical Bounds on the Threads in IXP1200 Network Processor

    OpenAIRE

    Ramakrishna, STGS; Jamadagni, HS

    2003-01-01

    Increasing link speeds have placed enormous burden on the processing requirements and the processors are expected to carry out a variety of tasks. Network Processors (NP) [1] [2] is the blanket name given to the processors, which are traded for flexibility and performance. Network Processors are offered by a number of vendors; to take the main burden of processing requirement of network related operations from the conventional processors. The Network Processors cover a spectrum of design trad...

  20. Performance modeling and analysis of parallel Gaussian elimination on multi-core computers

    Directory of Open Access Journals (Sweden)

    Fadi N. Sibai

    2014-01-01

    Full Text Available Gaussian elimination is used in many applications and in particular in the solution of systems of linear equations. This paper presents mathematical performance models and analysis of four parallel Gaussian Elimination methods (precisely the Original method and the new Meet in the Middle –MiM– algorithms and their variants with SIMD vectorization on multi-core systems. Analytical performance models of the four methods are formulated and presented followed by evaluations of these models with modern multi-core systems’ operation latencies. Our results reveal that the four methods generally exhibit good performance scaling with increasing matrix size and number of cores. SIMD vectorization only makes a large difference in performance for low number of cores. For a large matrix size (n ⩾ 16 K, the performance difference between the MiM and Original methods falls from 16× with four cores to 4× with 16 K cores. The efficiencies of all four methods are low with 1 K cores or more stressing a major problem of multi-core systems where the network-on-chip and memory latencies are too high in relation to basic arithmetic operations. Thus Gaussian Elimination can greatly benefit from the resources of multi-core systems, but higher performance gains can be achieved if multi-core systems can be designed with lower memory operation, synchronization, and interconnect communication latencies, requirements of utmost importance and challenge in the exascale computing age.

  1. Optimization of sparse matrix-vector multiplication on emerging multicore platforms

    Energy Technology Data Exchange (ETDEWEB)

    Williams, Samuel [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Univ. of California, Berkeley, CA (United States); Oliker, Leonid [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Vuduc, Richard [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Shalf, John [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Yelick, Katherine [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Univ. of California, Berkeley, CA (United States); Demmel, James [Univ. of California, Berkeley, CA (United States)

    2007-01-01

    We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as every electronic device from cell phones to supercomputers confronts parallelism of unprecedented scale. To fully unleash the potential of these systems, the HPC community must develop multicore specific optimization methodologies for important scientific computations. In this work, we examine sparse matrix-vector multiply (SpMV) - one of the most heavily used kernels in scientific computing - across a broad spectrum of multicore designs. Our experimental platform includes the homogeneous AMD dual-core and Intel quad-core designs, the heterogeneous STI Cell, as well as the first scientific study of the highly multithreaded Sun Niagara2. We present several optimization strategies especially effective for the multicore environment, and demonstrate significant performance improvements compared to existing state-of-the-art serial and parallel SpMV implementations. Additionally, we present key insights into the architectural tradeoffs of leading multicore design strategies, in the context of demanding memory-bound numerical algorithms.

  2. Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms

    Energy Technology Data Exchange (ETDEWEB)

    Williams, Samuel; Oliker, Leonid; Vuduc, Richard; Shalf, John; Yelick, Katherine; Demmel, James

    2008-10-16

    We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as every electronic device from cell phones to supercomputers confronts parallelism of unprecedented scale. To fully unleash the potential of these systems, the HPC community must develop multicore specific-optimization methodologies for important scientific computations. In this work, we examine sparse matrix-vector multiply (SpMV) - one of the most heavily used kernels in scientific computing - across a broad spectrum of multicore designs. Our experimental platform includes the homogeneous AMD quad-core, AMD dual-core, and Intel quad-core designs, the heterogeneous STI Cell, as well as one of the first scientific studies of the highly multithreaded Sun Victoria Falls (a Niagara2 SMP). We present several optimization strategies especially effective for the multicore environment, and demonstrate significant performance improvements compared to existing state-of-the-art serial and parallel SpMV implementations. Additionally, we present key insights into the architectural trade-offs of leading multicore design strategies, in the context of demanding memory-bound numerical algorithms.

  3. Effect of processor temperature on film dosimetry

    International Nuclear Information System (INIS)

    Srivastava, Shiv P.; Das, Indra J.

    2012-01-01

    Optical density (OD) of a radiographic film plays an important role in radiation dosimetry, which depends on various parameters, including beam energy, depth, field size, film batch, dose, dose rate, air film interface, postexposure processing time, and temperature of the processor. Most of these parameters have been studied for Kodak XV and extended dose range (EDR) films used in radiation oncology. There is very limited information on processor temperature, which is investigated in this study. Multiple XV and EDR films were exposed in the reference condition (d max. , 10 × 10 cm 2 , 100 cm) to a given dose. An automatic film processor (X-Omat 5000) was used for processing films. The temperature of the processor was adjusted manually with increasing temperature. At each temperature, a set of films was processed to evaluate OD at a given dose. For both films, OD is a linear function of processor temperature in the range of 29.4–40.6°C (85–105°F) for various dose ranges. The changes in processor temperature are directly related to the dose by a quadratic function. A simple linear equation is provided for the changes in OD vs. processor temperature, which could be used for correcting dose in radiation dosimetry when film is used.

  4. Optical Associative Processors For Visual Perception"

    Science.gov (United States)

    Casasent, David; Telfer, Brian

    1988-05-01

    We consider various associative processor modifications required to allow these systems to be used for visual perception, scene analysis, and object recognition. For these applications, decisions on the class of the objects present in the input image are required and thus heteroassociative memories are necessary (rather than the autoassociative memories that have been given most attention). We analyze the performance of both associative processors and note that there is considerable difference between heteroassociative and autoassociative memories. We describe associative processors suitable for realizing functions such as: distortion invariance (using linear discriminant function memory synthesis techniques), noise and image processing performance (using autoassociative memories in cascade with with a heteroassociative processor and with a finite number of autoassociative memory iterations employed), shift invariance (achieved through the use of associative processors operating on feature space data), and the analysis of multiple objects in high noise (which is achieved using associative processing of the output from symbolic correlators). We detail and provide initial demonstrations of the use of associative processors operating on iconic, feature space and symbolic data, as well as adaptive associative processors.

  5. Computing trends using graphic processor in high energy physics

    CERN Document Server

    Niculescu, Mihai

    2011-01-01

    One of the main challenges in Heavy Energy Physics is to make fast analysis of high amount of experimental and simulated data. At LHC-CERN one p-p event is approximate 1 Mb in size. The time taken to analyze the data and obtain fast results depends on high computational power. The main advantage of using GPU(Graphic Processor Unit) programming over traditional CPU one is that graphical cards bring a lot of computing power at a very low price. Today a huge number of application(scientific, financial etc) began to be ported or developed for GPU, including Monte Carlo tools or data analysis tools for High Energy Physics. In this paper, we'll present current status and trends in HEP using GPU.

  6. Development of Innovative Design Processor

    International Nuclear Information System (INIS)

    Park, Y.S.; Park, C.O.

    2004-01-01

    The nuclear design analysis requires time-consuming and erroneous model-input preparation, code run, output analysis and quality assurance process. To reduce human effort and improve design quality and productivity, Innovative Design Processor (IDP) is being developed. Two basic principles of IDP are the document-oriented design and the web-based design. The document-oriented design is that, if the designer writes a design document called active document and feeds it to a special program, the final document with complete analysis, table and plots is made automatically. The active documents can be written with ordinary HTML editors or created automatically on the web, which is another framework of IDP. Using the proper mix-up of server side and client side programming under the LAMP (Linux/Apache/MySQL/PHP) environment, the design process on the web is modeled as a design wizard style so that even a novice designer makes the design document easily. This automation using the IDP is now being implemented for all the reload design of Korea Standard Nuclear Power Plant (KSNP) type PWRs. The introduction of this process will allow large reduction in all reload design efforts of KSNP and provide a platform for design and R and D tasks of KNFC. (authors)

  7. Onboard spectral imager data processor

    Science.gov (United States)

    Otten, Leonard J.; Meigs, Andrew D.; Franklin, Abraham J.; Sears, Robert D.; Robison, Mark W.; Rafert, J. Bruce; Fronterhouse, Donald C.; Grotbeck, Ronald L.

    1999-10-01

    Previous papers have described the concept behind the MightySat II.1 program, the satellite's Fourier Transform imaging spectrometer's optical design, the design for the spectral imaging payload, and its initial qualification testing. This paper discusses the on board data processing designed to reduce the amount of downloaded data by an order of magnitude and provide a demonstration of a smart spaceborne spectral imaging sensor. Two custom components, a spectral imager interface 6U VME card that moves data at over 30 MByte/sec, and four TI C-40 processors mounted to a second 6U VME and daughter card, are used to adapt the sensor to the spacecraft and provide the necessary high speed processing. A system architecture that offers both on board real time image processing and high-speed post data collection analysis of the spectral data has been developed. In addition to the on board processing of the raw data into a usable spectral data volume, one feature extraction technique has been incorporated. This algorithm operates on the basic interferometric data. The algorithm is integrated within the data compression process to search for uploadable feature descriptions.

  8. A data base processor semantics specification package

    Science.gov (United States)

    Fishwick, P. A.

    1983-01-01

    A Semantics Specification Package (DBPSSP) for the Intel Data Base Processor (DBP) is defined. DBPSSP serves as a collection of cross assembly tools that allow the analyst to assemble request blocks on the host computer for passage to the DBP. The assembly tools discussed in this report may be effectively used in conjunction with a DBP compatible data communications protocol to form a query processor, precompiler, or file management system for the database processor. The source modules representing the components of DBPSSP are fully commented and included.

  9. Hardware trigger processor for the MDT system

    CERN Document Server

    AUTHOR|(SzGeCERN)757787; The ATLAS collaboration; Hazen, Eric; Butler, John; Black, Kevin; Gastler, Daniel Edward; Ntekas, Konstantinos; Taffard, Anyes; Martinez Outschoorn, Verena; Ishino, Masaya; Okumura, Yasuyuki

    2017-01-01

    We are developing a low-latency hardware trigger processor for the Monitored Drift Tube system in the Muon spectrometer. The processor will fit candidate Muon tracks in the drift tubes in real time, improving significantly the momentum resolution provided by the dedicated trigger chambers. We present a novel pure-FPGA implementation of a Legendre transform segment finder, an associative-memory alternative implementation, an ARM (Zynq) processor-based track fitter, and compact ATCA carrier board architecture. The ATCA architecture is designed to allow a modular, staged approach to deployment of the system and exploration of alternative technologies.

  10. Influence of fibre design and curvature on crosstalk in multi-core fibre

    Energy Technology Data Exchange (ETDEWEB)

    Egorova, O N; Astapovich, M S; Semjonov, S L; Dianov, E M [Fiber Optics Research Center, Russian Academy of Sciences, Moscow (Russian Federation); Melnikov, L A [Kotel' nikov Institute of Radio Engineering and Electronics, Russian Academy of Sciences, Saratov Branch, Saratov (Russian Federation); Salganskii, M Yu [G.G.Devyatykh Institute of Chemistry of High-Purity Substances, Russian Academy of Sciences, Nizhnii Novgorod (Russian Federation); Mishkin, S N; Nishchev, K N [N.P. Ogarev Mordovia State University, Physics and Chemistry Institute, Saransk (Russian Federation)

    2016-03-31

    We have studied the influence of cross-sectional structure and bends on optical cross-talk in a multicore fibre. A reduced refractive index layer produced between the cores of such fibre with a small centre-to-centre spacing between neighbouring cores (27 μm) reduces optical cross-talk by 20 dB. The cross-talk level achieved, 30 dB per kilometre of the length of the multicore fibre, is acceptable for a number of applications where relatively small lengths of fibre are needed. Moreover, a significant decrease in optical cross-talk has been ensured by reducing the winding diameter of multicore fibres with identical cores. (fiber optics)

  11. Influence of fibre design and curvature on crosstalk in multi-core fibre

    International Nuclear Information System (INIS)

    Egorova, O N; Astapovich, M S; Semjonov, S L; Dianov, E M; Melnikov, L A; Salganskii, M Yu; Mishkin, S N; Nishchev, K N

    2016-01-01

    We have studied the influence of cross-sectional structure and bends on optical cross-talk in a multicore fibre. A reduced refractive index layer produced between the cores of such fibre with a small centre-to-centre spacing between neighbouring cores (27 μm) reduces optical cross-talk by 20 dB. The cross-talk level achieved, 30 dB per kilometre of the length of the multicore fibre, is acceptable for a number of applications where relatively small lengths of fibre are needed. Moreover, a significant decrease in optical cross-talk has been ensured by reducing the winding diameter of multicore fibres with identical cores. (fiber optics)

  12. HEP specific benchmarks of virtual machines on multi-core CPU architectures

    International Nuclear Information System (INIS)

    Alef, M; Gable, I

    2010-01-01

    Virtualization technologies such as Xen can be used in order to satisfy the disparate and often incompatible system requirements of different user groups in shared-use computing facilities. This capability is particularly important for HEP applications, which often have restrictive requirements. The use of virtualization adds flexibility, however, it is essential that the virtualization technology place little overhead on the HEP application. We present an evaluation of the practicality of running HEP applications in multiple Virtual Machines (VMs) on a single multi-core Linux system. We use the benchmark suite used by the HEPiX CPU Benchmarking Working Group to give a quantitative evaluation relevant to the HEP community. Benchmarks are packaged inside VMs and then the VMs are booted onto a single multi-core system. Benchmarks are then simultaneously executed on each VM to simulate highly loaded VMs running HEP applications. These techniques are applied to a variety of multi-core CPU architectures and VM configurations.

  13. Photonics and Fiber Optics Processor Lab

    Data.gov (United States)

    Federal Laboratory Consortium — The Photonics and Fiber Optics Processor Lab develops, tests and evaluates high speed fiber optic network components as well as network protocols. In addition, this...

  14. Keystone Business Models for Network Security Processors

    OpenAIRE

    Arthur Low; Steven Muegge

    2013-01-01

    Network security processors are critical components of high-performance systems built for cybersecurity. Development of a network security processor requires multi-domain experience in semiconductors and complex software security applications, and multiple iterations of both software and hardware implementations. Limited by the business models in use today, such an arduous task can be undertaken only by large incumbent companies and government organizations. Neither the “fabless semiconductor...

  15. Real time monitoring of electron processors

    International Nuclear Information System (INIS)

    Nablo, S.V.; Kneeland, D.R.; McLaughlin, W.L.

    1995-01-01

    A real time radiation monitor (RTRM) has been developed for monitoring the dose rate (current density) of electron beam processors. The system provides continuous monitoring of processor output, electron beam uniformity, and an independent measure of operating voltage or electron energy. In view of the device's ability to replace labor-intensive dosimetry in verification of machine performance on a real-time basis, its application to providing archival performance data for in-line processing is discussed. (author)

  16. From functional programming to multicore parallelism: A case study based on Presburger Arithmetic

    DEFF Research Database (Denmark)

    Dung, Phan Anh; Hansen, Michael Reichhardt

    2011-01-01

    , we are interested in using PA in connection with the Duration Calculus Model Checker (DCMC) [5]. There are effective decision procedures for PA including Cooper’s algorithm and the Omega Test; however, their complexity is extremely high with doubly exponential lower bound and triply exponential upper...... bound [7]. We investigate these decision procedures in the context of multicore parallelism with the hope of exploiting multicore powers. Unfortunately, we are not aware of any prior parallelism research related to decision procedures for PA. The closest work is the preliminary results on parallelism...

  17. Multi-Threaded DNA Tag/Anti-Tag Library Generator for Multi-Core Platforms

    Science.gov (United States)

    2009-05-01

    base pair)  Watson ‐ Crick  strand pairs that bind perfectly within pairs, but poorly across pairs. A variety  of  DNA  strand hybridization metrics...AFRL-RI-RS-TR-2009-131 Final Technical Report May 2009 MULTI-THREADED DNA TAG/ANTI-TAG LIBRARY GENERATOR FOR MULTI-CORE PLATFORMS...TYPE Final 3. DATES COVERED (From - To) Jun 08 – Feb 09 4. TITLE AND SUBTITLE MULTI-THREADED DNA TAG/ANTI-TAG LIBRARY GENERATOR FOR MULTI-CORE

  18. High-dimensional quantum key distribution based on multicore fiber using silicon photonic integrated circuits

    DEFF Research Database (Denmark)

    Ding, Yunhong; Bacco, Davide; Dalgaard, Kjeld

    2017-01-01

    is intrinsically limited to 1 bit/photon. Here we propose and experimentally demonstrate, for the first time, a high-dimensional quantum key distribution protocol based on space division multiplexing in multicore fiber using silicon photonic integrated lightwave circuits. We successfully realized three mutually......-dimensional quantum states, and enables breaking the information efficiency limit of traditional quantum key distribution protocols. In addition, the silicon photonic circuits used in our work integrate variable optical attenuators, highly efficient multicore fiber couplers, and Mach-Zehnder interferometers, enabling...

  19. Multi-Core Emptiness Checking of Timed Büchi Automata using Inclusion Abstraction

    DEFF Research Database (Denmark)

    Laarman, Alfons; Olesen, Mads Chr.; Dalsgaard, Andreas

    2013-01-01

    This paper contributes to the multi-core model checking of timed automata (TA) with respect to liveness properties, by investigating checking of TA Büchi emptiness under the very coarse inclusion abstraction or zone subsumption, an open problem in this field. We show that in general Büchi emptiness...... parallel LTL model checking algorithm for timed automata. The algorithms are implemented in LTSmin, and experimental evaluations show the effectiveness and scalability of both contributions: subsumption halves the number of states in the real-world FDDI case study, and the multi-core algorithm yields...

  20. Core-to-core uniformity improvement in multi-core fiber Bragg gratings

    Science.gov (United States)

    Lindley, Emma; Min, Seong-Sik; Leon-Saval, Sergio; Cvetojevic, Nick; Jovanovic, Nemanja; Bland-Hawthorn, Joss; Lawrence, Jon; Gris-Sanchez, Itandehui; Birks, Tim; Haynes, Roger; Haynes, Dionne

    2014-07-01

    Multi-core fiber Bragg gratings (MCFBGs) will be a valuable tool not only in communications but also various astronomical, sensing and industry applications. In this paper we address some of the technical challenges of fabricating effective multi-core gratings by simulating improvements to the writing method. These methods allow a system designed for inscribing single-core fibers to cope with MCFBG fabrication with only minor, passive changes to the writing process. Using a capillary tube that was polished on one side, the field entering the fiber was flattened which improved the coverage and uniformity of all cores.

  1. Accuracy Limitations in Optical Linear Algebra Processors

    Science.gov (United States)

    Batsell, Stephen Gordon

    1990-01-01

    One of the limiting factors in applying optical linear algebra processors (OLAPs) to real-world problems has been the poor achievable accuracy of these processors. Little previous research has been done on determining noise sources from a systems perspective which would include noise generated in the multiplication and addition operations, noise from spatial variations across arrays, and from crosstalk. In this dissertation, we propose a second-order statistical model for an OLAP which incorporates all these system noise sources. We now apply this knowledge to determining upper and lower bounds on the achievable accuracy. This is accomplished by first translating the standard definition of accuracy used in electronic digital processors to analog optical processors. We then employ our second-order statistical model. Having determined a general accuracy equation, we consider limiting cases such as for ideal and noisy components. From the ideal case, we find the fundamental limitations on improving analog processor accuracy. From the noisy case, we determine the practical limitations based on both device and system noise sources. These bounds allow system trade-offs to be made both in the choice of architecture and in individual components in such a way as to maximize the accuracy of the processor. Finally, by determining the fundamental limitations, we show the system engineer when the accuracy desired can be achieved from hardware or architecture improvements and when it must come from signal pre-processing and/or post-processing techniques.

  2. Architectural design and analysis of a programmable image processor

    International Nuclear Information System (INIS)

    Siyal, M.Y.; Chowdhry, B.S.; Rajput, A.Q.K.

    2003-01-01

    In this paper we present an architectural design and analysis of a programmable image processor, nicknamed Snake. The processor was designed with a high degree of parallelism to speed up a range of image processing operations. Data parallelism found in array processors has been included into the architecture of the proposed processor. The implementation of commonly used image processing algorithms and their performance evaluation are also discussed. The performance of Snake is also compared with other types of processor architectures. (author)

  3. Memory bottlenecks and memory contention in multi-core Monte Carlo transport codes

    International Nuclear Information System (INIS)

    Tramm, J.R.; Siegel, A.R.

    2013-01-01

    The simulation of whole nuclear cores through the use of Monte Carlo codes requires an impracticably long time-to-solution. We have extracted a kernel that executes only the most computationally expensive steps of the Monte Carlo particle transport algorithm - the calculation of macroscopic cross sections - in an effort to expose bottlenecks within multi-core, shared memory architectures. (authors)

  4. Design of Multi-core Fiber Patch Panel for Space Division Multiplexing Implementations

    DEFF Research Database (Denmark)

    Gonzalez, Luz E.; Morales, Alvaro; Rommel, Simon

    2018-01-01

    A multi-core fiber (MCF) patch panel was designed, allowing easy coupling of individual signals to and from a 7-core MCF. The device was characterized, measuring insertion loss and cross talk, finding highest insertion loss and lowest crosstalk at 1300 nm with values of 9.7 dB and -36.5 d...

  5. Wavelength-Dependence of Inter-Core Crosstalk in Homogeneous Multi-Core Fibers

    DEFF Research Database (Denmark)

    Ye, Feihong; Saitoh, Kunimasa; Takenaga, Katsuhiro

    2016-01-01

    The wavelength dependence of inter-core crosstalk in homogeneous multi-core fibers (MCFs) is investigated, and the corresponding analytical expressions are derived. The derived analytical expressions can be used to determine the crosstalk at any wavelength necessary for designing future MCF...

  6. Performance modeling of hybrid MPI/OpenMP scientific applications on large-scale multicore supercomputers

    KAUST Repository

    Wu, Xingfu; Taylor, Valerie

    2013-01-01

    In this paper, we present a performance modeling framework based on memory bandwidth contention time and a parameterized communication model to predict the performance of OpenMP, MPI and hybrid applications with weak scaling on three large-scale multicore supercomputers: IBM POWER4, POWER5+ and BlueGene/P, and analyze the performance of these MPI, OpenMP and hybrid applications. We use STREAM memory benchmarks and Intel's MPI benchmarks to provide initial performance analysis and model validation of MPI and OpenMP applications on these multicore supercomputers because the measured sustained memory bandwidth can provide insight into the memory bandwidth that a system should sustain on scientific applications with the same amount of workload per core. In addition to using these benchmarks, we also use a weak-scaling hybrid MPI/OpenMP large-scale scientific application: Gyrokinetic Toroidal Code (GTC) in magnetic fusion to validate our performance model of the hybrid application on these multicore supercomputers. The validation results for our performance modeling method show less than 7.77% error rate in predicting the performance of hybrid MPI/OpenMP GTC on up to 512 cores on these multicore supercomputers. © 2013 Elsevier Inc.

  7. Design of multi-core fiber patch panel for space division multiplexing implementations

    NARCIS (Netherlands)

    González, Luz E.; Morales, Alvaro; Rommel, Simon; Jørgensen, Bo F.; Porras-Montenegro, N.; Tafur Monroy, Idelfonso

    2018-01-01

    A multi-core fiber (MCF) patch panel was designed, allowing easy coupling of individual signals to and from a 7-core MCF. The device was characterized, measuring insertion loss and cross talk, finding highest insertion loss and lowest crosstalk at 1300 nm with values of 9.7 dB and -36.5 dB

  8. 32-core Inline Multicore Fiber Amplifier for Dense Space Division Multiplexed Transmission Systems

    DEFF Research Database (Denmark)

    Jain, S.; Mizuno, T.; Jung, Y.

    2016-01-01

    We present a high-core-count SDM amplifier, i.e. 32-core multicore-fiber amplifier, in a cladding-pumped configuration. An average gain of 17dB and NF of 7dB is obtained for -5dBm input signal power in the wavelength range 1544nm-1564nm....

  9. A model for the design and programming of multi-cores

    NARCIS (Netherlands)

    Jesshope, C.; Grandinetti, L.

    2008-01-01

    This paper describes a machine/programming model for the era of multi-core chips. It is derived from the sequential model but replaces sequential composition with concurrent composition at all levels in the program except at the level where the compiler is able to make deterministic decisions on

  10. Design of homogeneous trench-assisted multi-core fibers based on analytical model

    DEFF Research Database (Denmark)

    Ye, Feihong; Tu, Jiajing; Saitoh, Kunimasa

    2016-01-01

    We present a design method of homogeneous trench-assisted multicore fibers (TA-MCFs) based on an analytical model utilizing an analytical expression for the mode coupling coefficient between two adjacent cores. The analytical model can also be used for crosstalk (XT) properties analysis, such as ...

  11. MulticoreBSP for C : A high-performance library for shared-memory parallel programming

    NARCIS (Netherlands)

    Yzelman, A. N.; Bisseling, R. H.; Roose, D.; Meerbergen, K.

    2014-01-01

    The bulk synchronous parallel (BSP) model, as well as parallel programming interfaces based on BSP, classically target distributed-memory parallel architectures. In earlier work, Yzelman and Bisseling designed a MulticoreBSP for Java library specifically for shared-memory architectures. In the

  12. Ultrathin endoscopes based on multicore fibers and adaptive optics: a status review and perspectives.

    Science.gov (United States)

    Andresen, Esben Ravn; Sivankutty, Siddharth; Tsvirkun, Viktor; Bouwmans, Géraud; Rigneault, Hervé

    2016-12-01

    We take stock of the progress that has been made into developing ultrathin endoscopes assisted by wave front shaping. We focus our review on multicore fiber-based lensless endoscopes intended for multiphoton imaging applications. We put the work into perspective by comparing with alternative approaches and by outlining the challenges that lie ahead.

  13. Performance modeling of hybrid MPI/OpenMP scientific applications on large-scale multicore supercomputers

    KAUST Repository

    Wu, Xingfu

    2013-12-01

    In this paper, we present a performance modeling framework based on memory bandwidth contention time and a parameterized communication model to predict the performance of OpenMP, MPI and hybrid applications with weak scaling on three large-scale multicore supercomputers: IBM POWER4, POWER5+ and BlueGene/P, and analyze the performance of these MPI, OpenMP and hybrid applications. We use STREAM memory benchmarks and Intel\\'s MPI benchmarks to provide initial performance analysis and model validation of MPI and OpenMP applications on these multicore supercomputers because the measured sustained memory bandwidth can provide insight into the memory bandwidth that a system should sustain on scientific applications with the same amount of workload per core. In addition to using these benchmarks, we also use a weak-scaling hybrid MPI/OpenMP large-scale scientific application: Gyrokinetic Toroidal Code (GTC) in magnetic fusion to validate our performance model of the hybrid application on these multicore supercomputers. The validation results for our performance modeling method show less than 7.77% error rate in predicting the performance of hybrid MPI/OpenMP GTC on up to 512 cores on these multicore supercomputers. © 2013 Elsevier Inc.

  14. GTfold: Enabling parallel RNA secondary structure prediction on multi-core desktops

    DEFF Research Database (Denmark)

    Swenson, M Shel; Anderson, Joshua; Ash, Andrew

    2012-01-01

    achieved significant improvements in runtime, but their implementations were not portable from niche high-performance computers or easily accessible to most RNA researchers. With the increasing prevalence of multi-core desktop machines, a new parallel prediction program is needed to take full advantage...

  15. High-Spatial-Multiplicity Multicore Fibers for Future Dense Space-Division-Multiplexing Systems

    DEFF Research Database (Denmark)

    Matsuo, Shoichiro; Takenaga, Katsuhiro; Sasaki, Yusuke

    2016-01-01

    Multicore fibers and few-mode fibers have potential application in realizing dense-space-division multiplexing systems. However, there are some tradeoff requirements for designing the fibers. In this paper, the tradeoff requirements such as spatial channel count, crosstalk, differential mode dela...

  16. PERFORMANCE EVALUATION OF OR1200 PROCESSOR WITH EVOLUTIONARY PARALLEL HPRC USING GEP

    Directory of Open Access Journals (Sweden)

    R. Maheswari

    2012-04-01

    Full Text Available In this fast computing era, most of the embedded system requires more computing power to complete the complex function/ task at the lesser amount of time. One way to achieve this is by boosting up the processor performance which allows processor core to run faster. This paper presents a novel technique of increasing the performance by parallel HPRC (High Performance Reconfigurable Computing in the CPU/DSP (Digital Signal Processor unit of OR1200 (Open Reduced Instruction Set Computer (RISC 1200 using Gene Expression Programming (GEP an evolutionary programming model. OR1200 is a soft-core RISC processor of the Intellectual Property cores that can efficiently run any modern operating system. In the manufacturing process of OR1200 a parallel HPRC is placed internally in the Integer Execution Pipeline unit of the CPU/DSP core to increase the performance. The GEP Parallel HPRC is activated /deactivated by triggering the signals i HPRC_Gene_Start ii HPRC_Gene_End. A Verilog HDL(Hardware Description language functional code for Gene Expression Programming parallel HPRC is developed and synthesised using XILINX ISE in the former part of the work and a CoreMark processor core benchmark is used to test the performance of the OR1200 soft core in the later part of the work. The result of the implementation ensures the overall speed-up increased to 20.59% by GEP based parallel HPRC in the execution unit of OR1200.

  17. Realization Of Algebraic Processor For XML Documents Processing

    International Nuclear Information System (INIS)

    Georgiev, Bozhidar; Georgieva, Adriana

    2010-01-01

    In this paper, are presented some possibilities concerning the implementation of an algebraic method for XML hierarchical data processing which makes faster the XML search mechanism. Here is offered a different point of view for creation of advanced algebraic processor (with all necessary software tools and programming modules respectively). Therefore, this nontraditional approach for fast XML navigation with the presented algebraic processor may help to build an easier user-friendly interface provided XML transformations, which can avoid the difficulties in the complicated language constructions of XSL, XSLT and XPath. This approach allows comparatively simple search of XML hierarchical data by means of the following types of functions: specification functions and so named build-in functions. The choice of programming language Java may appear strange at first, but it isn't when you consider that the applications can run on different kinds of computers. The specific search mechanism based on the linear algebra theory is faster in comparison with MSXML parsers (on the basis of the developed examples with about 30%). Actually, there exists the possibility for creating new software tools based on the linear algebra theory, which cover the whole navigation and search techniques characterizing XSLT/XPath. The proposed method is able to replace more complicated operations in other SOA components.

  18. Point and track-finding processors for multiwire chambers

    CERN Document Server

    Hansroul, M

    1973-01-01

    The hardware processors described below are designed to be used in conjunction with multi-wire chambers. They have the characteristic of being based on computational methods in contrast to analogue procedures. In a sense, they are hardware implementations of computer programs. But, being specially designed for their purpose, they are free of the restrictions imposed by the architecture of the computer on which the equivalent program is to run. The parallelism inherent in the algorithms can thus be fully exploited. Combined with the use of fast access scratch-pad memories and the non-sequential nature of the control program, the parallelism accounts for the fact that these processors are expected to execute 2-3 orders of magnitude faster than the equivalent Fortran programs on a CDC 7600 or 6600. As a consequence, methods which are simple and straightforward, but which are impractical because they require an exorbitant amount of computer time can on the contrary be very attractive for hardware implementation. ...

  19. Computing effective properties of random heterogeneous materials on heterogeneous parallel processors

    Science.gov (United States)

    Leidi, Tiziano; Scocchi, Giulio; Grossi, Loris; Pusterla, Simone; D'Angelo, Claudio; Thiran, Jean-Philippe; Ortona, Alberto

    2012-11-01

    In recent decades, finite element (FE) techniques have been extensively used for predicting effective properties of random heterogeneous materials. In the case of very complex microstructures, the choice of numerical methods for the solution of this problem can offer some advantages over classical analytical approaches, and it allows the use of digital images obtained from real material samples (e.g., using computed tomography). On the other hand, having a large number of elements is often necessary for properly describing complex microstructures, ultimately leading to extremely time-consuming computations and high memory requirements. With the final objective of reducing these limitations, we improved an existing freely available FE code for the computation of effective conductivity (electrical and thermal) of microstructure digital models. To allow execution on hardware combining multi-core CPUs and a GPU, we first translated the original algorithm from Fortran to C, and we subdivided it into software components. Then, we enhanced the C version of the algorithm for parallel processing with heterogeneous processors. With the goal of maximizing the obtained performances and limiting resource consumption, we utilized a software architecture based on stream processing, event-driven scheduling, and dynamic load balancing. The parallel processing version of the algorithm has been validated using a simple microstructure consisting of a single sphere located at the centre of a cubic box, yielding consistent results. Finally, the code was used for the calculation of the effective thermal conductivity of a digital model of a real sample (a ceramic foam obtained using X-ray computed tomography). On a computer equipped with dual hexa-core Intel Xeon X5670 processors and an NVIDIA Tesla C2050, the parallel application version features near to linear speed-up progression when using only the CPU cores. It executes more than 20 times faster when additionally using the GPU.

  20. Parallel Implementation of the Discrete Green's Function Formulation of the FDTD Method on a Multicore Central Processing Unit

    Directory of Open Access Journals (Sweden)

    T. Stefański

    2014-12-01

    Full Text Available Parallel implementation of the discrete Green's function formulation of the finite-difference time-domain (DGF-FDTD method was developed on a multicore central processing unit. DGF-FDTD avoids computations of the electromagnetic field in free-space cells and does not require domain termination by absorbing boundary conditions. Computed DGF-FDTD solutions are compatible with the FDTD grid enabling the perfect hybridization of FDTD with the use of time-domain integral equation methods. The developed implementation can be applied to simulations of antenna characteristics. For the sake of example, arrays of Yagi-Uda antennas were simulated with the use of parallel DGF-FDTD. The efficiency of parallel computations was investigated as a function of the number of current elements in the FDTD grid. Although the developed method does not apply the fast Fourier transform for convolution computations, advantages stemming from the application of DGF-FDTD instead of FDTD can be demonstrated for one-dimensional wire antennas when simulation results are post-processed by the near-to-far-field transformation.

  1. Magnetophoretic separation ICP-MS immunoassay using Cs-doped multicore magnetic nanoparticles for the determination of salmonella typhimurium.

    Science.gov (United States)

    Jeong, Arong; Lim, H B

    2018-02-01

    In this work, a magnetophoretic separation ICP-MS immunoassay using newly synthesized multicore magnetic nanoparticles (MMNPs) was developed for the determination of salmonella typhimurium (typhi). The uniqueness of this method was the use of MMNPs doped with Cs for both separation and detection, which enable us to achieve fast analysis, high sensitivity, and good reliability. For demonstration, heat-killed typhi in a phosphate buffer solution was determined by ICP-MS after the MMNP-typhi reaction product was separated from unreacted MMNPs in a micropipette tip filled with 25% polyethylene glycol through magnetophoretic separation. The calibration curve obtained by plotting 133 Cs intensity vs. the number of synthetic standard, showed a coefficient of determination (R 2 ) of 0.94 with a limit of detection (LOD) of 102 cells/mL without cell culturing. Excellent recoveries, between 98-100%, were obtained from four replicates and compared with a sandwich-type ICP-MS immunoassay for further confirmation. Copyright © 2017 Elsevier B.V. All rights reserved.

  2. Control structures for high speed processors

    Science.gov (United States)

    Maki, G. K.; Mankin, R.; Owsley, P. A.; Kim, G. M.

    1982-01-01

    A special processor was designed to function as a Reed Solomon decoder with throughput data rate in the Mhz range. This data rate is significantly greater than is possible with conventional digital architectures. To achieve this rate, the processor design includes sequential, pipelined, distributed, and parallel processing. The processor was designed using a high level language register transfer language. The RTL can be used to describe how the different processes are implemented by the hardware. One problem of special interest was the development of dependent processes which are analogous to software subroutines. For greater flexibility, the RTL control structure was implemented in ROM. The special purpose hardware required approximately 1000 SSI and MSI components. The data rate throughput is 2.5 megabits/second. This data rate is achieved through the use of pipelined and distributed processing. This data rate can be compared with 800 kilobits/second in a recently proposed very large scale integration design of a Reed Solomon encoder.

  3. Embedded processor extensions for image processing

    Science.gov (United States)

    Thevenin, Mathieu; Paindavoine, Michel; Letellier, Laurent; Heyrman, Barthélémy

    2008-04-01

    The advent of camera phones marks a new phase in embedded camera sales. By late 2009, the total number of camera phones will exceed that of both conventional and digital cameras shipped since the invention of photography. Use in mobile phones of applications like visiophony, matrix code readers and biometrics requires a high degree of component flexibility that image processors (IPs) have not, to date, been able to provide. For all these reasons, programmable processor solutions have become essential. This paper presents several techniques geared to speeding up image processors. It demonstrates that a gain of twice is possible for the complete image acquisition chain and the enhancement pipeline downstream of the video sensor. Such results confirm the potential of these computing systems for supporting future applications.

  4. Development methods for VLSI-processors

    International Nuclear Information System (INIS)

    Horninger, K.; Sandweg, G.

    1982-01-01

    The aim of this project, which was originally planed for 3 years, was the development of modern system and circuit concepts, for VLSI-processors having a 32 bit wide data path. The result of this first years work is the concept of a general purpose processor. This processor is not only logically but also physically (on the chip) divided into four functional units: a microprogrammable instruction unit, an execution unit in slice technique, a fully associative cache memory and an I/O unit. For the ALU of the execution unit circuits in PLA and slice techniques have been realized. On the basis of regularity, area consumption and achievable performance the slice technique has been prefered. The designs utilize selftesting circuitry. (orig.) [de

  5. High performance in silico virtual drug screening on many-core processors

    Science.gov (United States)

    Price, James; Sessions, Richard B; Ibarra, Amaurys A

    2015-01-01

    Drug screening is an important part of the drug development pipeline for the pharmaceutical industry. Traditional, lab-based methods are increasingly being augmented with computational methods, ranging from simple molecular similarity searches through more complex pharmacophore matching to more computationally intensive approaches, such as molecular docking. The latter simulates the binding of drug molecules to their targets, typically protein molecules. In this work, we describe BUDE, the Bristol University Docking Engine, which has been ported to the OpenCL industry standard parallel programming language in order to exploit the performance of modern many-core processors. Our highly optimized OpenCL implementation of BUDE sustains 1.43 TFLOP/s on a single Nvidia GTX 680 GPU, or 46% of peak performance. BUDE also exploits OpenCL to deliver effective performance portability across a broad spectrum of different computer architectures from different vendors, including GPUs from Nvidia and AMD, Intel’s Xeon Phi and multi-core CPUs with SIMD instruction sets. PMID:25972727

  6. High performance in silico virtual drug screening on many-core processors.

    Science.gov (United States)

    McIntosh-Smith, Simon; Price, James; Sessions, Richard B; Ibarra, Amaurys A

    2015-05-01

    Drug screening is an important part of the drug development pipeline for the pharmaceutical industry. Traditional, lab-based methods are increasingly being augmented with computational methods, ranging from simple molecular similarity searches through more complex pharmacophore matching to more computationally intensive approaches, such as molecular docking. The latter simulates the binding of drug molecules to their targets, typically protein molecules. In this work, we describe BUDE, the Bristol University Docking Engine, which has been ported to the OpenCL industry standard parallel programming language in order to exploit the performance of modern many-core processors. Our highly optimized OpenCL implementation of BUDE sustains 1.43 TFLOP/s on a single Nvidia GTX 680 GPU, or 46% of peak performance. BUDE also exploits OpenCL to deliver effective performance portability across a broad spectrum of different computer architectures from different vendors, including GPUs from Nvidia and AMD, Intel's Xeon Phi and multi-core CPUs with SIMD instruction sets.

  7. The Associative Memory system for the FTK processor at ATLAS

    CERN Document Server

    Cipriani, R; The ATLAS collaboration; Donati, S; Giannetti, P; Lanza, A; Luciano, P; Magalotti, D; Piendibene, M

    2013-01-01

    Modern experiments search for extremely rare processes hidden in much larger background levels. As the experiment complexity, the accelerator backgrounds and luminosity increase we need increasingly complex and exclusive selections. We present results and performances of a new prototype of Associative Memory system, the core of the Fast Tracker processor (FTK). FTK is a real time tracking device for the Atlas experiment trigger upgrade. The AM system provides massive computing power to minimize the online execution time of complex tracking algorithms. The time consuming pattern recognition problem, generally referred to as the “combinatorial challenge”, is beat by the Associative Memory (AM) technology exploiting parallelism to the maximum level: it compares the event to pre-calculated “expectations” or “patterns” (pattern matching) at once looking for candidate tracks called “roads”. The problem is solved by the time data are loaded into the AM devices. We report on the tests of the integrate...

  8. The Associative Memory system for the FTK processor at ATLAS

    CERN Document Server

    Cipriani, R; The ATLAS collaboration; Donati, S; Giannetti, P; Lanza, A; Luciano, P; Magalotti, D; Piendibene, M

    2014-01-01

    Modern experiments search for extremely rare processes hidden in much larger background levels. As the experiment complexity, the accelerator backgrounds and luminosity increase we need increasingly complex and exclusive selections. We present results and performances of a new prototype of Associative Memory system, the core of the Fast Tracker processor (FTK). FTK is a real time tracking device for the Atlas experiment trigger upgrade. The AM system provides massive computing power to minimize the online execution time of complex tracking algorithms. The time consuming pattern recognition problem, generally referred to as the “combinatorial challenge”, is beat by the Associative Memory (AM) technology exploiting parallelism to the maximum level: it compares the event to pre-calculated “expectations” or “patterns” (pattern matching) at once looking for candidate tracks called “roads”. The problem is solved by the time data are loaded into the AM devices. We report on the tests of the integrate...

  9. The Associative Memory system for the FTK processor at ATLAS

    CERN Document Server

    Cipriani, R; The ATLAS collaboration; Donati, S; Giannetti, P; Lanza, A; Luciano, P; Magalotti, D; Piendibene, M

    2013-01-01

    Experiments at the LHC hadron collider search for extremely rare processes hidden in much larger background levels. As the experiment complexity, the accelerator backgrounds and instantaneus luminosity increase, increasingly complex and exclusive selections are necessary. We present results and performances of a new prototype of Associative Memory (AM) system, the core of the Fast Tracker processor (FTK). FTK is a real time tracking device for the ATLAS experiment trigger upgrade. The AM system provides massive computing power to minimize the online execution time of complex tracking algorithms. The time consuming pattern recognition problem, generally referred to as the "combinatorial challenge", is beat by the AM technology exploiting parallelism to the maximum level. The Associative Memory compares the event to pre-calculated "expectations" or "patterns" (pattern matching) at once and look for candidate tracks called "roads". The problem is solved by the time data are loaded into the AM devices. We report ...

  10. Software-defined reconfigurable microwave photonics processor.

    Science.gov (United States)

    Pérez, Daniel; Gasulla, Ivana; Capmany, José

    2015-06-01

    We propose, for the first time to our knowledge, a software-defined reconfigurable microwave photonics signal processor architecture that can be integrated on a chip and is capable of performing all the main functionalities by suitable programming of its control signals. The basic configuration is presented and a thorough end-to-end design model derived that accounts for the performance of the overall processor taking into consideration the impact and interdependencies of both its photonic and RF parts. We demonstrate the model versatility by applying it to several relevant application examples.

  11. Time Manager Software for a Flight Processor

    Science.gov (United States)

    Zoerne, Roger

    2012-01-01

    Data analysis is a process of inspecting, cleaning, transforming, and modeling data to highlight useful information and suggest conclusions. Accurate timestamps and a timeline of vehicle events are needed to analyze flight data. By moving the timekeeping to the flight processor, there is no longer a need for a redundant time source. If each flight processor is initially synchronized to GPS, they can freewheel and maintain a fairly accurate time throughout the flight with no additional GPS time messages received. How ever, additional GPS time messages will ensure an even greater accuracy. When a timestamp is required, a gettime function is called that immediately reads the time-base register.

  12. Comparison of Processor Performance of SPECint2006 Benchmarks of some Intel Xeon Processors

    Directory of Open Access Journals (Sweden)

    Abdul Kareem PARCHUR

    2012-08-01

    Full Text Available High performance is a critical requirement to all microprocessors manufacturers. The present paper describes the comparison of performance in two main Intel Xeon series processors (Type A: Intel Xeon X5260, X5460, E5450 and L5320 and Type B: Intel Xeon X5140, 5130, 5120 and E5310. The microarchitecture of these processors is implemented using the basis of a new family of processors from Intel starting with the Pentium 4 processor. These processors can provide a performance boost for many key application areas in modern generation. The scaling of performance in two major series of Intel Xeon processors (Type A: Intel Xeon X5260, X5460, E5450 and L5320 and Type B: Intel Xeon X5140, 5130, 5120 and E5310 has been analyzed using the performance numbers of 12 CPU2006 integer benchmarks, performance numbers that exhibit significant differences in performance. The results and analysis can be used by performance engineers, scientists and developers to better understand the performance scaling in modern generation processors.

  13. Simulation of a parallel processor on a serial processor: The neutron diffusion equation

    International Nuclear Information System (INIS)

    Honeck, H.C.

    1981-01-01

    Parallel processors could provide the nuclear industry with very high computing power at a very moderate cost. Will we be able to make effective use of this power. This paper explores the use of a very simple parallel processor for solving the neutron diffusion equation to predict power distributions in a nuclear reactor. We first describe a simple parallel processor and estimate its theoretical performance based on the current hardware technology. Next, we show how the parallel processor could be used to solve the neutron diffusion equation. We then present the results of some simulations of a parallel processor run on a serial processor and measure some of the expected inefficiencies. Finally we extrapolate the results to estimate how actual design codes would perform. We find that the standard numerical methods for solving the neutron diffusion equation are still applicable when used on a parallel processor. However, some simple modifications to these methods will be necessary if we are to achieve the full power of these new computers. (orig.) [de

  14. Performance enhancement of multi-core fiber transmission using real-time FPGA based pre-emphasis

    NARCIS (Netherlands)

    Hasanuzzaman, G. K.M.; Spolitis, S.; Salgals, T.; Braunfelds, J.; Morales, A.; Gonzalez, L. E.; Rommel, S.; Puerta, R.; Asensio, P.; Bobrovs, V.; Iezekiel, S.; Tafur Monroy, I.

    2017-01-01

    We experimentally demonstrate pre-emphasis based performance for a 2 km long 7-core multicore fiber link. Simultaneous transmission below the FEC threshold is achievable for all cores by using signal equalization in a FPGA.

  15. Noise limitations in optical linear algebra processors.

    Science.gov (United States)

    Batsell, S G; Jong, T L; Walkup, J F; Krile, T F

    1990-05-10

    A general statistical noise model is presented for optical linear algebra processors. A statistical analysis which includes device noise, the multiplication process, and the addition operation is undertaken. We focus on those processes which are architecturally independent. Finally, experimental results which verify the analytical predictions are also presented.

  16. Cassava processors' awareness of occupational and environmental ...

    African Journals Online (AJOL)

    A larger percentage (74.5%) of the respondents indicated that the Agricultural Development Programme (ADP) is their source of information. The result also showed that processor's awareness of occupational hazards associated with the different stages of cassava processing vary because their involvement in these stages

  17. A high-speed analog neural processor

    NARCIS (Netherlands)

    Masa, P.; Masa, Peter; Hoen, Klaas; Hoen, Klaas; Wallinga, Hans

    1994-01-01

    Targeted at high-energy physics research applications, our special-purpose analog neural processor can classify up to 70 dimensional vectors within 50 nanoseconds. The decision-making process of the implemented feedforward neural network enables this type of computation to tolerate weight

  18. Beeldverwerking met de Micron Automatic Processor

    OpenAIRE

    Goyens, Frank

    2017-01-01

    Deze thesis is een onderzoek naar toepassingen binnen beeldverwerking op de Micron Automata Processor hardware. De hardware wordt vergeleken met populaire hedendaagse hardware. Ook bevat dit onderzoek nuttige informatie en strategieën voor het ontwikkelen van nieuwe toepassingen. Bevindingen in dit onderzoek omvatten proof of concept algoritmes en een praktische toepassing.

  19. 7 CFR 1215.14 - Processor.

    Science.gov (United States)

    2010-01-01

    ... 7 Agriculture 10 2010-01-01 2010-01-01 false Processor. 1215.14 Section 1215.14 Agriculture Regulations of the Department of Agriculture (Continued) AGRICULTURAL MARKETING SERVICE (MARKETING AGREEMENTS... CONSUMER INFORMATION Popcorn Promotion, Research, and Consumer Information Order Definitions § 1215.14...

  20. Simplifying cochlear implant speech processor fitting

    NARCIS (Netherlands)

    Willeboer, C.

    2008-01-01

    Conventional fittings of the speech processor of a cochlear implant (CI) rely to a large extent on the implant recipient's subjective responses. For each of the 22 intracochlear electrodes the recipient has to indicate the threshold level (T-level) and comfortable loudness level (C-level) while

  1. Vector and parallel processors in computational science

    International Nuclear Information System (INIS)

    Duff, I.S.; Reid, J.K.

    1985-01-01

    This book presents the papers given at a conference which reviewed the new developments in parallel and vector processing. Topics considered at the conference included hardware (array processors, supercomputers), programming languages, software aids, numerical methods (e.g., Monte Carlo algorithms, iterative methods, finite elements, optimization), and applications (e.g., neutron transport theory, meteorology, image processing)

  2. Space Station Water Processor Process Pump

    Science.gov (United States)

    Parker, David

    1995-01-01

    This report presents the results of the development program conducted under contract NAS8-38250-12 related to the International Space Station (ISS) Water Processor (WP) Process Pump. The results of the Process Pumps evaluation conducted on this program indicates that further development is required in order to achieve the performance and life requirements for the ISSWP.

  3. Interleaved Subtask Scheduling on Multi Processor SOC

    NARCIS (Netherlands)

    Zhe, M.

    2006-01-01

    The ever-progressing semiconductor processing technique has integrated more and more embedded processors on a single system-on-achip (SoC). With such powerful SoC platforms, and also due to the stringent time-to-market deadlines, many functionalities which used to be implemented in ASICs are

  4. User manual Dieka PreProcessor

    NARCIS (Netherlands)

    Valkering, Kasper

    2000-01-01

    This is the user manual belonging to the Dieka-PreProcessor. This application was written by Wenhua Cao and revised and expanded by Kasper Valkering. The aim of this preproccesor is to be able to draw and mesh extrusion dies in ProEngineer, and do the FE-calculation in Dieka. The preprocessor makes

  5. Globe hosts launch of new processor

    CERN Multimedia

    2006-01-01

    Launch of the quadecore processor chip at the Globe. On 14 November, in a series of major media events around the world, the chip-maker Intel launched its new 'quadcore' processor. For the regions of Europe, the Middle East and Africa, the day-long launch event took place in CERN's Globe of Science and Innovation, with over 30 journalists in attendance, coming from as far away as Johannesburg and Dubai. CERN was a significant choice for the event: the first tests of this new generation of processor in Europe had been made at CERN over the preceding months, as part of CERN openlab, a research partnership with leading IT companies such as Intel, HP and Oracle. The event also provided the opportunity for the journalists to visit ATLAS and the CERN Computer Centre. The strategy of putting multiple processor cores on the same chip, which has been pursued by Intel and other chip-makers in the last few years, represents an important departure from the more traditional improvements in the sheer speed of such chips. ...

  6. Event analysis using a massively parallel processor

    International Nuclear Information System (INIS)

    Bale, A.; Gerelle, E.; Messersmith, J.; Warren, R.; Hoek, J.

    1990-01-01

    This paper describes a system for performing histogramming of n-tuple data at interactive rates using a commercial SIMD processor array connected to a work-station running the well-known Physics Analysis Workstation software (PAW). Results indicate that an order of magnitude performance improvement over current RISC technology is easily achievable

  7. Theoretical Investigation of Inter-core Crosstalk Properties in Homogeneous Trench-Assisted Multi-Core Fibers

    DEFF Research Database (Denmark)

    Ye, Feihong; Morioka, Toshio; Tu, Jiajing

    2014-01-01

    We derive analytical expressions for inter-core crosstalk, its dependence on core pitch and wavelength in homogeneous trench-assisted multi-core fibers. They are in excellent agreement with numerical simulation results.......We derive analytical expressions for inter-core crosstalk, its dependence on core pitch and wavelength in homogeneous trench-assisted multi-core fibers. They are in excellent agreement with numerical simulation results....

  8. Automated and Assistive Tools for Accelerated Code migration of Scientific Computing on to Heterogeneous MultiCore Systems

    Science.gov (United States)

    2017-04-13

    AFRL-AFOSR-UK-TR-2017-0029 Automated and Assistive Tools for Accelerated Code migration of Scientific Computing on to Heterogeneous MultiCore Systems ...2012, “ Automated and Assistive Tools for Accelerated Code migration of Scientific Computing on to Heterogeneous MultiCore Systems .” 2. The objective...2012 - 01/25/2015 4. TITLE AND SUBTITLE Automated and Assistive Tools for Accelerated Code migration of Scientific Computing on to Heterogeneous

  9. Performance enhancement of multi-core fiber transmission using real-time FPGA based pre-emphasis

    DEFF Research Database (Denmark)

    Hasanuzzaman, G. K.M.; Spolitis, Sandis; Salgals, T.

    2017-01-01

    We experimentally demonstrate pre-emphasis based performance for a 2 km long 7-core multicore fiber link. Simultaneous transmission below the FEC threshold is achievable for all cores by using signal equalization in a FPGA.......We experimentally demonstrate pre-emphasis based performance for a 2 km long 7-core multicore fiber link. Simultaneous transmission below the FEC threshold is achievable for all cores by using signal equalization in a FPGA....

  10. Data collection from FASTBUS to a DEC UNIBUS processor through the UNIBUS-Processor Interface

    International Nuclear Information System (INIS)

    Larwill, M.; Barsotti, E.; Lesny, D.; Pordes, R.

    1983-01-01

    This paper describes the use of the UNIBUS Processor Interface, an interface between FASTBUS and the Digital Equipment Corporation UNIBUS. The UPI was developed by Fermilab and the University of Illinois. Details of the use of this interface in a high energy physics experiment at Fermilab are given. The paper includes a discussion of the operation of the UPI on the UNIBUS of a VAX-11, and plans for using the UPI to perform data acquisition from FASTBUS to a VAX-11 Processor

  11. Array processors based on Gaussian fraction-free method

    Energy Technology Data Exchange (ETDEWEB)

    Peng, S; Sedukhin, S [Aizu Univ., Aizuwakamatsu, Fukushima (Japan); Sedukhin, I

    1998-03-01

    The design of algorithmic array processors for solving linear systems of equations using fraction-free Gaussian elimination method is presented. The design is based on a formal approach which constructs a family of planar array processors systematically. These array processors are synthesized and analyzed. It is shown that some array processors are optimal in the framework of linear allocation of computations and in terms of number of processing elements and computing time. (author)

  12. Reconfiguration in FPGA-Based Multi-Core Platforms for Hard Real-Time Applications

    DEFF Research Database (Denmark)

    Pezzarossa, Luca; Schoeberl, Martin; Sparsø, Jens

    2016-01-01

    -case execution-time of tasks of an application that determines the systems ability to respond in time. To support this focus, the platform must provide service guarantees for both communication and computation resources. In addition, many hard real-time applications have multiple modes of operation, and each......In general-purpose computing multi-core platforms, hardware accelerators and reconfiguration are means to improve performance; i.e., the average-case execution time of a software application. In hard real-time systems, such average-case speed-up is not in itself relevant - it is the worst...... mode has specific requirements. An interesting perspective on reconfigurable computing is to exploit run-time reconfiguration to support mode changes. In this paper we explore approaches to reconfiguration of communication and computation resources in the T-CREST hard real-time multi-core platform...

  13. Discrimination of Temperature and Strain in Brillouin Optical Time Domain Analysis Using a Multicore Optical Fiber.

    Science.gov (United States)

    Zaghloul, Mohamed A S; Wang, Mohan; Milione, Giovanni; Li, Ming-Jun; Li, Shenping; Huang, Yue-Kai; Wang, Ting; Chen, Kevin P

    2018-04-12

    Brillouin optical time domain analysis is the sensing of temperature and strain changes along an optical fiber by measuring the frequency shift changes of Brillouin backscattering. Because frequency shift changes are a linear combination of temperature and strain changes, their discrimination is a challenge. Here, a multicore optical fiber that has two cores is fabricated. The differences between the cores' temperature and strain coefficients are such that temperature (strain) changes can be discriminated with error amplification factors of 4.57 °C/MHz (69.11 μ ϵ /MHz), which is 2.63 (3.67) times lower than previously demonstrated. As proof of principle, using the multicore optical fiber and a commercial Brillouin optical time domain analyzer, the temperature (strain) changes of a thermally expanding metal cylinder are discriminated with an error of 0.24% (3.7%).

  14. A study of lock-free based concurrent garbage collectors for multicore platform.

    Science.gov (United States)

    Wu, Hao; Ji, Zhen-Zhou

    2014-01-01

    Concurrent garbage collectors (CGC) have recently obtained extensive concern on multicore platform. Excellent designed CGC can improve the efficiency of runtime systems by exploring the full potential processing resources of multicore computers. Two major performance critical components for designing CGC are studied in this paper, stack scanning and heap compaction. Since the lock-based algorithms do not scale well, we present a lock-free solution for constructing a highly concurrent garbage collector. We adopt CAS/MCAS synchronization primitives to guarantee that the programs will never be blocked by the collector thread while the garbage collection process is ongoing. The evaluation results of this study demonstrate that our approach achieves competitive performance.

  15. A Study of Lock-Free Based Concurrent Garbage Collectors for Multicore Platform

    Directory of Open Access Journals (Sweden)

    Hao Wu

    2014-01-01

    Full Text Available Concurrent garbage collectors (CGC have recently obtained extensive concern on multicore platform. Excellent designed CGC can improve the efficiency of runtime systems by exploring the full potential processing resources of multicore computers. Two major performance critical components for designing CGC are studied in this paper, stack scanning and heap compaction. Since the lock-based algorithms do not scale well, we present a lock-free solution for constructing a highly concurrent garbage collector. We adopt CAS/MCAS synchronization primitives to guarantee that the programs will never be blocked by the collector thread while the garbage collection process is ongoing. The evaluation results of this study demonstrate that our approach achieves competitive performance.

  16. Power-Aware DVB-H Mobile TV System on Heterogeneous Multicore Platform

    Directory of Open Access Journals (Sweden)

    Chao Han-Chieh

    2010-01-01

    Full Text Available In mobile communication network, the mobile device integrated with TV player is a novel technology that provides TV program services to end users. As TV program is a real-time video service, it has greater technical difficulties to overcome than a traditional video file download or online streaming, especially when TV programs are played on handheld devices. A challenge is how to save power in order to provide users with longer TV program services. To address this issue, this study proposes a mobile TV system on a heterogeneous multicore platform, which utilizes a Digital Video Broadcasting-Handheld (DVB-H wireless network to receive the TV program signal, thus, saving power according to the features of DVB-H TV signal and heterogeneous multi-core.

  17. Time-Predictable Communication on a Time-Division Multiplexing Network-on-Chip Multicore

    DEFF Research Database (Denmark)

    Sørensen, Rasmus Bo

    This thesis presents time-predictable inter-core communication on a multicore platform with a time-division multiplexing (TDM) network-on-chip (NoC) for hard real-time systems. The thesis is structured as a collection of papers that contribute within the areas of: reconfigurable TDM NoCs, static...... TDM scheduling, and time-predictable inter-core communication. More specifically, the work presented in this thesis investigates the interaction between hardware and software involved in time-predictable inter-core communication on the multicore platform. The thesis presents: a new generation...... of the Argo NoC network interface (NI) that supports instantaneous reconfiguration, a TDM traffic scheduler that generates virtual circuit (VC) configurations for the Argo NoC, and software functions for two types of intercore communication. The new generation of the Argo NoC adds the capability...

  18. Discrimination of Temperature and Strain in Brillouin Optical Time Domain Analysis Using a Multicore Optical Fiber

    Directory of Open Access Journals (Sweden)

    Mohamed A. S. Zaghloul

    2018-04-01

    Full Text Available Brillouin optical time domain analysis is the sensing of temperature and strain changes along an optical fiber by measuring the frequency shift changes of Brillouin backscattering. Because frequency shift changes are a linear combination of temperature and strain changes, their discrimination is a challenge. Here, a multicore optical fiber that has two cores is fabricated. The differences between the cores’ temperature and strain coefficients are such that temperature (strain changes can be discriminated with error amplification factors of 4.57 °C/MHz (69.11 μ ϵ /MHz, which is 2.63 (3.67 times lower than previously demonstrated. As proof of principle, using the multicore optical fiber and a commercial Brillouin optical time domain analyzer, the temperature (strain changes of a thermally expanding metal cylinder are discriminated with an error of 0.24% (3.7%.

  19. Asymmetric flow field-flow fractionation of superferrimagnetic iron oxide multicore nanoparticles

    DEFF Research Database (Denmark)

    Dutz, Silvio; Kuntsche, Judith; Eberbeck, Dietmar

    2012-01-01

    Magnetic nanoparticles are very useful for various medical applications where each application requires particles with specific magnetic properties. In this paper we describe the modification of the magnetic properties of magnetic multicore nanoparticles (MCNPs) by size dependent fractionation....... The hysteresis curves were measured by vibrating sample magnetometry. Starting from a coercivity of 1.41 kA m(-1) for the original MCNPs the coercivity of the particles in the different fractions varied from 0.41 to 3.83 kA m(-1). In our paper it is shown for the first time that fractions obtained from a broad...... size distributed MCNP fluid classified by AF4 show a strong correlation between hydrodynamic diameter and magnetic properties. Thus we state that AF4 is a suitable technology for reproducible size dependent classification of magnetic multicore nanoparticles suspended as ferrofluids....

  20. Simple analytical expression for crosstalk estimation in homogeneous trench-assisted multi-core fibers

    DEFF Research Database (Denmark)

    Ye, Feihong; Tu, Jiajing; Saitoh, Kunimasa

    2014-01-01

    An analytical expression for the mode coupling coe cient in homogeneous trench-assisted multi-core fibers is derived, which has a sim- ple relationship with the one in normal step-index structures. The amount of inter-core crosstalk reduction (in dB) with trench-assisted structures compared...... to the one with normal step-index structures can then be written by a simple expression. Comparison with numerical simulations confirms that the obtained analytical expression has very good accuracy for crosstalk estimation. The crosstalk properties in trench-assisted multi-core fibers, such as crosstalk...... dependence on core pitch and wavelength-dependent crosstalk, can be obtained by this simple analytical expression....

  1. Lipsi: Probably the Smallest Processor in the World

    DEFF Research Database (Denmark)

    Schoeberl, Martin

    2018-01-01

    While research on high-performance processors is important, it is also interesting to explore processor architectures at the other end of the spectrum: tiny processor cores for auxiliary functions. While it is common to implement small circuits for such functions, such as a serial port, in dedica...... at a minimal cost....

  2. Overview of the ATLAS Fast Tracker Project

    CERN Document Server

    AUTHOR|(INSPIRE)INSPIRE-00025195; The ATLAS collaboration

    2016-01-01

    The next LHC runs, with a significant increase in instantaneous luminosity, will provide a big challenge for the trigger and data acquisition systems of all the experiments. An intensive use of the tracking information at the trigger level will be important to keep high efficiency for interesting events despite the increase in multiple collisions per bunch crossing. In order to increase the use of tracks within the High Level Trigger, the ATLAS experiment planned the installation of a hardware processor dedicated to tracking: the Fast TracKer processor. The Fast Tracker is designed to perform full scan track reconstruction of every event accepted by the ATLAS first level hardware trigger. To achieve this goal the system uses a parallel architecture, with algorithms designed to exploit the computing power of custom Associative Memory chips, and modern field programmable gate arrays. The processor will provide computing power to reconstruct tracks with transverse momentum greater than 1 GeV in the whole trackin...

  3. Massively Parallel Sort-Merge Joins in Main Memory Multi-Core Database Systems

    OpenAIRE

    Martina-Cezara Albutiu, Alfons Kemper, Thomas Neumann

    2012-01-01

    Two emerging hardware trends will dominate the database system technology in the near future: increasing main memory capacities of several TB per server and massively parallel multi-core processing. Many algorithmic and control techniques in current database technology were devised for disk-based systems where I/O dominated the performance. In this work we take a new look at the well-known sort-merge join which, so far, has not been in the focus of research ...

  4. Spectral efficiency in crosstalk-impaired multi-core fiber links

    Science.gov (United States)

    Luís, Ruben S.; Puttnam, Benjamin J.; Rademacher, Georg; Klaus, Werner; Agrell, Erik; Awaji, Yoshinari; Wada, Naoya

    2018-02-01

    We review the latest advances on ultra-high throughput transmission using crosstalk-limited single-mode multicore fibers and compare these with the theoretical spectral efficiency of such systems. We relate the crosstalkimposed spectral efficiency limits with fiber parameters, such as core diameter, core pitch, and trench design. Furthermore, we investigate the potential of techniques such as direction interleaving and high-order MIMO to improve the throughput or reach of these systems when using various modulation formats.

  5. SkyAlign: a portable, work-efficient skyline algorithm for multicore and GPU architectures

    DEFF Research Database (Denmark)

    Bøgh, Kenneth Sejdenfaden; Chester, Sean; Assent, Ira

    2016-01-01

    The skyline operator determines points in a multidimensional dataset that offer some optimal trade-off. State-of-the-art CPU skyline algorithms exploit quad-tree partitioning with complex branching to minimise the number of point-to-point comparisons. Branch-phobic GPU skyline algorithms rely on ...... native multicore state of the art on challenging workloads by an increasing margin as more cores and sockets are utilised....

  6. Optimization of a Lattice Boltzmann Computation on State-of-the-Art Multicore Platforms

    Energy Technology Data Exchange (ETDEWEB)

    Williams, Samuel; Carter, Jonathan; Oliker, Leonid; Shalf, John; Yelick, Katherine

    2009-04-10

    We present an auto-tuning approach to optimize application performance on emerging multicore architectures. The methodology extends the idea of search-based performance optimizations, popular in linear algebra and FFT libraries, to application-specific computational kernels. Our work applies this strategy to a lattice Boltzmann application (LBMHD) that historically has made poor use of scalar microprocessors due to its complex data structures and memory access patterns. We explore one of the broadest sets of multicore architectures in the HPC literature, including the Intel Xeon E5345 (Clovertown), AMD Opteron 2214 (Santa Rosa), AMD Opteron 2356 (Barcelona), Sun T5140 T2+ (Victoria Falls), as well as a QS20 IBM Cell Blade. Rather than hand-tuning LBMHD for each system, we develop a code generator that allows us to identify a highly optimized version for each platform, while amortizing the human programming effort. Results show that our auto-tuned LBMHD application achieves up to a 15x improvement compared with the original code at a given concurrency. Additionally, we present detailed analysis of each optimization, which reveal surprising hardware bottlenecks and software challenges for future multicore systems and applications.

  7. Multi-mode sensor processing on a dynamically reconfigurable massively parallel processor array

    Science.gov (United States)

    Chen, Paul; Butts, Mike; Budlong, Brad; Wasson, Paul

    2008-04-01

    This paper introduces a novel computing architecture that can be reconfigured in real time to adapt on demand to multi-mode sensor platforms' dynamic computational and functional requirements. This 1 teraOPS reconfigurable Massively Parallel Processor Array (MPPA) has 336 32-bit processors. The programmable 32-bit communication fabric provides streamlined inter-processor connections with deterministically high performance. Software programmability, scalability, ease of use, and fast reconfiguration time (ranging from microseconds to milliseconds) are the most significant advantages over FPGAs and DSPs. This paper introduces the MPPA architecture, its programming model, and methods of reconfigurability. An MPPA platform for reconfigurable computing is based on a structural object programming model. Objects are software programs running concurrently on hundreds of 32-bit RISC processors and memories. They exchange data and control through a network of self-synchronizing channels. A common application design pattern on this platform, called a work farm, is a parallel set of worker objects, with one input and one output stream. Statically configured work farms with homogeneous and heterogeneous sets of workers have been used in video compression and decompression, network processing, and graphics applications.

  8. gFEX, the ATLAS Calorimeter Level-1 Real Time Processor

    CERN Document Server

    AUTHOR|(SzGeCERN)759889; The ATLAS collaboration; Begel, Michael; Chen, Hucheng; Lanni, Francesco; Takai, Helio; Wu, Weihao

    2016-01-01

    The global feature extractor (gFEX) is a component of the Level-1 Calorimeter trigger Phase-I upgrade for the ATLAS experiment. It is intended to identify patterns of energy associated with the hadronic decays of high momentum Higgs, W, & Z bosons, top quarks, and exotic particles in real time at the LHC crossing rate. The single processor board will be packaged in an Advanced Telecommunications Computing Architecture (ATCA) module and implemented as a fast reconfigurable processor based on three Xilinx Vertex Ultra-scale FPGAs. The board will receive coarse-granularity information from all the ATLAS calorimeters on 276 optical fibers with the data transferred at the 40 MHz Large Hadron Collider (LHC) clock frequency. The gFEX will be controlled by a single system-on-chip processor, ZYNQ, that will be used to configure all the processor Field-Programmable Gate Array (FPGAs), monitor board health, and interface to external signals. Now, the pre-prototype board which includes one ZYNQ and one Vertex-7 FPGA ...

  9. gFEX, the ATLAS Calorimeter Level 1 Real Time Processor

    CERN Document Server

    Tang, Shaochun; The ATLAS collaboration

    2015-01-01

    The global feature extractor (gFEX) is a component of the Level-1Calorimeter trigger Phase-I upgrade for the ATLAS experiment. It is intended to identify patterns of energy associated with the hadronic decays of high momentum Higgs, W, & Z bosons, top quarks, and exotic particles in real time at the LHC crossing rate. The single processor board will be packaged in an Advanced Telecommunications Computing Architecture (ATCA) module and implemented as a fast reconfigurable processor based on three Xilinx Ultra-scale FPGAs. The board will receive coarse-granularity information from all the ATLAS calorimeters on 264 optical fibers with the data transferred at the 40 MHz LHC clock frequency. The gFEX will be controlled by a single system-on-chip processor, ZYNQ, that will be used to configure all the processor FPGAs, monitor board health, and interface to external signals. Now, the pre-prototype board which includes one ZYNQ and one Vertex-7 FPGA has been designed for testing and verification. The performance ...

  10. Design of Processors with Reconfigurable Microarchitecture

    Directory of Open Access Journals (Sweden)

    Andrey Mokhov

    2014-01-01

    Full Text Available Energy becomes a dominating factor for a wide spectrum of computations: from intensive data processing in “big data” companies resulting in large electricity bills, to infrastructure monitoring with wireless sensors relying on energy harvesting. In this context it is essential for a computation system to be adaptable to the power supply and the service demand, which often vary dramatically during runtime. In this paper we present an approach to building processors with reconfigurable microarchitecture capable of changing the way they fetch and execute instructions depending on energy availability and application requirements. We show how to use Conditional Partial Order Graphs to formally specify the microarchitecture of such a processor, explore the design possibilities for its instruction set, and synthesise the instruction decoder using correct-by-construction techniques. The paper is focused on the design methodology, which is evaluated by implementing a power-proportional version of Intel 8051 microprocessor.

  11. Real time processor for array speckle interferometry

    International Nuclear Information System (INIS)

    Chin, G.; Florez, J.; Borelli, R.; Fong, W.; Miko, J.; Trujillo, C.

    1989-01-01

    With the construction of several new large aperture telescopes and the development of large format array detectors in the near IR, the ability to obtain diffraction limited seeing via IR array speckle interferometry offers a powerful tool. We are constructing a real-time processor to acquire image frames, perform array flat-fielding, execute a 64 x 64 element 2D complex FFT, and to average the power spectrum all within the 25 msec coherence time for speckles at near IR wavelength. The processor is a compact unit controlled by a PC with real time display and data storage capability. It provides the ability to optimize observations and obtain results on the telescope rather than waiting several weeks before the data can be analyzed and viewed with off-line methods

  12. Parallel processor programs in the Federal Government

    Science.gov (United States)

    Schneck, P. B.; Austin, D.; Squires, S. L.; Lehmann, J.; Mizell, D.; Wallgren, K.

    1985-01-01

    In 1982, a report dealing with the nation's research needs in high-speed computing called for increased access to supercomputing resources for the research community, research in computational mathematics, and increased research in the technology base needed for the next generation of supercomputers. Since that time a number of programs addressing future generations of computers, particularly parallel processors, have been started by U.S. government agencies. The present paper provides a description of the largest government programs in parallel processing. Established in fiscal year 1985 by the Institute for Defense Analyses for the National Security Agency, the Supercomputing Research Center will pursue research to advance the state of the art in supercomputing. Attention is also given to the DOE applied mathematical sciences research program, the NYU Ultracomputer project, the DARPA multiprocessor system architectures program, NSF research on multiprocessor systems, ONR activities in parallel computing, and NASA parallel processor projects.

  13. RISC Processors and High Performance Computing

    Science.gov (United States)

    Bailey, David H.; Saini, Subhash; Craw, James M. (Technical Monitor)

    1995-01-01

    This tutorial will discuss the top five RISC microprocessors and the parallel systems in which they are used. It will provide a unique cross-machine comparison not available elsewhere. The effective performance of these processors will be compared by citing standard benchmarks in the context of real applications. The latest NAS Parallel Benchmarks, both absolute performance and performance per dollar, will be listed. The next generation of the NPB will be described. The tutorial will conclude with a discussion of future directions in the field. Technology Transfer Considerations: All of these computer systems are commercially available internationally. Information about these processors is available in the public domain, mostly from the vendors themselves. The NAS Parallel Benchmarks and their results have been previously approved numerous times for public release, beginning back in 1991.

  14. VIRTUS: a multi-processor system in FASTBUS

    International Nuclear Information System (INIS)

    Ellett, J.; Jackson, R.; Ritter, R.; Schlein, P.; Yaeger, D.; Zweizig, J.

    1986-01-01

    VIRTUS is a system of parallel MC68000-based processors interconnected by FASTBUS that is used either on-line as an intelligent trigger component or off-line for full event processing. Each processor receives the complete set of data from one event. The host computer, a VAX 11/780, down-line loads all software to the processors, controls and monitors the functioning of all processors, and writes processed data to tape. Instructions, programs, and data are transferred among the processors and the host in the form of fixed format, variable length data blocks. (Auth.)

  15. Resource sharing under global scheduling with partial processor bandwidth

    NARCIS (Netherlands)

    Afshar, Sara; Behnam, Moris; Bril, Reinder J.; Nolte, Thomas

    2015-01-01

    Resource efficient approaches are of great importance for resource constrained embedded systems. In this paper, we present an approach targeting systems where tasks of a critical application are partitioned on a multi-core platform and by using resource reservation techniques, the remaining

  16. Low-Latency Embedded Vision Processor (LLEVS)

    Science.gov (United States)

    2016-03-01

    algorithms, low-latency video processing, embedded image processor, wearable electronics, helmet-mounted systems, alternative night / day imaging...external subsystems and data sources with the device. The establishment of data interfaces in terms of data transfer rates, formats and types are...video signals from Near-visible Infrared (NVIR) sensor, Shortwave IR (SWIR) and Longwave IR (LWIR) is the main processing for Night Vision (NI) system

  17. Keystone Business Models for Network Security Processors

    Directory of Open Access Journals (Sweden)

    Arthur Low

    2013-07-01

    Full Text Available Network security processors are critical components of high-performance systems built for cybersecurity. Development of a network security processor requires multi-domain experience in semiconductors and complex software security applications, and multiple iterations of both software and hardware implementations. Limited by the business models in use today, such an arduous task can be undertaken only by large incumbent companies and government organizations. Neither the “fabless semiconductor” models nor the silicon intellectual-property licensing (“IP-licensing” models allow small technology companies to successfully compete. This article describes an alternative approach that produces an ongoing stream of novel network security processors for niche markets through continuous innovation by both large and small companies. This approach, referred to here as the "business ecosystem model for network security processors", includes a flexible and reconfigurable technology platform, a “keystone” business model for the company that maintains the platform architecture, and an extended ecosystem of companies that both contribute and share in the value created by innovation. New opportunities for business model innovation by participating companies are made possible by the ecosystem model. This ecosystem model builds on: i the lessons learned from the experience of the first author as a senior integrated circuit architect for providers of public-key cryptography solutions and as the owner of a semiconductor startup, and ii the latest scholarly research on technology entrepreneurship, business models, platforms, and business ecosystems. This article will be of interest to all technology entrepreneurs, but it will be of particular interest to owners of small companies that provide security solutions and to specialized security professionals seeking to launch their own companies.

  18. Silicon Processors Using Organically Reconfigurable Techniques (SPORT)

    Science.gov (United States)

    2014-05-19

    AFRL-OSR-VA-TR-2014-0132 SILICON PROCESSORS USING ORGANICALLY RECONFIGURABLE TECHNIQUES ( SPORT ) Dennis Prather UNIVERSITY OF DELAWARE Final Report 05...5a. CONTRACT NUMBER Silicon Processes for Organically Reconfigurable Techniques ( SPORT ) 5b. GRANT NUMBER FA9550-10-1-0363 5c...Contract: Silicon Processes for Organically Reconfigurable Techniques ( SPORT ) Contract #: FA9550-10-1-0363 Reporting Period: 1 July 2010 – 31 December

  19. Quantum chemistry on a superconducting quantum processor

    Energy Technology Data Exchange (ETDEWEB)

    Kaicher, Michael P.; Wilhelm, Frank K. [Theoretical Physics, Saarland University, 66123 Saarbruecken (Germany); Love, Peter J. [Department of Physics and Astronomy, Tufts University, Medford, MA 02155 (United States)

    2016-07-01

    Quantum chemistry is the most promising civilian application for quantum processors to date. We study its adaptation to superconducting (sc) quantum systems, computing the ground state energy of LiH through a variational hybrid quantum classical algorithm. We demonstrate how interactions native to sc qubits further reduce the amount of quantum resources needed, pushing sc architectures as a near-term candidate for simulations of more complex atoms/molecules.

  20. Debugging in a multi-processor environment

    International Nuclear Information System (INIS)

    Spann, J.M.

    1981-01-01

    The Supervisory Control and Diagnostic System (SCDS) for the Mirror Fusion Test Facility (MFTF) consists of nine 32-bit minicomputers arranged in a tightly coupled distributed computer system utilizing a share memory as the data exchange medium. Debugging of more than one program in the multi-processor environment is a difficult process. This paper describes what new tools were developed and how the testing of software is performed in the SCDS for the MFTF project

  1. Onboard Data Processors for Planetary Ice-Penetrating Sounding Radars

    Science.gov (United States)

    Tan, I. L.; Friesenhahn, R.; Gim, Y.; Wu, X.; Jordan, R.; Wang, C.; Clark, D.; Le, M.; Hand, K. P.; Plaut, J. J.

    2011-12-01

    Among the many concerns faced by outer planetary missions, science data storage and transmission hold special significance. Such missions must contend with limited onboard storage, brief data downlink windows, and low downlink bandwidths. A potential solution to these issues lies in employing onboard data processors (OBPs) to convert raw data into products that are smaller and closely capture relevant scientific phenomena. In this paper, we present the implementation of two OBP architectures for ice-penetrating sounding radars tasked with exploring Europa and Ganymede. Our first architecture utilizes an unfocused processing algorithm extended from the Mars Advanced Radar for Subsurface and Ionosphere Sounding (MARSIS, Jordan et. al. 2009). Compared to downlinking raw data, we are able to reduce data volume by approximately 100 times through OBP usage. To ensure the viability of our approach, we have implemented, simulated, and synthesized this architecture using both VHDL and Matlab models (with fixed-point and floating-point arithmetic) in conjunction with Modelsim. Creation of a VHDL model of our processor is the principle step in transitioning to actual digital hardware, whether in a FPGA (field-programmable gate array) or an ASIC (application-specific integrated circuit), and successful simulation and synthesis strongly indicate feasibility. In addition, we examined the tradeoffs faced in the OBP between fixed-point accuracy, resource consumption, and data product fidelity. Our second architecture is based upon a focused fast back projection (FBP) algorithm that requires a modest amount of computing power and on-board memory while yielding high along-track resolution and improved slope detection capability. We present an overview of the algorithm and details of our implementation, also in VHDL. With the appropriate tradeoffs, the use of OBPs can significantly reduce data downlink requirements without sacrificing data product fidelity. Through the development

  2. Intelligent trigger processor for the crystal box

    International Nuclear Information System (INIS)

    Sanders, G.H.; Butler, H.S.; Cooper, M.D.

    1981-01-01

    A large solid angle modular NaI(Tl) detector with 432 phototubes and 88 trigger scintillators is being used to search simultaneously for three lepton flavor changing decays of muon. A beam of up to 10 6 muons stopping per second with a 6% duty factor would yield up to 1000 triggers per second from random triple coincidences. A reduction of the trigger rate to 10 Hz is required from a hardwired primary trigger processor described in this paper. Further reduction to < 1 Hz is achieved by a microprocessor based secondary trigger processor. The primary trigger hardware imposes voter coincidence logic, stringent timing requirements, and a non-adjacency requirement in the trigger scintillators defined by hardwired circuits. Sophisticated geometric requirements are imposed by a PROM-based matrix logic, and energy and vector-momentum cuts are imposed by a hardwired processor using LSI flash ADC's and digital arithmetic loci. The secondary trigger employs four satellite microprocessors to do a sparse data scan, multiplex the data acquisition channels and apply additional event filtering

  3. Multibus-based parallel processor for simulation

    Science.gov (United States)

    Ogrady, E. P.; Wang, C.-H.

    1983-01-01

    A Multibus-based parallel processor simulation system is described. The system is intended to serve as a vehicle for gaining hands-on experience, testing system and application software, and evaluating parallel processor performance during development of a larger system based on the horizontal/vertical-bus interprocessor communication mechanism. The prototype system consists of up to seven Intel iSBC 86/12A single-board computers which serve as processing elements, a multiple transmission controller (MTC) designed to support system operation, and an Intel Model 225 Microcomputer Development System which serves as the user interface and input/output processor. All components are interconnected by a Multibus/IEEE 796 bus. An important characteristic of the system is that it provides a mechanism for a processing element to broadcast data to other selected processing elements. This parallel transfer capability is provided through the design of the MTC and a minor modification to the iSBC 86/12A board. The operation of the MTC, the basic hardware-level operation of the system, and pertinent details about the iSBC 86/12A and the Multibus are described.

  4. Code compression for VLIW embedded processors

    Science.gov (United States)

    Piccinelli, Emiliano; Sannino, Roberto

    2004-04-01

    The implementation of processors for embedded systems implies various issues: main constraints are cost, power dissipation and die area. On the other side, new terminals perform functions that require more computational flexibility and effort. Long code streams must be loaded into memories, which are expensive and power consuming, to run on DSPs or CPUs. To overcome this issue, the "SlimCode" proprietary algorithm presented in this paper (patent pending technology) can reduce the dimensions of the program memory. It can run offline and work directly on the binary code the compiler generates, by compressing it and creating a new binary file, about 40% smaller than the original one, to be loaded into the program memory of the processor. The decompression unit will be a small ASIC, placed between the Memory Controller and the System bus of the processor, keeping unchanged the internal CPU architecture: this implies that the methodology is completely transparent to the core. We present comparisons versus the state-of-the-art IBM Codepack algorithm, along with its architectural implementation into the ST200 VLIW family core.

  5. Techniques for optimizing inerting in electron processors

    International Nuclear Information System (INIS)

    Rangwalla, I.J.; Korn, D.J.; Nablo, S.V.

    1993-01-01

    The design of an ''inert gas'' distribution system in an electron processor must satisfy a number of requirements. The first of these is the elimination or control of beam produced ozone and NO x which can be transported from the process zone by the product into the work area. Since the tolerable levels for O 3 in occupied areas around the processor are 3 in the beam heated process zone, or exhausting and dilution of the gas at the processor exit. The second requirement of the inerting system is to provide a suitable environment for completing efficient, free radical initiated addition polymerization. The competition between radical loss through de-excitation and that from O 2 quenching must be understood. This group has used gas chromatographic analysis of electron cured coatings to study the trade-offs of delivered dose, dose rate and O 2 concentrations in the process zone to determine the tolerable ranges of parameter excursions for production quality control purposes. These techniques are described for an ink coating system on paperboard, where a broad range of process parameters have been studied (D, D radical, O 2 ). It is then shown how the technique is used to optimize the use of higher purity (10-100 ppm O 2 ) nitrogen gas for inerting, in combination with lower purity (2-20,000 ppm O 2 ) non-cryogenically produced gas, as from a membrane or pressure swing adsorption generators. (author)

  6. Multi-processor network implementations in Multibus II and VME

    International Nuclear Information System (INIS)

    Briegel, C.

    1992-01-01

    ACNET (Fermilab Accelerator Controls Network), a proprietary network protocol, is implemented in a multi-processor configuration for both Multibus II and VME. The implementations are contrasted by the bus protocol and software design goals. The Multibus II implementation provides for multiple processors running a duplicate set of tasks on each processor. For a network connected task, messages are distributed by a network round-robin scheduler. Further, messages can be stopped, continued, or re-routed for each task by user-callable commands. The VME implementation provides for multiple processors running one task across all processors. The process can either be fixed to a particular processor or dynamically allocated to an available processor depending on the scheduling algorithm of the multi-processing operating system. (author)

  7. Merged ozone profiles from four MIPAS processors

    Science.gov (United States)

    Laeng, Alexandra; von Clarmann, Thomas; Stiller, Gabriele; Dinelli, Bianca Maria; Dudhia, Anu; Raspollini, Piera; Glatthor, Norbert; Grabowski, Udo; Sofieva, Viktoria; Froidevaux, Lucien; Walker, Kaley A.; Zehner, Claus

    2017-04-01

    The Michelson Interferometer for Passive Atmospheric Sounding (MIPAS) was an infrared (IR) limb emission spectrometer on the Envisat platform. Currently, there are four MIPAS ozone data products, including the operational Level-2 ozone product processed at ESA, with the scientific prototype processor being operated at IFAC Florence, and three independent research products developed by the Istituto di Fisica Applicata Nello Carrara (ISAC-CNR)/University of Bologna, Oxford University, and the Karlsruhe Institute of Technology-Institute of Meteorology and Climate Research/Instituto de Astrofísica de Andalucía (KIT-IMK/IAA). Here we present a dataset of ozone vertical profiles obtained by merging ozone retrievals from four independent Level-2 MIPAS processors. We also discuss the advantages and the shortcomings of this merged product. As the four processors retrieve ozone in different parts of the spectra (microwindows), the source measurements can be considered as nearly independent with respect to measurement noise. Hence, the information content of the merged product is greater and the precision is better than those of any parent (source) dataset. The merging is performed on a profile per profile basis. Parent ozone profiles are weighted based on the corresponding error covariance matrices; the error correlations between different profile levels are taken into account. The intercorrelations between the processors' errors are evaluated statistically and are used in the merging. The height range of the merged product is 20-55 km, and error covariance matrices are provided as diagnostics. Validation of the merged dataset is performed by comparison with ozone profiles from ACE-FTS (Atmospheric Chemistry Experiment-Fourier Transform Spectrometer) and MLS (Microwave Limb Sounder). Even though the merging is not supposed to remove the biases of the parent datasets, around the ozone volume mixing ratio peak the merged product is found to have a smaller (up to 0.1 ppmv

  8. Application of an array processor to the analysis of magnetic data for the Doublet III tokamak

    International Nuclear Information System (INIS)

    Wang, T.S.; Saito, M.T.

    1980-08-01

    Discussed herein is a fast computational technique employing the Floating Point Systems AP-190L array processor to analyze magnetic data for the Doublet III tokamak, a fusion research device. Interpretation of the experimental data requires the repeated solution of a free-boundary nonlinear partial differential equation, which describes the magnetohydrodynamic (MHD) equilibrium of the plasma. For this particular application, we have found that the array processor is only 1.4 and 3.5 times slower than the CDC-7600 and CRAY computers, respectively. The overhead on the host DEC-10 computer was kept to a minimum by chaining the complete Poisson solver and free-boundary algorithm into one single-load module using the vector function chainer (VFC). A simple time-sharing scheme for using the MHD code is also discussed

  9. ATLAS FTK: Fast Track Trigger

    CERN Document Server

    Volpi, Guido; The ATLAS collaboration

    2015-01-01

    An overview of the ATLAS Fast Tracker processor is presented, reporting the design of the system, its expected performance, and the integration status. The next LHC runs, with a significant increase in instantaneous luminosity, will provide a big challenge to the trigger and data acquisition systems of all the experiments. An intensive use of the tracking information at the trigger level will be important to keep high efficiency in interesting events, despite the increase in multiple p-p collisions per bunch crossing (pile-up). In order to increase the use of tracks within the High Level Trigger (HLT), the ATLAS experiment planned the installation of an hardware processor dedicated to tracking: the Fast TracKer (FTK) processor. The FTK is designed to perform full scan track reconstruction at every Level-1 accept. To achieve this goal, the FTK uses a fully parallel architecture, with algorithms designed to exploit the computing power of custom VLSI chips, the Associative Memory, as well as modern FPGAs. The FT...

  10. A Review Of Fault Tolerant Scheduling In Multicore Systems

    Directory of Open Access Journals (Sweden)

    Shefali Malhotra

    2015-05-01

    Full Text Available Abstract In this paper we have discussed about various fault tolerant task scheduling algorithm for multi core system based on hardware and software. Hardware based algorithm which is blend of Triple Modulo Redundancy and Double Modulo Redundancy in which Agricultural Vulnerability Factor is considered while deciding the scheduling other than EDF and LLF scheduling algorithms. In most of the real time system the dominant part is shared memory.Low overhead software based fault tolerance approach can be implemented at user-space level so that it does not require any changes at application level. Here redundant multi-threaded processes are used. Using those processes we can detect soft errors and recover from them. This method gives low overhead fast error detection and recovery mechanism. The overhead incurred by this method ranges from 0 to 18 for selected benchmarks. Hybrid Scheduling Method is another scheduling approach for real time systems. Dynamic fault tolerant scheduling gives high feasibility rate whereas task criticality is used to select the type of fault recovery method in order to tolerate the maximum number of faults.

  11. Modcomp MAX IV System Processors reference guide

    Energy Technology Data Exchange (ETDEWEB)

    Cummings, J.

    1990-10-01

    A user almost always faces a big problem when having to learn to use a new computer system. The information necessary to use the system is often scattered throughout many different manuals. The user also faces the problem of extracting the information really needed from each manual. Very few computer vendors supply a single Users Guide or even a manual to help the new user locate the necessary manuals. Modcomp is no exception to this, Modcomp MAX IV requires that the user be familiar with the system file usage which adds to the problem. At General Atomics there is an ever increasing need for new users to learn how to use the Modcomp computers. This paper was written to provide a condensed Users Reference Guide'' for Modcomp computer users. This manual should be of value not only to new users but any users that are not Modcomp computer systems experts. This Users Reference Guide'' is intended to provided the basic information for the use of the various Modcomp System Processors necessary to, create, compile, link-edit, and catalog a program. Only the information necessary to provide the user with a basic understanding of the Systems Processors is included. This document provides enough information for the majority of programmers to use the Modcomp computers without having to refer to any other manuals. A lot of emphasis has been placed on the file description and usage for each of the System Processors. This allows the user to understand how Modcomp MAX IV does things rather than just learning the system commands.

  12. Optical linear algebra processors - Architectures and algorithms

    Science.gov (United States)

    Casasent, David

    1986-01-01

    Attention is given to the component design and optical configuration features of a generic optical linear algebra processor (OLAP) architecture, as well as the large number of OLAP architectures, number representations, algorithms and applications encountered in current literature. Number-representation issues associated with bipolar and complex-valued data representations, high-accuracy (including floating point) performance, and the base or radix to be employed, are discussed, together with case studies on a space-integrating frequency-multiplexed architecture and a hybrid space-integrating and time-integrating multichannel architecture.

  13. Dual-scale topology optoelectronic processor.

    Science.gov (United States)

    Marsden, G C; Krishnamoorthy, A V; Esener, S C; Lee, S H

    1991-12-15

    The dual-scale topology optoelectronic processor (D-STOP) is a parallel optoelectronic architecture for matrix algebraic processing. The architecture can be used for matrix-vector multiplication and two types of vector outer product. The computations are performed electronically, which allows multiplication and summation concepts in linear algebra to be generalized to various nonlinear or symbolic operations. This generalization permits the application of D-STOP to many computational problems. The architecture uses a minimum number of optical transmitters, which thereby reduces fabrication requirements while maintaining area-efficient electronics. The necessary optical interconnections are space invariant, minimizing space-bandwidth requirements.

  14. Nuclear interactive evaluations on distributed processors

    International Nuclear Information System (INIS)

    Dix, G.E.; Congdon, S.P.

    1988-01-01

    BWR [boiling water reactor] nuclear design is a complicated process, involving trade-offs among a variety of conflicting objectives. Complex computer calculations and usually required for each design iteration. GE Nuclear Energy has implemented a system where the evaluations are performed interactively on a large number of small microcomputers. This approach minimizes the time it takes to carry out design iterations even through the processor speeds are low compared with modern super computers. All of the desktop microcomputers are linked to a common data base via an ethernet communications system so that design data can be shared and data quality can be maintained

  15. Lattice gauge theory using parallel processors

    International Nuclear Information System (INIS)

    Lee, T.D.; Chou, K.C.; Zichichi, A.

    1987-01-01

    The book's contents include: Lattice Gauge Theory Lectures: Introduction and Current Fermion Simulations; Monte Carlo Algorithms for Lattice Gauge Theory; Specialized Computers for Lattice Gauge Theory; Lattice Gauge Theory at Finite Temperature: A Monte Carlo Study; Computational Method - An Elementary Introduction to the Langevin Equation, Present Status of Numerical Quantum Chromodynamics; Random Lattice Field Theory; The GF11 Processor and Compiler; and The APE Computer and First Physics Results; Columbia Supercomputer Project: Parallel Supercomputer for Lattice QCD; Statistical and Systematic Errors in Numerical Simulations; Monte Carlo Simulation for LGT and Programming Techniques on the Columbia Supercomputer; Food for Thought: Five Lectures on Lattice Gauge Theory

  16. Introduction to programming multiple-processor computers

    International Nuclear Information System (INIS)

    Hicks, H.R.; Lynch, V.E.

    1985-04-01

    FORTRAN applications programs can be executed on multiprocessor computers in either a unitasking (traditional) or multitasking form. The latter allows a single job to use more than one processor simultaneously, with a consequent reduction in wall-clock time and, perhaps, the cost of the calculation. An introduction to programming in this environment is presented. The concepts of synchronization and data sharing using EVENTS and LOCKS are illustrated with examples. The strategy of strong synchronization and the use of synchronization templates are proposed. We emphasize that incorrect multitasking programs can produce irreproducible results, which makes debugging more difficult

  17. Recommending the heterogeneous cluster type multi-processor system computing

    International Nuclear Information System (INIS)

    Iijima, Nobukazu

    2010-01-01

    Real-time reactor simulator had been developed by reusing the equipment of the Musashi reactor and its performance improvement became indispensable for research tools to increase sampling rate with introduction of arithmetic units using multi-Digital Signal Processor(DSP) system (cluster). In order to realize the heterogeneous cluster type multi-processor system computing, combination of two kinds of Control Processor (CP) s, Cluster Control Processor (CCP) and System Control Processor (SCP), were proposed with Large System Control Processor (LSCP) for hierarchical cluster if needed. Faster computing performance of this system was well evaluated by simulation results for simultaneous execution of plural jobs and also pipeline processing between clusters, which showed the system led to effective use of existing system and enhancement of the cost performance. (T. Tanaka)

  18. SSC 254 Screen-Based Word Processors: Production Tests. The Lanier Word Processor.

    Science.gov (United States)

    Moyer, Ruth A.

    Designed for use in Trident Technical College's Secretarial Lab, this series of 12 production tests focuses on the use of the Lanier Word Processor for a variety of tasks. In tests 1 and 2, students are required to type and print out letters. Tests 3 through 8 require students to reformat a text; make corrections on a letter; divide and combine…

  19. Harnessing Petaflop-Scale Multi-Core Supercomputing for Problems in Space Science

    Science.gov (United States)

    Albright, B. J.; Yin, L.; Bowers, K. J.; Daughton, W.; Bergen, B.; Kwan, T. J.

    2008-12-01

    The particle-in-cell kinetic plasma code VPIC has been migrated successfully to the world's fastest supercomputer, Roadrunner, a hybrid multi-core platform built by IBM for the Los Alamos National Laboratory. How this was achieved will be described and examples of state-of-the-art calculations in space science, in particular, the study of magnetic reconnection, will be presented. With VPIC on Roadrunner, we have performed, for the first time, plasma PIC calculations with over one trillion particles, >100× larger than calculations considered "heroic" by community standards. This allows examination of physics at unprecedented scale and fidelity. Roadrunner is an example of an emerging paradigm in supercomputing: the trend toward multi-core systems with deep hierarchies and where memory bandwidth optimization is vital to achieving high performance. Getting VPIC to perform well on such systems is a formidable challenge: the core algorithm is memory bandwidth limited with low compute-to-data ratio and requires random access to memory in its inner loop. That we were able to get VPIC to perform and scale well, achieving >0.374 Pflop/s and linear weak scaling on real physics problems on up to the full 12240-core Roadrunner machine, bodes well for harnessing these machines for our community's needs in the future. Many of the design considerations encountered commute to other multi-core and accelerated (e.g., via GPU) platforms and we modified VPIC with flexibility in mind. These will be summarized and strategies for how one might adapt a code for such platforms will be shared. Work performed under the auspices of the U.S. DOE by the LANS LLC Los Alamos National Laboratory. Dr. Bowers is a LANL Guest Scientist; he is presently at D. E. Shaw Research LLC, 120 W 45th Street, 39th Floor, New York, NY 10036.

  20. Multiprocessor Real-Time Scheduling with Hierarchical Processor Affinities

    OpenAIRE

    Bonifaci , Vincenzo; Brandenburg , Björn; D'Angelo , Gianlorenzo; Marchetti-Spaccamela , Alberto

    2016-01-01

    International audience; Many multiprocessor real-time operating systems offer the possibility to restrict the migrations of any task to a specified subset of processors by setting affinity masks. A notion of " strong arbitrary processor affinity scheduling " (strong APA scheduling) has been proposed; this notion avoids schedulability losses due to overly simple implementations of processor affinities. Due to potential overheads, strong APA has not been implemented so far in a real-time operat...

  1. Acceleration of stereo-matching on multi-core CPU and GPU

    OpenAIRE

    Tian, Xu; Cockshott, Paul; Oehler, Susanne

    2014-01-01

    This paper presents an accelerated version of a\\ud dense stereo-correspondence algorithm for two different parallelism\\ud enabled architectures, multi-core CPU and GPU. The\\ud algorithm is part of the vision system developed for a binocular\\ud robot-head in the context of the CloPeMa 1 research project.\\ud This research project focuses on the conception of a new clothes\\ud folding robot with real-time and high resolution requirements\\ud for the vision system. The performance analysis shows th...

  2. Massively Parallel Sort-Merge Joins in Main Memory Multi-Core Database Systems

    OpenAIRE

    Albutiu, Martina-Cezara; Kemper, Alfons; Neumann, Thomas

    2012-01-01

    Two emerging hardware trends will dominate the database system technology in the near future: increasing main memory capacities of several TB per server and massively parallel multi-core processing. Many algorithmic and control techniques in current database technology were devised for disk-based systems where I/O dominated the performance. In this work we take a new look at the well-known sort-merge join which, so far, has not been in the focus of research in scalable massively parallel mult...

  3. Design of Multi-core Fiber Patch Panel for Space Division Multiplexing Implementations

    Science.gov (United States)

    González, Luz E.; Morales, Alvaro; Rommel, Simon; Jørgensen, Bo F.; Porras-Montenegro, N.; Tafur Monroy, Idelfonso

    2018-03-01

    A multi-core fiber (MCF) patch panel was designed, allowing easy coupling of individual signals to and from a 7-core MCF. The device was characterized, measuring insertion loss and cross talk, finding highest insertion loss and lowest crosstalk at 1300 nm with values of 9.7 dB and -36.5 dB respectively, while at 1600 nm insertion loss drops to 4.8 dB and crosstalk increases to -24.1 dB. Two MCF splices between the fan-in module, the MCF, and the fan-out module are included in the characterization, and splicing parameters are discussed.

  4. Linux block IO: introducing multi-queue SSD access on multi-core systems

    DEFF Research Database (Denmark)

    Bjørling, Matias; Axboe, Jens; Nellans, David

    2013-01-01

    The IO performance of storage devices has accelerated from hundreds of IOPS five years ago, to hundreds of thousands of IOPS today, and tens of millions of IOPS projected in five years. This sharp evolution is primarily due to the introduc- tion of NAND-flash devices and their data parallel desig...... generation block layer that is capable of handling tens of millions of IOPS on a multi-core system equipped with a single storage device. Our experiments show that our design scales graciously with the number of cores, even on NUMA systems with multiple sockets....

  5. New multicore low mode noise scrambling fiber for applications in high-resolution spectroscopy

    Science.gov (United States)

    Haynes, Dionne M.; Gris-Sanchez, Itandehui; Ehrlich, Katjana; Birks, Tim A.; Giannone, Domenico; Haynes, Roger

    2014-07-01

    We present a new type of multicore fiber (MCF) and photonic lantern that consists of 511 individual cores designed to operate over a broadband visible wavelength range (380-860nm). It combines the coupling efficiency of a multimode fiber with modal stability intrinsic to a single mode fibre. It is designed to provide phase and amplitude scrambling to achieve a stable near field and far field illumination pattern during input coupling variations; it also has low modal noise for increased photometric stability. Preliminary results are presented for the new MCF as well as current state of the art octagonal fiber for comparison.

  6. Distributed CPU multi-core implementation of SIRT with vectorized matrix kernel for micro-CT

    Energy Technology Data Exchange (ETDEWEB)

    Gregor, Jens [Tennessee Univ., Knoxville, TN (United States)

    2011-07-01

    We describe an implementation of SIRT for execution using a cluster of multi-core PCs. Algorithmic techniques are presented for reducing the size and computational cost of a reconstruction including near-optimal relaxation, scalar preconditioning, orthogonalized ordered subsets, and data-driven focus of attention. Implementation wise, a scheme is outlined which provides each core mutex-free access to its local shared memory while also balancing the workload across the cluster, and the system matrix is computed on-the-fly using vectorized code. Experimental results show the efficacy of the approach. (orig.)

  7. Tunable arbitrary unitary transformer based on multiple sections of multicore fibers with phase control.

    Science.gov (United States)

    Zhou, Junhe; Wu, Jianjie; Hu, Qinsong

    2018-02-05

    In this paper, we propose a novel tunable unitary transformer, which can achieve arbitrary discrete unitary transforms. The unitary transformer is composed of multiple sections of multi-core fibers with closely aligned coupled cores. Phase shifters are inserted before and after the sections to control the phases of the waves in the cores. A simple algorithm is proposed to find the optimal phase setup for the phase shifters to realize the desired unitary transforms. The proposed device is fiber based and is particularly suitable for the mode division multiplexing systems. A tunable mode MUX/DEMUX for a three-mode fiber is designed based on the proposed structure.

  8. Multicore and Accelerator Development for a Leadership-Class Stellar Astrophysics Code

    Energy Technology Data Exchange (ETDEWEB)

    Messer, Bronson [ORNL; Harris, James A [ORNL; Parete-Koon, Suzanne T [ORNL; Chertkow, Merek A [ORNL

    2013-01-01

    We describe recent development work on the core-collapse supernova code CHIMERA. CHIMERA has consumed more than 100 million cpu-hours on Oak Ridge Leadership Computing Facility (OLCF) platforms in the past 3 years, ranking it among the most important applications at the OLCF. Most of the work described has been focused on exploiting the multicore nature of the current platform (Jaguar) via, e.g., multithreading using OpenMP. In addition, we have begun a major effort to marshal the computational power of GPUs with CHIMERA. The impending upgrade of Jaguar to Titan a 20+ PF machine with an NVIDIA GPU on many nodes makes this work essential.

  9. Coordinated Energy Management in Heterogeneous Processors

    Directory of Open Access Journals (Sweden)

    Indrani Paul

    2014-01-01

    Full Text Available This paper examines energy management in a heterogeneous processor consisting of an integrated CPU–GPU for high-performance computing (HPC applications. Energy management for HPC applications is challenged by their uncompromising performance requirements and complicated by the need for coordinating energy management across distinct core types – a new and less understood problem. We examine the intra-node CPU–GPU frequency sensitivity of HPC applications on tightly coupled CPU–GPU architectures as the first step in understanding power and performance optimization for a heterogeneous multi-node HPC system. The insights from this analysis form the basis of a coordinated energy management scheme, called DynaCo, for integrated CPU–GPU architectures. We implement DynaCo on a modern heterogeneous processor and compare its performance to a state-of-the-art power- and performance-management algorithm. DynaCo improves measured average energy-delay squared (ED2 product by up to 30% with less than 2% average performance loss across several exascale and other HPC workloads.

  10. An Efficient Parallel SAT Solver Exploiting Multi-Core Environments, Phase II

    Data.gov (United States)

    National Aeronautics and Space Administration — The hundreds of stream cores in the latest graphics processors (GPUs), and the possibility to execute non-graphics computations on them, open unprecedented levels of...

  11. Expert System Constant False Alarm Rate (CFAR) Processor

    National Research Council Canada - National Science Library

    Wicks, Michael C

    2006-01-01

    An artificial intelligence system improves radar signal processor performance by increasing target probability of detection and reducing probability of false alarm in a severe radar clutter environment...

  12. Special processor for in-core control systems

    International Nuclear Information System (INIS)

    Golovanov, M.N.; Duma, V.R.; Levin, G.L.; Mel'nikov, A.V.; Polikanin, A.V.; Filatov, V.P.

    1978-01-01

    The BUTs-20 special processor is discussed, designed to control the units of the in-core control equipment which are incorporated into the VECTOR communication channel, and to provide preliminary data processing prior to computer calculations. A set of instructions and flowsheet of the processor, organization of its communication with memories and other units of the system are given. The processor components: a control unit and an arithmetic logical unit are discussed. It is noted that the special processor permits more effective utilization of the computer time

  13. Development of level 2 processor for the readout of TMC

    International Nuclear Information System (INIS)

    Arai, Y.; Ikeno, M.; Murata, T.; Sudo, F.; Emura, T.

    1995-01-01

    We have developed a prototype 8-bit processor for the level 2 data processing for the Time Memory Cell (TMC). The first prototype processor successfully runs with 18 MHz clock. The operation of same clock frequency as TMC (30 MHz) will be easily achieved with simple modifications. Although the processor is very primitive one but shows its powerful performance and flexibility. To realize the compact TMC/L2P (Level 2 Processor) system, it is better to include the microcode memory within the chip. Encoding logic of the microcode must be included to reduce the microcode memory in this case. (J.P.N.)

  14. Single-step generation of metal-plasma polymer multicore@shell nanoparticles from the gas phase.

    Science.gov (United States)

    Solař, Pavel; Polonskyi, Oleksandr; Olbricht, Ansgar; Hinz, Alexander; Shelemin, Artem; Kylián, Ondřej; Choukourov, Andrei; Faupel, Franz; Biederman, Hynek

    2017-08-17

    Nanoparticles composed of multiple silver cores and a plasma polymer shell (multicore@shell) were prepared in a single step with a gas aggregation cluster source operating with Ar/hexamethyldisiloxane mixtures and optionally oxygen. The size distribution of the metal inclusions as well as the chemical composition and the thickness of the shells were found to be controlled by the composition of the working gas mixture. Shell matrices ranging from organosilicon plasma polymer to nearly stoichiometric SiO 2 were obtained. The method allows facile fabrication of multicore@shell nanoparticles with tailored functional properties, as demonstrated here with the optical response.

  15. Avaliação do compartilhamento das memórias cache no desempenho de arquiteturas multi-core

    OpenAIRE

    Marco Antonio Zanata Alves

    2009-01-01

    No atual contexto de inovações em multi-core, em que as novas tecnologias de integração estão fornecendo um número crescente de transistores por chip, o estudo de técnicas de aumento de vazão de dados é de suma importância para os atuais e futuros processadores multi-core e many-core. Com a contínua demanda por desempenho computacional, as memórias cache vêm sendo largamente adotadas nos diversos tipos de projetos arquiteturais de computadores. Os atuais processadores disponíveis no mercado a...

  16. Multi-core events in cosmic-ray induced interactions with lead at around 10 TeV

    International Nuclear Information System (INIS)

    Amato, N.; Arata, N.

    1989-01-01

    The analysis is made on the cosmic-ray induced interactions with lead at around 10 TeV on the basis of emulsion chamber data at Chacaltaya. A special attention is paid to the events detected as multi-cores under the spatial resolution of a few tens of microns. The observation of six double-core events and two triple-core events with the average invariant mass of 1.8 GeV/c 2 leads to the estimation on production frequency of such multicores as about 5% at 10 TeV at the atmospheric depth 540 gr/cm 2 . (author)

  17. A New and Simple Method for Crosstalk Estimation in Homogeneous Trench-Assisted Multi-Core Fibers

    DEFF Research Database (Denmark)

    Ye, Feihong; Tu, Jiajing; Saitoh, Kunimasa

    2014-01-01

    A new and simple method for inter-core crosstalk estimation in homogeneous trench-assisted multi-core fibers is presented. The crosstalk calculated by this method agrees well with experimental measurement data for two kinds of fabricated 12-core fibers.......A new and simple method for inter-core crosstalk estimation in homogeneous trench-assisted multi-core fibers is presented. The crosstalk calculated by this method agrees well with experimental measurement data for two kinds of fabricated 12-core fibers....

  18. An accurate projection algorithm for array processor based SPECT systems

    International Nuclear Information System (INIS)

    King, M.A.; Schwinger, R.B.; Cool, S.L.

    1985-01-01

    A data re-projection algorithm has been developed for use in single photon emission computed tomography (SPECT) on an array processor based computer system. The algorithm makes use of an accurate representation of pixel activity (uniform square pixel model of intensity distribution), and is rapidly performed due to the efficient handling of an array based algorithm and the Fast Fourier Transform (FFT) on parallel processing hardware. The algorithm consists of using a pixel driven nearest neighbour projection operation to an array of subdivided projection bins. This result is then convolved with the projected uniform square pixel distribution before being compressed to original bin size. This distribution varies with projection angle and is explicitly calculated. The FFT combined with a frequency space multiplication is used instead of a spatial convolution for more rapid execution. The new algorithm was tested against other commonly used projection algorithms by comparing the accuracy of projections of a simulated transverse section of the abdomen against analytically determined projections of that transverse section. The new algorithm was found to yield comparable or better standard error and yet result in easier and more efficient implementation on parallel hardware. Applications of the algorithm include iterative reconstruction and attenuation correction schemes and evaluation of regions of interest in dynamic and gated SPECT

  19. Algorithms for computational fluid dynamics n parallel processors

    International Nuclear Information System (INIS)

    Van de Velde, E.F.

    1986-01-01

    A study of parallel algorithms for the numerical solution of partial differential equations arising in computational fluid dynamics is presented. The actual implementation on parallel processors of shared and nonshared memory design is discussed. The performance of these algorithms is analyzed in terms of machine efficiency, communication time, bottlenecks and software development costs. For elliptic equations, a parallel preconditioned conjugate gradient method is described, which has been used to solve pressure equations discretized with high order finite elements on irregular grids. A parallel full multigrid method and a parallel fast Poisson solver are also presented. Hyperbolic conservation laws were discretized with parallel versions of finite difference methods like the Lax-Wendroff scheme and with the Random Choice method. Techniques are developed for comparing the behavior of an algorithm on different architectures as a function of problem size and local computational effort. Effective use of these advanced architecture machines requires the use of machine dependent programming. It is shown that the portability problems can be minimized by introducing high level operations on vectors and matrices structured into program libraries

  20. Data driven processor 'Vertex Trigger' for B experiments

    International Nuclear Information System (INIS)

    Hartouni, E.P.

    1993-01-01

    Data Driven Processors (DDP's) are specialized computation engines configured to solve specific numerical problems, such as vertex reconstruction. The architecture of the DDP which is the subject of this talk was designed and implemented by W. Sippach and B.C. Knapp at Nevis Lab. in the early 1980's. This particular implementation allows multiple parallel streams of data to provide input to a heterogenous collection of simple operators whose interconnection form an algorithm. The local data flow control allows this device to execute algorithms extremely quickly provided that care is taken in the layout of the algorithm. I/O rates of several hundred megabytes/second are routinely achieved thus making DDP's attractive candidates for complex online calculations. The original question was open-quote can a DDP reconstruct tracks in a Silicon Vertex Detector, find events with a separated vertex and do it fast enough to be used as an online trigger?close-quote Restating this inquiry as three questions and describing the answers to the questions will be the subject of this talk. The three specific questions are: (1) Can an algorithm be found which reconstructs tracks in a planar geometry and no magnetic field; (2) Can separated vertices be recognized in some way; (3) Can the algorithm be implemented in the Nevis-UMass and DDP and execute in 10-20 μs?