WorldWideScience

Sample records for fast multicore processor

  1. Periodic activity migration for fast sequential execution in future heterogeneous multicore processors

    OpenAIRE

    Michaud, Pierre

    2008-01-01

    On each new technology generation, miniaturization permits putting twice as many computing cores on the same silicon area, potentially doubling the processor performance. However, if sequential execution is not accelerated at the same time, Amdahl's law will eventually limit the actual performance. Hence it will be beneficial to have asymmetric multicores where some cores are specialized for fast sequential execution. This specialization may be achieved by architectural means, but it may also...
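
The sequential-bottleneck argument above can be made concrete with Amdahl's law: if a fraction p of a program parallelizes perfectly, adding cores can never buy more than 1/(1-p) overall speedup. A quick illustrative calculation (not from the paper):

```python
def amdahl_speedup(parallel_fraction, cores):
    """Overall speedup when only parallel_fraction of the work scales with cores."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

# With 5% of the work strictly sequential, speedup saturates below 20x no matter
# how many cores are added, which is why accelerating sequential execution on
# some (asymmetric) cores matters.
saturation = [amdahl_speedup(0.95, n) for n in (2, 4, 16, 1024)]
```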

  2. A Fast Scheme to Investigate Thermal-Aware Scheduling Policy for Multicore Processors

    Science.gov (United States)

    He, Liqiang; Narisu, Cha

    With more cores integrated into a single chip, the overall power consumption of the multiple concurrently running programs increases dramatically in a CMP, which makes the thermal problem more severe than in a traditional superscalar processor. To mitigate the thermal problem of a multicore processor, two orthogonal kinds of technique can be exploited. One is the commonly used Dynamic Thermal Management technique. The other is a thermal-aware thread scheduling policy. For the latter, some general ideas have been proposed by academic and industrial researchers. The difficulty in investigating the effectiveness of a thread scheduling policy is the huge search space arising from the different possible mapping combinations for a given multi-program workload. In this paper, we extend a simple thermal model originally used in a single-core processor to a multicore environment and propose a fast scheme to search or compare the thermal effectiveness of different scheduling policies using the new model. The experimental results show that the proposed scheme can predict the thermal characteristics of the different scheduling policies with reasonable accuracy and help researchers to quickly investigate the performance of the policies without detailed, time-consuming simulations.
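
To illustrate why thread placement affects temperature at all, here is a toy steady-state thermal model (constants and coupling structure are assumptions for illustration, not the model extended in the paper): each core heats up with its own power and, more weakly, with its neighbours' power, so spreading hot threads apart lowers the peak temperature.

```python
AMBIENT, R_SELF, R_NEIGHBOUR = 45.0, 0.8, 0.2   # illustrative constants (C, C/W)

def core_temps(power, neighbours):
    """Steady-state temperature per core: ambient + own heating + neighbour coupling."""
    return [AMBIENT + R_SELF * p + R_NEIGHBOUR * sum(power[j] for j in neighbours[i])
            for i, p in enumerate(power)]

# 2x2 chip; cores 0-1 and 2-3 share a row, 0-2 and 1-3 share a column.
NEIGHBOURS = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}

adjacent = core_temps([30, 30, 5, 5], NEIGHBOURS)  # hot threads side by side
spread = core_temps([30, 5, 5, 30], NEIGHBOURS)    # hot threads on the diagonal
```

In this toy model the adjacent mapping peaks at 76 degrees while the diagonal mapping peaks at 71, which is the kind of gap a thermal-aware scheduling policy tries to exploit.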

  3. Tiled Multicore Processors

    Science.gov (United States)

    Taylor, Michael B.; Lee, Walter; Miller, Jason E.; Wentzlaff, David; Bratt, Ian; Greenwald, Ben; Hoffmann, Henry; Johnson, Paul R.; Kim, Jason S.; Psota, James; Saraf, Arvind; Shnidman, Nathan; Strumpen, Volker; Frank, Matthew I.; Amarasinghe, Saman; Agarwal, Anant

    For the last few decades Moore’s Law has continually provided exponential growth in the number of transistors on a single chip. This chapter describes a class of architectures, called tiled multicore architectures, that are designed to exploit massive quantities of on-chip resources in an efficient, scalable manner. Tiled multicore architectures combine each processor core with a switch to create a modular element called a tile. Tiles are replicated on a chip as needed to create multicores with any number of tiles. The Raw processor, a pioneering example of a tiled multicore processor, is examined in detail to explain the philosophy, design, and strengths of such architectures. Raw addresses the challenge of building a general-purpose architecture that performs well on a larger class of stream and embedded computing applications than existing microprocessors, while still running existing ILP-based sequential programs with reasonable performance. Central to achieving this goal is Raw’s ability to exploit all forms of parallelism, including ILP, DLP, TLP, and Stream parallelism. Raw approaches this challenge by implementing plenty of on-chip resources - including logic, wires, and pins - in a tiled arrangement, and exposing them through a new ISA, so that the software can take advantage of these resources for parallel applications. Compared to a traditional superscalar processor, Raw performs within a factor of 2x for sequential applications with a very low degree of ILP, about 2x-9x better for higher levels of ILP, and 10x-100x better when highly parallel applications are coded in a stream language or optimized by hand.

  4. Enabling Future Robotic Missions with Multicore Processors

    Science.gov (United States)

    Powell, Wesley A.; Johnson, Michael A.; Wilmot, Jonathan; Some, Raphael; Gostelow, Kim P.; Reeves, Glenn; Doyle, Richard J.

    2011-01-01

    Recent commercial developments in multicore processors (e.g. Tilera, Clearspeed, HyperX) have provided an option for high performance embedded computing that rivals the performance attainable with FPGA-based reconfigurable computing architectures. Furthermore, these processors offer more straightforward and streamlined application development by allowing the use of conventional programming languages and software tools in lieu of hardware design languages such as VHDL and Verilog. With these advantages, multicore processors can significantly enhance the capabilities of future robotic space missions. This paper will discuss these benefits, along with onboard processing applications where multicore processing can offer advantages over existing or competing approaches. This paper will also discuss the key architectural features of current commercial multicore processors. In comparison to the current art, the features and advancements necessary for spaceflight multicore processors will be identified. These include power reduction, radiation hardening, inherent fault tolerance, and support for common spacecraft bus interfaces. Lastly, this paper will explore how multicore processors might evolve with advances in electronics technology and how avionics architectures might evolve once multicore processors are inserted into NASA robotic spacecraft.

  5. Taxonomy of Data Prefetching for Multicore Processors

    Institute of Scientific and Technical Information of China (English)

    Surendra Byna; Yong Chen; Xian-He Sun

    2009-01-01

    Data prefetching is an effective data access latency hiding technique to mask the CPU stall caused by cache misses and to bridge the performance gap between processor and memory. With hardware and/or software support, data prefetching brings data closer to a processor before it is actually needed. Many prefetching techniques have been developed for single-core processors. Recent developments in processor technology have brought multicore processors into the mainstream. While some of the single-core prefetching techniques are directly applicable to multicore processors, numerous novel strategies have been proposed in the past few years to take advantage of multiple cores. This paper aims to provide a comprehensive review of the state-of-the-art prefetching techniques, and proposes a taxonomy that classifies various design concerns in developing a prefetching strategy, especially for multicore processors. We compare various existing methods through analysis as well.
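
As an example of one leaf in such a taxonomy, a hardware-style stride prefetcher tracks the last address and stride per load instruction and issues a prefetch once the stride repeats. A minimal single-core sketch (illustrative, not an algorithm from the paper):

```python
class StridePrefetcher:
    """Per-PC stride prefetcher: remembers (last address, last stride) for each
    load instruction and prefetches the next block when the stride repeats."""

    def __init__(self):
        self.table = {}   # pc -> (last_addr, last_stride)

    def access(self, pc, addr):
        prefetch = None
        if pc in self.table:
            last_addr, last_stride = self.table[pc]
            stride = addr - last_addr
            if stride == last_stride and stride != 0:
                prefetch = addr + stride   # confident pattern: fetch ahead
            self.table[pc] = (addr, stride)
        else:
            self.table[pc] = (addr, 0)
        return prefetch
```

Three accesses by the same load at addresses 100, 108, 116 establish a stride of 8, so the third access triggers a prefetch of address 124.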

  6. Multi-core processors - An overview

    CERN Document Server

    Venu, Balaji

    2011-01-01

    Microprocessors have revolutionized the world we live in, and continuous efforts are being made to manufacture not only faster chips but also smarter ones. A number of techniques such as data-level parallelism, instruction-level parallelism and hyper-threading (Intel's HT) already exist and have dramatically improved the performance of microprocessor cores. This paper briefly reviews the evolution of multi-core processors, then introduces the technology and its advantages in today's world. The paper concludes by detailing the challenges currently faced by multi-core processors and how the industry is trying to address these issues.

  7. A fast band–Krylov eigensolver for macromolecular functional motion simulation on multicore architectures and graphics processors

    Energy Technology Data Exchange (ETDEWEB)

    Aliaga, José I., E-mail: aliaga@uji.es [Depto. Ingeniería y Ciencia de Computadores, Universitat Jaume I, Castellón (Spain); Alonso, Pedro [Departamento de Sistemas Informáticos y Computación, Universitat Politècnica de València (Spain); Badía, José M. [Depto. Ingeniería y Ciencia de Computadores, Universitat Jaume I, Castellón (Spain); Chacón, Pablo [Dept. Biological Chemical Physics, Rocasolano Physics and Chemistry Institute, CSIC, Madrid (Spain); Davidović, Davor [Rudjer Bošković Institute, Centar za Informatiku i Računarstvo – CIR, Zagreb (Croatia); López-Blanco, José R. [Dept. Biological Chemical Physics, Rocasolano Physics and Chemistry Institute, CSIC, Madrid (Spain); Quintana-Ortí, Enrique S. [Depto. Ingeniería y Ciencia de Computadores, Universitat Jaume I, Castellón (Spain)

    2016-03-15

    We introduce a new iterative Krylov subspace-based eigensolver for the simulation of macromolecular motions on desktop multithreaded platforms equipped with multicore processors and, possibly, a graphics accelerator (GPU). The method consists of two stages, with the original problem first reduced into a simpler band-structured form by means of a high-performance compute-intensive procedure. This is followed by a memory-intensive but low-cost Krylov iteration, which is off-loaded to be computed on the GPU by means of an efficient data-parallel kernel. The experimental results reveal the performance of the new eigensolver. Concretely, when applied to the simulation of macromolecules with a few thousand degrees of freedom, and when the number of eigenpairs to be computed is small to moderate, the new solver outperforms other methods implemented as part of high-performance numerical linear algebra packages for multithreaded architectures.

  8. Models of Communication for Multicore Processors

    DEFF Research Database (Denmark)

    Schoeberl, Martin; Sørensen, Rasmus Bo; Sparsø, Jens

    2015-01-01

    To efficiently use multicore processors we need to ensure that almost all data communication stays on chip, i.e., the bits moved between tasks executing on different processor cores do not leave the chip. Different forms of on-chip communication are supported by different hardware mechanisms, e.g., shared caches with cache coherency protocols, core-to-core networks-on-chip, and shared scratchpad memories. In this paper we explore the different hardware mechanisms for on-chip communication and how they support or favor different models of communication. Furthermore, we discuss the usability of the different models of communication for real-time systems.

  10. Multi-Core Processor Memory Contention Benchmark Analysis Case Study

    Science.gov (United States)

    Simon, Tyler; McGalliard, James

    2009-01-01

    Multi-core processors dominate current mainframe, server, and high performance computing (HPC) systems. This paper provides synthetic kernel and natural benchmark results from an HPC system at the NASA Goddard Space Flight Center that illustrate the performance impacts of multi-core (dual- and quad-core) vs. single core processor systems. Analysis of processor design, application source code, and synthetic and natural test results all indicate that multi-core processors can suffer from significant memory subsystem contention compared to similar single-core processors.

  11. A High Performance Multi-Core FPGA Implementation for 2D Pixel Clustering for the ATLAS Fast TracKer (FTK) Processor

    CERN Document Server

    Sotiropoulou, C-L; The ATLAS collaboration; Beretta, M; Gkaitatzis, S; Kordas, K; Nikolaidis, S; Petridou, C; Volpi, G

    2014-01-01

    The high-performance multi-core 2D pixel clustering FPGA implementation used for the input system of the ATLAS Fast TracKer (FTK) processor is presented. The input system for the FTK processor will receive data from the Pixel and micro-strip detectors' readout drivers (RODs) at 760 Gbps, the full rate of level-1 triggers. Clustering is required as a method to reduce the high rate of the received data before further processing, as well as to determine the cluster centroid for obtaining the best spatial measurement. Our implementation targets the pixel detectors and uses a 2D-clustering algorithm that takes advantage of a moving window technique to minimize the logic required for cluster identification. The design is fully generic, and the cluster detection window size can be adjusted to optimize the cluster identification process. The implementation can be parallelized by instantiating multiple cores to identify different clusters independently, thus exploiting more FPGA resources. This flexibility mak...
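
The essence of 2D pixel clustering, independent of the firmware's moving-window realization, is grouping adjacent hit pixels and computing each group's centroid. A software reference sketch (the FTK hardware implements a streaming, window-bounded variant of this idea):

```python
from collections import deque

def cluster_pixels(hits):
    """Group 8-connected pixel hits and return each cluster's centroid."""
    hits = set(hits)
    centroids = []
    while hits:
        queue = deque([hits.pop()])      # seed a new cluster from any hit
        cluster = []
        while queue:                     # flood-fill its connected neighbours
            x, y = queue.popleft()
            cluster.append((x, y))
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    n = (x + dx, y + dy)
                    if n in hits:
                        hits.remove(n)
                        queue.append(n)
        cx = sum(x for x, _ in cluster) / len(cluster)
        cy = sum(y for _, y in cluster) / len(cluster)
        centroids.append((cx, cy))
    return centroids
```

Because clusters are independent, multiple instances of this loop can run in parallel on disjoint data, mirroring the multi-core FPGA parallelization described above.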

  12. Message Passing on a Time-predictable Multicore Processor

    DEFF Research Database (Denmark)

    Sørensen, Rasmus Bo; Puffitsch, Wolfgang; Schoeberl, Martin

    2015-01-01

    Real-time systems need time-predictable computing platforms. For a multicore processor to be time-predictable, communication between processor cores needs to be time-predictable as well. This paper presents a time-predictable message-passing library for such a platform. We show how to build up...

  13. PERFORMANCE OF PRIVATE CACHE REPLACEMENT POLICIES FOR MULTICORE PROCESSORS

    Directory of Open Access Journals (Sweden)

    Matthew Lentz

    2014-07-01

    Full Text Available Multicore processors have become ubiquitous, both in general-purpose and special-purpose applications. With the number of transistors in a chip continuing to increase, the number of cores in a processor is also expected to increase. Cache replacement policy is an important design parameter of a cache hierarchy. As most processor designs have become multicore, there is a need to study cache replacement policies for multi-core systems. Previous studies have focused on the shared levels of the multicore cache hierarchy. In this study, we focus on the top level of the hierarchy, which bears the brunt of the memory requests emanating from each processor core. We measure the miss rates of various cache replacement policies, as the number of cores is steadily increased from 1 to 16. The study was done by modifying the publicly available SESC simulator, which models in detail a multicore processor with a multilevel cache hierarchy. Our experimental results show that for the private L1 caches, the LRU (Least Recently Used replacement policy outperforms all of the other replacement policies. This is in contrast to what was observed in previous studies for the shared L2 cache. The results presented in this paper are useful for hardware designers seeking to optimize their cache designs or program code.
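
The miss-rate comparison at the heart of such a study can be reproduced in miniature with a fully associative cache model. The trace below is made up for illustration; it shows LRU beating FIFO by keeping the recently reused block resident:

```python
from collections import OrderedDict, deque

def lru_misses(trace, capacity):
    """Miss count for a fully associative cache with LRU replacement."""
    cache, misses = OrderedDict(), 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)      # refresh recency on a hit
        else:
            misses += 1
            if len(cache) == capacity:
                cache.popitem(last=False) # evict the least recently used block
            cache[block] = None
    return misses

def fifo_misses(trace, capacity):
    """Miss count for the same cache with FIFO replacement."""
    cache, order, misses = set(), deque(), 0
    for block in trace:
        if block not in cache:
            misses += 1
            if len(cache) == capacity:
                cache.remove(order.popleft())  # evict the oldest insertion
            cache.add(block)
            order.append(block)
    return misses

TRACE = [1, 2, 3, 1, 4, 1, 2, 5, 1]   # block 1 is reused repeatedly
```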

  14. COTS Multicore Processors in Avionics Systems: Challenges and Solutions

    Science.gov (United States)

    2015-01-06

    COTS Multicore Processors in Avionics Systems: Challenges and Solutions. By Dionisio de Niz, Bjorn Andersson, and Lutz Wrage (dionisio@sei.cmu.edu). Platforms discussed include the NVIDIA Tegra K1 and, for avionics and defense, rugged Intel i7 single-board computers and the Freescale P4080 8-core CPU; shared hardware, in particular multicore memory, is a central concern.

  15. Concurrent and Accurate Short Read Mapping on Multicore Processors.

    Science.gov (United States)

    Martínez, Héctor; Tárraga, Joaquín; Medina, Ignacio; Barrachina, Sergio; Castillo, Maribel; Dopazo, Joaquín; Quintana-Ortí, Enrique S

    2015-01-01

    We introduce a parallel aligner with a work-flow organization for fast and accurate mapping of RNA sequences on servers equipped with multicore processors. Our software, HPG Aligner SA (an open-source application, available at http://www.opencb.org), exploits a suffix array to rapidly map a large fraction of the RNA fragments (reads), and leverages the accuracy of the Smith-Waterman algorithm to deal with conflictive reads. The aligner is enhanced with a careful strategy to detect splice junctions based on an adaptive division of RNA reads into small segments (or seeds), which are then mapped onto a number of candidate alignment locations, providing crucial information for the successful alignment of the complete reads. The experimental results on a platform with Intel multicore technology report the parallel performance of HPG Aligner SA on RNA reads of 100-400 nucleotides, which excels in execution time/sensitivity compared with state-of-the-art aligners such as TopHat 2+Bowtie 2, MapSplice, and STAR.
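
The Smith-Waterman step used for conflictive reads is a local-alignment dynamic program. A minimal score-only version (illustrative scoring parameters, no traceback, linear gap penalty):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between strings a and b."""
    score = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: a cell never goes below zero.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + s,   # extend a match/mismatch
                              score[i - 1][j] + gap,     # gap in b
                              score[i][j - 1] + gap)     # gap in a
            best = max(best, score[i][j])
    return best
```

The quadratic cost of this kernel is exactly why the aligner reserves it for the minority of reads the fast suffix-array path cannot place.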

  16. Speedup bioinformatics applications on multicore-based processor using vectorizing and multithreading strategies.

    Science.gov (United States)

    Chaichoompu, Kridsadakorn; Kittitornkun, Surin; Tongsima, Sissades

    2007-12-30

    Many computationally intensive bioinformatics software packages, such as multiple sequence alignment and population structure analysis, written in C/C++ are not multicore-aware. A multicore processor is an emerging CPU technology that combines two or more independent processors into a single package. The Single Instruction Multiple Data-stream (SIMD) paradigm is heavily utilized in this class of processors. Nevertheless, most popular compilers, including Microsoft Visual C/C++ 6.0 and the x86 GNU C compiler gcc, do not automatically create SIMD code that can fully utilize the advances of these processors. To harness the power of the new multicore architecture, certain compiler techniques must be considered. This paper presents a generic compiling strategy to assist the compiler in improving the performance of bioinformatics applications written in C/C++. The proposed framework contains two main steps: multithreading and vectorizing strategies. After following the strategies, an application can achieve higher speedup by taking advantage of the multicore architecture. Because the interconnection network among multiple cores is extremely fast, the proposed optimization may be more appropriate than parallelization on a small cluster computer, which has larger network latency and lower bandwidth.
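
The multithreading half of such a strategy reduces to splitting an array across worker threads and combining partial results. A Python sketch of that decomposition (in C/C++ each chunk would additionally be vectorized with SIMD; CPython threads illustrate the structure rather than the speedup):

```python
from concurrent.futures import ThreadPoolExecutor

def chunked_sum(data, workers=4):
    """Split data into per-thread chunks and reduce the partial sums."""
    n = len(data)
    bounds = [(i * n // workers, (i + 1) * n // workers) for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker reduces one contiguous chunk; the main thread combines them.
        partials = pool.map(lambda se: sum(data[se[0]:se[1]]), bounds)
    return sum(partials)
```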

  17. Multi Microkernel Operating Systems for Multi-Core Processors

    Directory of Open Access Journals (Sweden)

    Rami Matarneh

    2009-01-01

    Full Text Available Problem statement: Amid the huge development in the processor industry, manufacturers responding to the increasing demand for high-speed processors achieved the required speeds, but the industry then ran into problems with no easy solution, such as complexity, hard management and large energy consumption. These problems forced manufacturers to stop focusing on increasing the speed of processors and to turn toward parallel processing to increase performance. This eventually produced multi-core processors with high performance, if used properly. Unfortunately, these processors are still not used as they should be, because of a lack of support from operating systems and software applications. Approach: Based on the assumption that a single-kernel operating system is not enough to manage multi-core processors, we rethink the construction of the operating system as a multi-kernel design, in which one kernel serves as the master and the others serve as slaves. Results: Theoretically, the proposed model showed that it can do much better than existing models, because it supports single-threaded and multi-threaded processing at the same time; in addition, it can make better use of multi-core processors because it divides the load almost equally between the cores and the kernels, which leads to a significant improvement in the performance of the operating system. Conclusion: The software industry needs to step outside the classical framework to keep pace with hardware development. This can be achieved by rethinking how operating systems and software are built, with new and innovative methodologies, since current operating system theory is no longer capable of meeting future aspirations.

  18. Interactive high-resolution isosurface ray casting on multicore processors.

    Science.gov (United States)

    Wang, Qin; JaJa, Joseph

    2008-01-01

    We present a new method for the interactive rendering of isosurfaces using ray casting on multi-core processors. This method consists of a combination of an object-order traversal that coarsely identifies possible candidate 3D data blocks for each small set of contiguous pixels, and an isosurface ray casting strategy tailored for the resulting limited-size lists of candidate 3D data blocks. While static screen partitioning is widely used in the literature, our scheme performs dynamic allocation of groups of ray casting tasks to ensure almost equal loads among the different threads running on multi-cores while maintaining spatial locality. We also make careful use of the memory management environment commonly present in multi-core processors. We test our system on a two-processor Clovertown platform, each processor being a quad-core 1.86-GHz Intel Xeon, for a number of widely different benchmarks. The detailed experimental results show that our system is efficient and scalable, and achieves high cache performance and excellent load balancing, resulting in an overall performance that is superior to any of the previous algorithms. In fact, we achieve interactive isosurface rendering on a 1024 × 1024 screen for all the datasets tested, up to the maximum size of the main memory of our platform.

  19. Hardware Synchronization for Embedded Multi-Core Processors

    DEFF Research Database (Denmark)

    Stoif, Christian; Schoeberl, Martin; Liccardi, Benito

    2011-01-01

    Multi-core processors are about to conquer embedded systems — it is not the question of whether they are coming but how the architectures of the microcontrollers should look with respect to the strict requirements in the field. We present the step from one to multiple cores in this paper, establi...

  20. Effective Utilization of Multicore Processor for Unified Threat Management Functions

    Directory of Open Access Journals (Sweden)

    Radhakrishnan Shanmugasundaram

    2012-01-01

    Full Text Available Problem statement: Multicore and multithreaded CPUs have become the new approach to increasing the performance of processor-based systems. Numerous applications benefit from the use of multiple cores. Unified threat management is one such application, with multiple functions to be implemented at high speed. Increasing system performance by knowing the nature of the functionality, and effectively allocating multiple processors to each of the functions, warrants detailed experimentation. In this study, some of the functions of unified threat management are implemented using multiple processors for each function. Approach: The evaluation was conducted on a SunFire T1000 server with a Sun UltraSPARC T1 multicore processor. OpenMP parallelization methods are used to schedule the logical CPUs for the parallelized application. Results: The execution time of the implemented UTM functions was analyzed to arrive at an effective allocation and parallelization methodology that depends on the hardware and the workload. Conclusion/Recommendations: Based on the analysis, parallelization methods for the implemented UTM functions are suggested.
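
The allocation idea can be sketched as independent UTM stages handed to separate workers. The function names and packet format below are hypothetical stand-ins, and Python threads stand in for the OpenMP-scheduled logical CPUs:

```python
from concurrent.futures import ThreadPoolExecutor

def firewall_filter(packets):
    """Hypothetical stage: drop packets aimed at blocked ports."""
    return [p for p in packets if p["port"] not in {23, 445}]

def signature_scan(packets):
    """Hypothetical stage: flag payloads containing a known signature."""
    return [p for p in packets if b"evil" in p["payload"]]

def run_utm(packets):
    """Run the independent UTM functions concurrently, one worker per function,
    mirroring the idea of dedicating processors to functions."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        allowed = pool.submit(firewall_filter, packets)
        alerts = pool.submit(signature_scan, packets)
        return allowed.result(), alerts.result()
```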

  1. State-based Communication on Time-predictable Multicore Processors

    DEFF Research Database (Denmark)

    Sørensen, Rasmus Bo; Schoeberl, Martin; Sparsø, Jens

    2016-01-01

    Some real-time systems use a form of task-to-task communication called state-based or sample-based communication that does not impose any flow control among the communicating tasks. The concept is similar to a shared variable, where a reader may read the same value multiple times or may not read a given value at all. This paper explores time-predictable implementations of state-based communication in network-on-chip based multicore platforms through five algorithms. With the presented analysis of the implemented algorithms, the communicating tasks of one core can be scheduled independently of tasks on other cores. Assuming a specific time-predictable multicore processor, we evaluate how the read and write primitives of the five algorithms contribute to the worst-case execution time of the communicating tasks. Each of the five algorithms has specific capabilities that make them suitable...
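
A common way to realize such sample-based semantics is a double buffer: the writer fills the buffer the reader is not indexing and then flips a pointer, so neither side ever blocks. A CPython sketch (atomicity here leans on the GIL; this is not one of the paper's five algorithms, and a time-predictable implementation must reason explicitly about memory ordering and reader/writer overlap):

```python
class StateVariable:
    """Sample-based shared variable: a reader always gets the latest published
    sample; a value may be read twice or skipped entirely (no flow control)."""

    def __init__(self, initial):
        self._buffers = [initial, initial]
        self._latest = 0            # index of the newest complete sample

    def write(self, value):
        idx = 1 - self._latest      # fill the buffer the reader is not using
        self._buffers[idx] = value
        self._latest = idx          # publish with a single index update

    def read(self):
        return self._buffers[self._latest]
```

A reader polling `read()` between two writes sees the same value twice; if two writes happen between reads, the earlier value is skipped, which is exactly the shared-variable behaviour described above.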

  2. A fast location mechanism for memory data errors in multi-core processor verification

    Institute of Scientific and Technical Information of China (English)

    周宏伟; 邓让钰; 李永进; 晏小波; 窦强

    2012-01-01

    A fast fault location mechanism (FFLM) for memory data errors, targeting the functional verification of a self-designed CMP-16 multi-core processor's memory system, is proposed and implemented. FFLM builds a multi-port golden memory model on a hardware emulation accelerator. It monitors the memory-access packets exchanged between the memory system and the processor cores during emulation, compares in real time the data read from the memory system under test with the corresponding data in the golden model, and detects an error at the very moment wrong data is sent from the memory system to a processor core. Compared with traditional approaches, FFLM offers fast emulation speed, low hardware cost, and short error-location time. Statistics from emulating the self-designed CMP-16 multi-core processor show that FFLM speeds up the location of data errors in the memory system by 6.5 times on average.

  3. Real-Time Adaptive Lossless Hyperspectral Image Compression using CCSDS on Parallel GPGPU and Multicore Processor Systems

    Science.gov (United States)

    Hopson, Ben; Benkrid, Khaled; Keymeulen, Didier; Aranki, Nazeeh; Klimesh, Matt; Kiely, Aaron

    2012-01-01

    The proposed CCSDS (Consultative Committee for Space Data Systems) Lossless Hyperspectral Image Compression Algorithm was designed to facilitate a fast hardware implementation. This paper analyses that algorithm with regard to available parallelism and describes fast parallel implementations in software for GPGPU and Multicore CPU architectures. We show that careful software implementation, using hardware acceleration in the form of GPGPUs or even just multicore processors, can exceed the performance of existing hardware and software implementations by up to 11x and break the real-time barrier for the first time for a typical test application.

  5. Photonic-Networks-on-Chip for High Performance Radiation Survivable Multi-Core Processor Systems

    Science.gov (United States)

    2013-12-01

    Photonic-Networks-on-Chip for High Performance Radiation Survivable Multi-Core Processor Systems (report TR-14-7, contract DTRA01-03-D-0026; Prof. Luke Lester and Prof. Ganesh). Approved for public release; distribution is unlimited. The University of New Mexico has undertaken a study to determine the effects of radiation on Quantum Dot Photonic

  6. Energy Efficiency of a Multi-Core Processor by Tag Reduction

    Institute of Scientific and Technical Information of China (English)

    Long Zheng; Mian-Xiong Dong; Kaoru Ota; Hai Jin; Song Guo; Jun Ma

    2011-01-01

    We consider the energy saving problem for caches on a multi-core processor. In previous research on low-power processors, there are various methods to reduce power dissipation; tag reduction is one of them. This paper extends the tag reduction technique on a single-core processor to a multi-core processor and investigates the potential of energy saving for multi-core processors. We formulate our approach as an equivalent problem, which is to find an assignment of the whole set of instruction pages in the physical memory to a set of cores such that the tag-reduction conflicts for each core can be mostly avoided or reduced. We then propose three algorithms using different heuristics for this assignment problem. We provide convincing experimental results by collecting experimental data from a real operating system instead of the traditional way of using a processor simulator that cannot simulate operating system functions and the full memory hierarchy. Experimental results show that our proposed algorithms can save total energy up to 83.93% on an 8-core processor and 76.16% on a 4-core processor on average, compared to the case where tag reduction is not used. They also significantly outperform the tag-reduction-based algorithm on a single-core processor.
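
The flavour of this assignment problem can be shown with a toy greedy heuristic (a simplification invented here, not one of the paper's three algorithms): tag reduction pays off when pages sharing a reduced tag land on the same core, so each page goes to a core that already holds its tag, and otherwise to the core currently tracking the fewest tags.

```python
def assign_pages(page_tags, n_cores):
    """page_tags: the reduced tag of each instruction page, in layout order.
    Returns a per-page core assignment and the total tag count across cores
    (fewer distinct tags per core means fewer tag-reduction conflicts)."""
    core_tags = [set() for _ in range(n_cores)]
    assignment = []
    for tag in page_tags:
        owners = [i for i, tags in enumerate(core_tags) if tag in tags]
        if owners:
            core = owners[0]                  # reuse a core that holds this tag
        else:                                 # otherwise pick the least-loaded core
            core = min(range(n_cores), key=lambda i: len(core_tags[i]))
            core_tags[core].add(tag)
        assignment.append(core)
    return assignment, sum(len(tags) for tags in core_tags)
```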

  7. Fast, Massively Parallel Data Processors

    Science.gov (United States)

    Heaton, Robert A.; Blevins, Donald W.; Davis, ED

    1994-01-01

    Proposed fast, massively parallel data processor contains an 8 × 16 array of processing elements with an efficient interconnection scheme and options for flexible local control. Processing elements communicate with each other on an "X" interconnection grid and with external memory via a high-capacity input/output bus. This approach to conditional operation nearly doubles the speed of various arithmetic operations.

  8. LOW COMPLEXITY CONSTRAINTS FOR ENERGY AND PERFORMANCE MANAGEMENT OF HETEROGENEOUS MULTICORE PROCESSORS USING DYNAMIC OPTIMIZATION

    Directory of Open Access Journals (Sweden)

    A. S. Radhamani

    2014-01-01

    Full Text Available Optimization in a multicore processor environment is significant for real-world dynamic applications, as it is crucial to find and track changes effectively over time, which requires an optimization algorithm. On massively parallel multicore architectures, Constraint-based Bacterial Foraging Particle Swarm Optimization (CBFPSO) scheduling can, like other population-based metaheuristics, be implemented effectively. In this study we discuss possible approaches to parallelizing CBFPSO on a multicore system; different constraints for exploiting parallelism are explored and evaluated. Owing to their ability to keep a good balance between convergence and maintenance, such algorithms have attracted more and more attention in recent years for parallel-architecture optimization in real-world applications. To tackle the challenges of parallel-architecture optimization, several strategies have been proposed to enhance the performance of Particle Swarm Optimization (PSO), with success on various multicore parallel-architecture optimization problems; still, some issues in multicore architectures require careful analysis. In this study, a new Constraint-based Bacterial Foraging Particle Swarm Optimization (CBFPSO) scheduling for multicore architectures is proposed, which updates the velocity and position using two bacterial behaviours, i.e., reproduction and elimination-dispersal. The performance of CBFPSO is compared with simulation results for a GA, and the comparison shows that the proposed algorithm performs well on almost all types of cores relative to the GA with respect to completion time and energy consumption.
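
For reference, the canonical PSO velocity/position update that CBFPSO builds on (the bacterial reproduction and elimination-dispersal behaviours would replace or augment this step) can be written, for one-dimensional particles, as:

```python
import random

def pso_step(positions, velocities, pbest, gbest, w=0.7, c1=1.5, c2=1.5):
    """One PSO iteration: inertia plus attraction toward each particle's
    personal best (pbest) and the swarm's global best (gbest)."""
    for i in range(len(positions)):
        r1, r2 = random.random(), random.random()
        velocities[i] = (w * velocities[i]
                         + c1 * r1 * (pbest[i] - positions[i])
                         + c2 * r2 * (gbest - positions[i]))
        positions[i] += velocities[i]
    return positions, velocities
```

With a particle already sitting at both its personal and the global best, the attraction terms vanish and only inertia remains, so its velocity simply decays by the factor w.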

  9. Multi-core Processors based Network Intrusion Detection Method

    Directory of Open Access Journals (Sweden)

    Ziqian Wan

    2012-09-01

    It is becoming increasingly hard to build an intrusion detection system (IDS) because of higher traffic throughput and the rising sophistication of attacks. Scale will be an important issue to address in the intrusion detection area. For hardware, tomorrow's performance gains will come from multi-core architectures in which a number of CPUs execute concurrently. In this work we take advantage of multi-core processors' full power for intrusion detection. We present an intrusion detection system based on the Snort open-source IDS that exploits the computational power of a MIPS multi-core architecture to offload the costly pattern-matching operations from the CPU and thus increase the system's processing throughput. A preliminary experiment demonstrates the potential of this system; the results indicate that this method can effectively speed up intrusion detection systems.

  10. A lock circuit for a multi-core processor

    DEFF Research Database (Denmark)

    2015-01-01

    An integrated circuit comprising multiple processor cores and a lock circuit that comprises a queue register with respective bits set or reset via respective connections dedicated to respective processor cores, whereby the queue register identifies those among the multiple processor cores that...
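A software model of such a queue-register lock might look as follows. The class name, the method names, and the grant policy (lowest-numbered waiting core wins) are illustrative assumptions for the sketch, not details of the patented circuit.

```python
class QueueRegisterLock:
    """Toy software model of a hardware lock with a queue register:
    each core sets or resets its own bit via a dedicated connection,
    and the register identifies which cores are currently waiting."""

    def __init__(self, n_cores):
        self.n_cores = n_cores
        self.queue = 0  # bit i set => core i is requesting the lock

    def request(self, core):
        self.queue |= (1 << core)        # set this core's bit

    def release(self, core):
        self.queue &= ~(1 << core)       # clear this core's bit

    def waiting(self):
        """Cores whose request bit is set."""
        return [i for i in range(self.n_cores) if self.queue >> i & 1]

    def owner(self):
        """Assumed grant policy: lowest set bit wins; -1 if free."""
        return (self.queue & -self.queue).bit_length() - 1 if self.queue else -1

lock = QueueRegisterLock(4)
lock.request(2)
lock.request(0)
# owner() -> 0 (lowest waiting core), waiting() -> [0, 2]
lock.release(0)
# owner() -> 2
```

The point of the hardware design is that each core touches only its own bit, so no shared read-modify-write is needed; the model above mimics that by giving each core a dedicated bit position.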

  11. Scalable Parallelization of Skyline Computation for Multi-core Processors

    DEFF Research Database (Denmark)

    Chester, Sean; Sidlauskas, Darius; Assent, Ira

    2015-01-01

    The skyline is an important query operator for multi-criteria decision making. It reduces a dataset to only those points that offer optimal trade-offs of dimensions. In general, it is very expensive to compute. Recently, multi-core CPU algorithms have been proposed to accelerate the computation o...
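The dominance test at the heart of skyline computation is compact. A minimal sequential sketch (assuming smaller-is-better in every dimension; the paper's contribution is parallelising this dominance-testing work across cores, which is not shown here) is:

```python
def dominates(p, q):
    """p dominates q if p is at least as good in every dimension and
    strictly better in at least one (smaller is better here)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """Naive O(n^2) skyline: keep the points no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

pts = [(1, 4), (2, 2), (4, 1), (3, 3), (5, 5)]
# (3, 3) is dominated by (2, 2); (5, 5) is dominated by every other point
result = skyline(pts)   # -> [(1, 4), (2, 2), (4, 1)]
```

Each point's membership test is independent of the others, which is why the computation parallelises naturally over a multi-core CPU.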

  12. Adaptive data-driven parallelization of multi-view video coding on multi-core processor

    Institute of Scientific and Technical Information of China (English)

    PANG Yi; HU WeiDong; SUN LiFeng; YANG ShiQiang

    2009-01-01

    Multi-view video coding (MVC) comprises rich 3D information and is widely used in new visual media such as 3DTV and free-viewpoint TV (FTV). However, even with mainstream computer manufacturers migrating to multi-core processors, the huge computational requirement of MVC currently prohibits its wide use in consumer markets. In this paper, we demonstrate the design and implementation of the first parallel MVC system on the Cell Broadband Engine™ processor, a state-of-the-art multi-core processor. We propose a task-dispatching algorithm that is adaptive and data-driven at the frame level for MVC, and implement a parallel multi-view video decoder with a modified H.264/AVC codec on a real machine. This approach provides scalable speedup (up to 16 times on sixteen cores) through proper local store management, utilization of code locality and SIMD improvements. Decoding speed, speedup and utilization rate of cores are given in the experimental results.

  13. Mixed-mode implementation of PETSc for scalable linear algebra on multi-core processors

    CERN Document Server

    Weiland, Michele; Gorman, Gerard; Kramer, Stephan; Parsons, Mark; Southern, James

    2012-01-01

    With multi-core processors a ubiquitous building block of modern supercomputers, it is now past time to enable applications to embrace these developments in processor design. To achieve exascale performance, applications will need ways of exploiting the new levels of parallelism that are exposed in modern high-performance computers. A typical approach to this is to use shared-memory programming techniques to best exploit multi-core nodes along with inter-node message passing. In this paper, we describe the addition of OpenMP threaded functionality to the PETSc library. We highlight some issues that hinder good performance of threaded applications on modern processors and describe how to negate them. The OpenMP branch of PETSc was benchmarked using matrices extracted from the Fluidity CFD application, which uses the library as its linear solver engine. The overall performance of the mixed-mode implementation is shown to be superior to that of the pure-MPI version.

  14. Tinuso: A processor architecture for a multi-core hardware simulation platform

    DEFF Research Database (Denmark)

    Schleuniger, Pascal; Karlsson, Sven

    2010-01-01

    Multi-core systems have the potential to improve performance, energy and cost properties of embedded systems, but also require new design methods and tools to take advantage of the new architectures. Due to the limited accuracy and performance of pure software simulators, we are working on a cycle-accurate hardware simulation platform. We have developed the Tinuso processor architecture for this platform. Tinuso is a processor architecture optimized for FPGA implementation. The instruction set makes use of predicated instructions and supports C/C++ and assembly language programming. It is designed to be easily extendable, to maintain the flexibility required for research on multi-core systems. Tinuso contains a co-processor interface to connect to a network interface; this interface allows for communication over an on-chip network. A clock frequency estimation study on a deeply pipelined Tinuso...

  15. Support for the Logical Execution Time Model on a Time-predictable Multicore Processor

    DEFF Research Database (Denmark)

    Kluge, Florian; Schoeberl, Martin; Ungerer, Theo

    2016-01-01

    The logical execution time (LET) model increases the compositionality of real-time task sets: removal or addition of tasks does not influence the communication behavior of other tasks. In this work, we extend a multicore operating system running on a time-predictable multicore processor to support the LET model. For communication between tasks we use message passing on a time-predictable network-on-chip to avoid the bottleneck of shared memory. We report our experiences and present results on the costs in terms of memory and execution time.
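The LET rule itself is simple to state: a task logically reads its inputs at release time, and its output becomes visible to other tasks exactly at the end of its LET interval, regardless of when the task actually finishes. The sketch below is an illustrative simulation of that rule only (class and parameter names are invented; it is not the paper's OS code).

```python
class LetTask:
    """Minimal simulation of the logical-execution-time rule: inputs are
    sampled at release, and the output becomes visible exactly one
    period later, independent of actual completion time."""

    def __init__(self, period, func):
        self.period = period
        self.func = func
        self.pending = None   # (visible_at, value) awaiting publication
        self.output = None    # value other tasks are allowed to read

    def tick(self, t, inp):
        # publish a result whose LET interval has elapsed
        if self.pending and t >= self.pending[0]:
            self.output = self.pending[1]
            self.pending = None
        # at each release instant, sample the input and schedule the output
        if t % self.period == 0:
            self.pending = (t + self.period, self.func(inp))

task = LetTask(period=4, func=lambda x: x * 10)
trace = []
for t in range(9):
    task.tick(t, inp=t)
    trace.append(task.output)
# The job released at t=0 (input 0) becomes visible only at t=4;
# the job released at t=4 (input 4) becomes visible at t=8.
```

This fixed input/output timing is what makes task sets compositional: adding or removing other tasks cannot change when a task's values are read or published.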

  16. Early Student Support for Application of Advanced Multi-Core Processor Technologies to Oceanographic Research

    Science.gov (United States)

    2016-05-07

    [Fragmentary report extraction; recoverable text follows.] Grant number N00014-12-1-0298, "Early Student Support for Application of Advanced Multi-Core Processor Technologies to Oceanographic Research." Surviving passages discuss vehicle changes that can impact a deployment, survey the current state-of-the-art research in the areas described above, and describe reconfiguration based on resource availability, in which a device that is discovered to have failed or to be underperforming is removed from service.

  17. MGSim - simulation tools for multi-core processor architectures

    NARCIS (Netherlands)

    Lankamp, M.; Poss, R.; Yang, Q.; Fu, J.; Uddin, I.; Jesshope, C.R.

    2013-01-01

    MGSim is an open source discrete event simulator for on-chip hardware components, developed at the University of Amsterdam. It is intended to be a research and teaching vehicle to study the fine-grained hardware/software interactions on many-core and hardware multithreaded processors. It includes su

  18. Schedule refinement for homogeneous multi-core processors in the presence of manufacturing-caused heterogeneity

    Institute of Scientific and Technical Information of China (English)

    Zhi-xiang CHEN; Zhao-lin LI; Shan CAO; Fang WANG; Jie ZHOU

    2015-01-01

    Multi-core homogeneous processors have been widely used to deal with computation-intensive embedded applications. However, with the continuous downscaling of CMOS technology, within-die variations in the manufacturing process lead to a significant spread in the operating speeds of cores within homogeneous multi-core processors. Task scheduling approaches that do not consider such heterogeneity caused by within-die variations can lead to overly pessimistic results in terms of performance. To realize optimal performance according to the actual maximum clock frequencies at which cores can run, we present a heterogeneity-aware schedule refining (HASR) scheme that fully exploits the heterogeneities of homogeneous multi-core processors in embedded domains. We analyze and show how the actual maximum frequencies of cores are used to guide the scheduling. In the scheme, representative chip operating points are selected and the corresponding optimal schedules are generated as candidate schedules. During the booting of each chip, according to the actual maximum clock frequencies of the cores, one of the candidate schedules is bound to the chip to maximize performance. A set of applications was designed to evaluate the proposed scheme. Experimental results show that the proposed scheme improves performance by an average of 22.2% compared with the baseline schedule based on worst-case timing analysis. Compared with the conventional task scheduling approach based on the actual maximum clock frequencies, the proposed scheme also improves performance by up to 12%.
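The boot-time binding step can be pictured with a small model: a schedule's completion time is the slowest core's total work divided by that core's measured frequency, and the chip binds whichever candidate minimises it. The schedules, cycle counts, and frequencies below are invented for illustration; HASR's candidate generation is not reproduced.

```python
def makespan(assignment, task_cycles, core_freqs):
    """Completion time (s) of a static assignment {core: [task, ...]}
    when core c runs at its measured maximum frequency core_freqs[c]."""
    return max(sum(task_cycles[t] for t in tasks) / core_freqs[c]
               for c, tasks in assignment.items())

def bind_schedule(candidates, task_cycles, core_freqs):
    """At boot, bind the candidate schedule with the smallest makespan
    for this particular chip's actual core frequencies."""
    return min(candidates, key=lambda a: makespan(a, task_cycles, core_freqs))

cycles = {"t1": 8e6, "t2": 8e6, "t3": 4e6}          # cycles per task
cand_a = {0: ["t1", "t2"], 1: ["t3"]}               # loads core 0 heavily
cand_b = {0: ["t1"], 1: ["t2", "t3"]}               # loads core 1 heavily
fast0 = {0: 2.0e9, 1: 1.0e9}                        # within-die variation
# With core 0 the fast one, cand_a finishes in 8 ms versus 12 ms for cand_b,
# so a chip with these measured frequencies binds cand_a.
```

A chip whose variation went the other way (core 1 faster) would bind cand_b instead, which is the whole point of deferring the choice to boot time.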

  19. Dynamic Allocation of CPUs in Multicore Processor for Performance Improvement in Network Security Applications

    Directory of Open Access Journals (Sweden)

    Sudhakar Gummadi

    2011-01-01

    Problem statement: Multicore and multithreaded CPUs have become the new approach to increasing the performance of processor-based systems. Numerous applications benefit from the use of multiple cores. Increasing system performance by increasing the number of CPUs of the multicore processor for a given application warrants detailed experimentation. In this study, the results of experiments on dynamic allocation/deallocation of CPUs, based on workload conditions, for packet processing in a security application are analyzed and presented. Approach: The evaluation was conducted on a Sun Fire T1000 server with a Sun UltraSPARC T1 multicore processor. The OpenMP tasking feature is used for scheduling the logical CPUs for the parallelized application. Dynamic allocation of a CPU to a process is done depending on the workload characterization. Results: Execution time for packet processing was analyzed to arrive at an effective dynamic allocation methodology that is dependent on the hardware and the workload. Conclusion/Recommendations: Based on the analysis, a methodology for allocating the number of CPUs to the parallelized application is suggested.
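A workload-driven allocation rule of the kind the study experiments with can be sketched in a few lines: grow the number of CPUs given to packet processing when the queued work cannot be drained within a target latency, and shrink it otherwise. The function name, the queue model, and all numbers below are illustrative assumptions, not taken from the paper.

```python
import math

def workers_needed(backlog, service_rate, max_workers, target_latency=1.0):
    """Toy dynamic-allocation rule: number of CPUs needed to drain
    `backlog` queued packets within `target_latency` seconds, given
    that one CPU processes `service_rate` packets per second.
    Clamped to [1, max_workers]."""
    need = math.ceil(backlog / (service_rate * target_latency))
    return max(1, min(max_workers, need))

# e.g. 900 queued packets at 200 packets/s per CPU -> 5 CPUs (cap of 8)
cpus = workers_needed(900, 200, max_workers=8)
```

In an OpenMP-tasking implementation this value would steer how many worker tasks are kept active, with idle logical CPUs returned to the system.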

  20. Fast Forwarding with Network Processors

    OpenAIRE

    Lefèvre, Laurent; Lemoine, E.; Pham, C; Tourancheau, B.

    2003-01-01

    Forwarding is a mechanism found in many network operations. Although a regular workstation is able to perform forwarding operations, it still suffers from poor performance when compared to dedicated hardware machines. In this paper we study the possibility of using Network Processors (NPs) to improve the capability of regular workstations to forward data. We present a simple model and an experimental study demonstrating that even though NPs are less powerful than Host Processors (HPs) they ca...

  1. Real-time 3-D ultrasound scan conversion using a multicore processor.

    Science.gov (United States)

    Zhuang, Bo; Shamdasani, Vijay; Sikdar, Siddhartha; Managuli, Ravi; Kim, Yongmin

    2009-07-01

    Real-time 3-D ultrasound scan conversion (SC) in software has not been practical due to its high computation and I/O data handling requirements. In this paper, we describe software-based 3-D SC with high volume rates using a multicore processor, Cell. We have implemented both 3-D SC approaches: 1) the separable 3-D SC where two 2-D coordinate transformations in orthogonal planes are performed in sequence and 2) the direct 3-D SC where the coordinate transformation is directly handled in 3-D. One Cell processor can scan-convert a 192 x 192 x 192 16-bit volume at 87.8 volumes/s with the separable 3-D SC algorithm and 28 volumes/s with the direct 3-D SC algorithm.

  2. Parallel computing of discrete element method on multi-core processors

    Institute of Scientific and Technical Information of China (English)

    Yusuke Shigeto; Mikio Sakai

    2011-01-01

    This paper describes parallel simulation techniques for the discrete element method (DEM) on multi-core processors. Recently, multi-core CPU and GPU processors have attracted much attention for accelerating computer simulations in various fields. We propose a new algorithm for multi-thread parallel computation of DEM, which makes effective use of the available memory and accelerates the computation. This study shows that memory usage is drastically reduced by using this algorithm. To show the practical use of DEM in industry, a large-scale powder system is simulated with a complicated drive unit. We compared the performance of the simulation between the latest GPU and CPU processors with optimized programs for each processor. The results show that the difference in performance is not substantial when using either GPUs or CPUs with a multi-thread parallel algorithm. In addition, the DEM algorithm is shown to have high scalability in multi-thread parallel computation on a CPU.

  3. Energy Efficient Image/Video Data Transmission on Commercial Multi-Core Processors

    Directory of Open Access Journals (Sweden)

    Daihee Park

    2012-11-01

    In transmitting image/video data over Video Sensor Networks (VSNs), energy consumption must be minimized while maintaining high image/video quality. Although image/video compression is well known for its efficiency and usefulness in VSNs, the excessive costs associated with encoding computation and complexity still hinder its adoption for practical use. However, it is anticipated that high-performance handheld multi-core devices will be used as VSN processing nodes in the near future. In this paper, we propose a way to improve the energy efficiency of image and video compression with multi-core processors while maintaining image/video quality. We improve the compression efficiency at the algorithmic level, or derive the optimal parameters for a given machine-compression combination based on the trade-off between energy consumption and image/video quality. Based on experimental results, we confirm that the proposed approach can improve the energy efficiency of the straightforward approach by a factor of 2-5 without compromising image/video quality.

  4. Leakage-Aware Reallocation for Periodic Real-Time Tasks on Multicore Processors

    CERN Document Server

    Huang, Hongtao; Wang, Jijie; Lei, Siyu; Wu, Guowei

    2010-01-01

    It is an increasingly important issue to reduce the energy consumption of computing systems. In this paper, we consider partition based energy-aware scheduling of periodic real-time tasks on multicore processors. The scheduling exploits dynamic voltage scaling (DVS) and core sleep scheduling to reduce both dynamic and leakage energy consumption. If the overhead of core state switching is non-negligible, however, the performance of this scheduling strategy in terms of energy efficiency might degrade. To achieve further energy saving, we extend the static task scheduling with run-time task reallocation. The basic idea is to aggregate idle time among cores so that as many cores as possible could be put into sleep in a way that the overall energy consumption is reduced. Simulation results show that the proposed approach results in up to 20% energy saving over traditional leakage-aware DVS.
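The core reallocation idea, aggregating idle time onto as few cores as possible so the rest can sleep, can be illustrated with a bin-packing sketch. First-fit-decreasing is used here as an illustrative stand-in for the paper's run-time reallocation policy; the utilisation numbers are invented.

```python
def pack_tasks(utilizations, capacity=1.0):
    """Pack task utilisations onto as few unit-capacity cores as
    possible (first-fit decreasing), so the remaining cores can be put
    to sleep to cut leakage energy. Returns the number of active cores."""
    cores = []  # remaining capacity of each active core
    for u in sorted(utilizations, reverse=True):
        for i, free in enumerate(cores):
            if u <= free + 1e-9:      # fits on an already-active core
                cores[i] = free - u
                break
        else:
            cores.append(capacity - u)  # wake up one more core
    return len(cores)

# Eight tasks of 25% utilisation would leave eight cores mostly idle if
# spread out; packed, they fit on two cores and the rest can sleep.
active = pack_tasks([0.25] * 8)   # -> 2
```

In the paper's setting this packing must also weigh the core state-switching overhead, which a static count like this ignores; that is precisely why the authors make reallocation a run-time decision.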

  5. ESAIR: A Behavior-Based Robotic Software Architecture on Multi-Core Processor Platforms

    Directory of Open Access Journals (Sweden)

    Chin-Yuan Tseng

    2013-03-01

    This paper introduces an Embedded Software Architecture for Intelligent Robot systems (ESAIR) that addresses the issues of parallel thread execution on multi-core processor platforms. ESAIR provides a thread-scheduling interface that improves the execution performance of a robot system by assigning a dedicated core to a running thread on the fly and dynamically rescheduling the thread's priority. In the paper, we describe the object-oriented design and the control functions of ESAIR. The modular design of ESAIR helps improve software quality, reliability and scalability in research and real practice. We demonstrate the improvement by realizing ESAIR on an autonomous robot named AVATAR. AVATAR implements various human-robot interactions, including speech recognition, human following, face recognition, speaker identification, etc. With the support of ESAIR, AVATAR can integrate a comprehensive set of behaviors and peripherals with better resource utilization.

  6. Improving the performance of heterogeneous multi-core processors by modifying the cache coherence protocol

    Science.gov (United States)

    Fang, Juan; Hao, Xiaoting; Fan, Qingwen; Chang, Zeqing; Song, Shuying

    2017-05-01

    In a heterogeneous multi-core architecture, CPU and GPU processors are integrated on the same chip, which poses a new challenge for last-level cache (LLC) management. In this architecture, CPU and GPU applications execute concurrently and access the last-level cache. CPU and GPU have different memory-access characteristics and therefore differ in their sensitivity to LLC capacity. For many CPU applications, a reduced share of the LLC can lead to significant performance degradation; GPU applications, by contrast, can tolerate increased memory-access latency when there is sufficient thread-level parallelism. Taking the memory-latency tolerance of GPU programs into account, this paper presents a method that lets GPU applications access memory directly, leaving most of the LLC space for CPU applications; this improves the performance of CPU applications without affecting the performance of GPU applications. When the CPU application is cache-sensitive and the GPU application is insensitive to the cache, the overall performance of the system improves significantly.
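The routing decision the modified protocol makes can be stated as a tiny policy function. This is a toy restatement of the idea in the abstract, not the authors' protocol logic; the function and argument names are invented.

```python
def route_request(origin, cpu_cache_sensitive, gpu_latency_tolerant):
    """Toy cache-management policy: when the CPU workload is
    cache-sensitive and the GPU workload tolerates latency (thanks to
    thread-level parallelism), GPU requests bypass the shared LLC and
    go straight to memory, reserving LLC capacity for the CPU."""
    if origin == "gpu" and cpu_cache_sensitive and gpu_latency_tolerant:
        return "memory"   # bypass the LLC entirely
    return "llc"          # default: allocate in the shared LLC

# GPU traffic is diverted only when both conditions hold:
assert_route = route_request("gpu", cpu_cache_sensitive=True,
                             gpu_latency_tolerant=True)   # -> "memory"
```

Everything else, including CPU requests and GPU requests from latency-sensitive GPU workloads, still uses the LLC, so GPU performance is left intact.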

  7. MULTI-CORE AND OPTICAL PROCESSOR RELATED APPLICATIONS RESEARCH AT OAK RIDGE NATIONAL LABORATORY

    Energy Technology Data Exchange (ETDEWEB)

    Barhen, Jacob [ORNL; Kerekes, Ryan A [ORNL; ST Charles, Jesse Lee [ORNL; Buckner, Mark A [ORNL

    2008-01-01

    High-speed parallelization of common tasks holds great promise as a low-risk approach to achieving the significant increases in signal processing and computational performance required for next-generation innovations in reconfigurable radio systems. Researchers at the Oak Ridge National Laboratory have been exploiting the parallelization offered by this emerging technology and applying it to a variety of problems. This paper highlights recent experience with four different parallel processors applied to signal processing tasks directly relevant to SDR/CR waveforms. The first is the EnLight Optical Core Processor applied to matched filter (MF) correlation processing via fast Fourier transform (FFT) of broadband Doppler-sensitive waveforms (DSW) using active sonar arrays for target tracking. The second is the IBM CELL Broadband Engine applied to a 2-D discrete Fourier transform (DFT) kernel for image processing and frequency-domain processing. The third is the NVIDIA graphics processor applied to document feature clustering. EnLight Optical Core Processor: optical processing is inherently capable of high parallelism that can be translated to very high performance, low-power-dissipation computing. The EnLight 256 is a small form factor signal processing chip (5×5 cm²) with a digital optical core that is being developed by an Israeli startup company. As part of its evaluation of foreign technology, ORNL's Center for Engineering Science Advanced Research (CESAR) had access to precursor EnLight 64 Alpha hardware for a preliminary assessment of capabilities in terms of large Fourier transforms for matched filter banks and of applications related to Doppler-sensitive waveforms. This processor is optimized for array operations, which it performs in fixed-point arithmetic at a rate of 16 TeraOPS at 8-bit precision, approximately 1000 times faster than the fastest DSP available today. The optical core

  8. Computational Particle Dynamic Simulations on Multicore Processors (CPDMu) Final Report Phase I

    Energy Technology Data Exchange (ETDEWEB)

    Schmalz, Mark S

    2011-07-24

    Statement of Problem - Department of Energy has many legacy codes for simulation of computational particle dynamics and computational fluid dynamics applications that are designed to run on sequential processors and are not easily parallelized. Emerging high-performance computing architectures employ massively parallel multicore architectures (e.g., graphics processing units) to increase throughput. Parallelization of legacy simulation codes is a high priority, to achieve compatibility, efficiency, accuracy, and extensibility. General Statement of Solution - A legacy simulation application designed for implementation on mainly-sequential processors has been represented as a graph G. Mathematical transformations, applied to G, produce a graph representation G′ for a high-performance architecture. Key computational and data-movement kernels of the application were analyzed/optimized for parallel execution using the mapping G → G′, which can be performed semi-automatically. This approach is widely applicable to many types of high-performance computing systems, such as graphics processing units or clusters comprised of nodes that contain one or more such units. Phase I Accomplishments - Phase I research decomposed/profiled computational particle dynamics simulation code for rocket fuel combustion into low and high computational cost regions (respectively, mainly sequential and mainly parallel kernels), with analysis of space and time complexity. Using the research team's expertise in algorithm-to-architecture mappings, the high-cost kernels were transformed, parallelized, and implemented on Nvidia Fermi GPUs. Measured speedups (GPU with respect to single-core CPU) were approximately 20-32X for realistic model parameters, without final optimization. Error analysis showed no loss of computational accuracy. Commercial Applications and Other Benefits - The proposed research will constitute a breakthrough in solution of problems related to efficient

  9. The ATLAS fast tracker processor design

    CERN Document Server

    Volpi, Guido; Albicocco, Pietro; Alison, John; Ancu, Lucian Stefan; Anderson, James; Andari, Nansi; Andreani, Alessandro; Andreazza, Attilio; Annovi, Alberto; Antonelli, Mario; Asbah, Needa; Atkinson, Markus; Baines, J; Barberio, Elisabetta; Beccherle, Roberto; Beretta, Matteo; Biesuz, Nicolo Vladi; Blair, R E; Bogdan, Mircea; Boveia, Antonio; Britzger, Daniel; Bryant, Partick; Burghgrave, Blake; Calderini, Giovanni; Camplani, Alessandra; Cavaliere, Viviana; Cavasinni, Vincenzo; Chakraborty, Dhiman; Chang, Philip; Cheng, Yangyang; Citraro, Saverio; Citterio, Mauro; Crescioli, Francesco; Dawe, Noel; Dell'Orso, Mauro; Donati, Simone; Dondero, Paolo; Drake, G; Gadomski, Szymon; Gatta, Mauro; Gentsos, Christos; Giannetti, Paola; Gkaitatzis, Stamatios; Gramling, Johanna; Howarth, James William; Iizawa, Tomoya; Ilic, Nikolina; Jiang, Zihao; Kaji, Toshiaki; Kasten, Michael; Kawaguchi, Yoshimasa; Kim, Young Kee; Kimura, Naoki; Klimkovich, Tatsiana; Kolb, Mathis; Kordas, K; Krizka, Karol; Kubota, T; Lanza, Agostino; Li, Ho Ling; Liberali, Valentino; Lisovyi, Mykhailo; Liu, Lulu; Love, Jeremy; Luciano, Pierluigi; Luongo, Carmela; Magalotti, Daniel; Maznas, Ioannis; Meroni, Chiara; Mitani, Takashi; Nasimi, Hikmat; Negri, Andrea; Neroutsos, Panos; Neubauer, Mark; Nikolaidis, Spiridon; Okumura, Y; Pandini, Carlo; Petridou, Chariclia; Piendibene, Marco; Proudfoot, James; Rados, Petar Kevin; Roda, Chiara; Rossi, Enrico; Sakurai, Yuki; Sampsonidis, Dimitrios; Saxon, James; Schmitt, Stefan; Schoening, Andre; Shochet, Mel; Shoijaii, Jafar; Soltveit, Hans Kristian; Sotiropoulou, Calliope-Louisa; Stabile, Alberto; Swiatlowski, Maximilian J; Tang, Fukun; Taylor, Pierre Thor Elliot; Testa, Marianna; Tompkins, Lauren; Vercesi, V; Wang, Rui; Watari, Ryutaro; Zhang, Jianhong; Zeng, Jian Cong; Zou, Rui; Bertolucci, Federico

    2015-01-01

    The extended use of tracking information at the trigger level in the LHC is crucial for the trigger and data acquisition (TDAQ) system to fulfill its task. Precise and fast tracking is important to identify specific decay products of the Higgs boson or new phenomena, as well as to distinguish the contributions coming from the many collisions that occur at every bunch crossing. However, track reconstruction is among the most demanding tasks performed by the TDAQ computing farm; in fact, complete reconstruction at full Level-1 trigger accept rate (100 kHz) is not possible. In order to overcome this limitation, the ATLAS experiment is planning the installation of a dedicated processor, the Fast Tracker (FTK), which is aimed at achieving this goal. The FTK is a pipeline of high performance electronics, based on custom and commercial devices, which is expected to reconstruct, with high resolution, the trajectories of charged-particle tracks with a transverse momentum above 1 GeV, using the ATLAS inner tracker info...

  10. Investigation of hadron matter using lattice QCD and implementation of lattice QCD applications on heterogeneous multicore acceleration processors

    Energy Technology Data Exchange (ETDEWEB)

    Winter, Frank

    2011-07-01

    Observables relevant for the understanding of the structure of baryons were determined by means of Monte Carlo simulations of lattice Quantum Chromodynamics (QCD) using 2+1 dynamical quark flavours. Special emphasis was placed on how these observables change when flavour symmetry is broken, in comparison to choosing equal masses for the two light quarks and the strange quark. The first two moments of unpolarised, longitudinally polarised, and transversely polarised parton distribution functions were calculated for the nucleon and hyperons. Modern lattice QCD simulations require petaflop computing and beyond, a regime of computing power we are only just reaching today. Heterogeneous multicore computing is becoming increasingly important in high-performance computing and allows for deploying multiple types of processing elements within a single workflow. In this work, new design concepts were developed for an active library (QDP++) exploiting the compute power of a heterogeneous multicore processor (the IBM PowerXCell 8i). It was possible to run a QDP++ based physics application (Chroma) on an IBM BladeCenter QS22. (orig.)

  11. 3D Seismic Imaging through Reverse-Time Migration on Homogeneous and Heterogeneous Multi-Core Processors

    Directory of Open Access Journals (Sweden)

    Mauricio Araya-Polo

    2009-01-01

    Reverse-Time Migration (RTM) is a state-of-the-art technique in seismic acoustic imaging because of the quality and integrity of the images it provides. Oil and gas companies trust RTM with crucial decisions on multi-million-dollar drilling investments. But RTM requires vastly more computational power than its predecessor techniques, which has somewhat hindered its practical success. On the other hand, despite the promise of multi-core architectures to deliver unprecedented computational power, little attention has been devoted to mapping RTM efficiently onto multi-cores. In this paper, we present a mapping of the RTM computational kernel to the IBM Cell/B.E. processor that reaches close-to-optimal performance. The kernel proves to be memory-bound and achieves 98% utilization of the peak memory bandwidth. Our Cell/B.E. implementation outperforms a traditional processor (PowerPC 970MP) in terms of performance (a 15.0× speedup) and energy efficiency (a 10.0× increase in the GFlops/W delivered). It is also the fastest RTM implementation available, to the best of our knowledge. These results increase the practical usability of RTM, and the RTM-Cell/B.E. combination proves to be a strong competitor in the seismic arena.

  12. Analyzing the trade-off between multiple memory controllers and memory channels on multi-core processor performance

    Energy Technology Data Exchange (ETDEWEB)

    Sancho Pitarch, Jose Carlos [Los Alamos National Laboratory; Kerbyson, Darren [Los Alamos National Laboratory; Lang, Mike [Los Alamos National Laboratory

    2010-01-01

    Increasing the core count on current and future processors is posing critical challenges to the memory subsystem to efficiently handle concurrent memory requests. The current trend to cope with this challenge is to increase the number of memory channels available to the processor's memory controller. In this paper we investigate the effectiveness of this approach on the performance of parallel scientific applications. Specifically, we explore the trade-off between employing multiple memory channels per memory controller and using multiple memory controllers. Experiments conducted on two current state-of-the-art multicore processors, a 6-core AMD Istanbul and a 4-core Intel Nehalem-EP, for a wide range of production applications show that there is a diminishing return when increasing the number of memory channels per memory controller. In addition, we show that this performance degradation can be efficiently addressed by increasing the ratio of memory controllers to channels while keeping the number of memory channels constant. Significant performance improvements, up to 28%, can be achieved in this scheme in the case of two memory controllers, each with one channel, compared with one controller with two memory channels.

  13. Leveraging shared caches for parallel temporal blocking of stencil codes on multicore processors and clusters

    CERN Document Server

    Wittmann, Markus; Treibig, Jan; Wellein, Gerhard

    2010-01-01

    Bandwidth-starved multicore chips have become ubiquitous. It is well known that the performance of stencil codes can be improved by temporal blocking, lessening the pressure on the memory interface. We introduce a new pipelined approach that makes explicit use of shared caches in multicore environments and minimizes synchronization and boundary overhead. Benchmark results are presented for three current x86-based microprocessors, showing clearly that our optimization works best on designs with high-speed shared caches and low memory bandwidth per core. We furthermore demonstrate that simple bandwidth-based performance models are inaccurate for this kind of algorithm and employ a more elaborate, synthetic modeling procedure. Finally we show that temporal blocking can be employed successfully in a hybrid shared/distributed-memory environment, albeit with limited benefit at strong scaling.
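The idea behind temporal blocking is that several time steps are applied to a cache-resident tile before moving on, instead of streaming the whole array through memory once per step. The serial 1-D sketch below shows only the correctness structure (a halo of depth k on each side of a tile lets all k steps be applied locally); the paper's contribution, pipelining tiles across cores that share a cache, is not reproduced, and the block and halo sizes are illustrative.

```python
def sweep(a):
    """One Jacobi-style 3-point averaging sweep with fixed boundaries."""
    return ([a[0]]
            + [(a[i - 1] + a[i] + a[i + 1]) / 3 for i in range(1, len(a) - 1)]
            + [a[-1]])

def naive(a, k):
    """Reference: k full sweeps, each streaming the whole array."""
    for _ in range(k):
        a = sweep(a)
    return a

def temporally_blocked(a, k, block=4):
    """Apply k sweeps tile by tile. Each tile carries a halo of k cells
    on either side, so all k time steps can be computed from the tile
    alone; only the central `block` cells are then committed."""
    n = len(a)
    out = list(a)
    for start in range(0, n, block):
        lo, hi = max(0, start - k), min(n, start + block + k)
        tile = a[lo:hi]
        for _ in range(k):
            tile = sweep(tile)
        out[start:start + block] = tile[start - lo:start - lo + block]
    return out
```

The blocked version touches main memory roughly once instead of k times, which is exactly the pressure relief the abstract describes for bandwidth-starved chips; the halo cells are the redundant work traded for that locality.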

  14. A Parallel and Concurrent Implementation of Lin-Kernighan Heuristic (LKH-2 for Solving Traveling Salesman Problem for Multi-Core Processors using SPC3 Programming Model

    Directory of Open Access Journals (Sweden)

    Muhammad Ali Ismail

    2011-08-01

    With the arrival of multi-cores, every processor now has built-in parallel computational power, which can be fully utilized only if the program in execution is written accordingly. This study is part of on-going research on the design of a new parallel programming model for multi-core processors. In this paper we present a combined parallel and concurrent implementation of the Lin-Kernighan heuristic (LKH-2) for solving the Travelling Salesman Problem (TSP) using a newly developed parallel programming model, SPC3 PM, for general-purpose multi-core processors. This implementation is found to be simple, highly efficient, scalable and less time-consuming compared to existing serial LKH-2 implementations in a multi-core processing environment. We have tested our parallel implementation of LKH-2 with medium and large TSP instances from TSPLIB, and for all these tests our proposed approach shows much improved performance and scalability.

  15. Design of Clustered NoC for Multi-core Processor

    Institute of Scientific and Technical Information of China (English)

    尤凯迪; 肖瑞瑾; 权衡; 虞志益

    2011-01-01

    This paper presents a novel clustered Network-on-Chip (NoC) architecture for multi-core processors. The proposed architecture is a cluster array organized as a two-dimensional mesh. Each cluster includes three processors, one Direct Memory Access (DMA) unit and one cluster-shared memory. A multi-core processor using this NoC architecture can obtain high communication efficiency and memory utilization. We design a four-cluster prototype system and implement a 3 780-point Fast Fourier Transform (FFT) on it. In the FFT application, memory utilization increases to 79.5%.

  16. Fast 2D-DCT implementations for VLIW processors

    OpenAIRE

    Sohm, OP; Canagarajah, CN; Bull, DR

    1999-01-01

    This paper analyzes various fast 2D-DCT algorithms regarding their suitability for VLIW processors. Operations for truncation or rounding, which are usually neglected in proposals for fast algorithms, have also been taken into consideration. Loeffler's algorithm with parallel multiplications was found to be most suitable due to its parallel structure.
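
The fast 2D-DCT schemes compared in such studies all build on the separability of the transform: a 2D DCT is a 1D DCT applied to every row followed by a 1D DCT applied to every column (Loeffler's factorization then reduces the cost of each 1D pass). A minimal sketch, using the plain DCT-II double sum rather than any fast factorization and ignoring fixed-point rounding:

```python
import math

def dct1d(x):
    """Direct (unnormalized) DCT-II of a 1D sequence."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi / N * (n + 0.5) * k) for n in range(N))
            for k in range(N)]

def dct2d_separable(a):
    """Row-column 2D DCT: 1D DCT on every row, then on every column."""
    rows = [dct1d(r) for r in a]
    cols = [dct1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]     # transpose back to row-major

def dct2d_direct(a):
    """Direct double-sum 2D DCT-II, kept only as a reference check."""
    M, N = len(a), len(a[0])
    return [[sum(a[m][n]
                 * math.cos(math.pi / M * (m + 0.5) * p)
                 * math.cos(math.pi / N * (n + 0.5) * q)
                 for m in range(M) for n in range(N))
             for q in range(N)] for p in range(M)]
```

The separable form turns an O(M²N²) computation into 2MN one-dimensional transforms, which is the structure a VLIW core can then schedule in parallel.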

  17. Study of Various Factors Affecting Performance of Multi-Core Processors

    Directory of Open Access Journals (Sweden)

    Nitin Chaturvedi

    2013-07-01

    Full Text Available Advances in integrated circuit processing allow for more microprocessor design options. As Chip Multiprocessor (CMP) systems become the predominant topology for leading microprocessors, critical components of the system are now integrated on a single chip. This enables sharing of computation resources that was not previously possible. In addition, the virtualization of these computation resources exposes the system to a mix of diverse and competing workloads. On-chip cache memory is a resource of primary concern, as it can be dominant in controlling overall throughput. This paper presents an analysis of various parameters affecting the performance of multi-core architectures, such as varying the number of cores and the L2 cache size; further, we have varied the directory size from 64 to 2048 entries on 4-node, 8-node, 16-node and 64-node chip multiprocessors. This presents an open area of research on multi-core processors with private/shared last-level caches, as the future trend seems to be towards tiled architectures executing multiple parallel applications with optimized silicon area utilization and excellent performance.

  18. Design and Implementation of BFD Protocol Based on Multi-core Processor

    Institute of Scientific and Technical Information of China (English)

    邓嘉; 吉萌; 雷升平

    2016-01-01

    BFD is a bidirectional forwarding fast-detection mechanism. To solve the problem of slow link-failure detection in software BFD implementations, this paper presents and implements the BFD protocol on a multi-core processor platform, with all packet sending and processing handled by the bottom-level driver. The upper layer only issues configuration commands to the driver and receives its notifications. A session is found in the session table by hash value and then matched on the relevant fields. Experiments show that the link-detection response time reaches about 20 ms, which satisfies the reliability requirements of high-performance network equipment.
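
The session-demultiplexing step described above amounts to a hash lookup keyed on a discriminator carried in the packet (in BFD per RFC 5880, the receiver demultiplexes on the Your Discriminator field), followed by a match on the remaining session fields. A minimal sketch; the class and field names are illustrative, not the paper's driver implementation:

```python
class BfdSession:
    """One BFD session; `local_disc` is our locally assigned discriminator,
    which the peer echoes back as Your Discriminator."""
    def __init__(self, local_disc, peer_ip, local_ip):
        self.local_disc = local_disc
        self.peer_ip = peer_ip
        self.local_ip = local_ip

class SessionTable:
    """Hash-based session table: O(1) average lookup on the discriminator."""
    def __init__(self):
        self._by_disc = {}

    def add(self, sess):
        self._by_disc[sess.local_disc] = sess

    def dispatch(self, your_disc, src_ip, dst_ip):
        """Find the session for a received packet, then verify its fields."""
        sess = self._by_disc.get(your_disc)
        if sess is None:
            return None
        if sess.peer_ip != src_ip or sess.local_ip != dst_ip:
            return None    # discriminator matched but the session fields do not
        return sess
```

Keeping this lookup in the driver's fast path is what lets detection run at millisecond timescales without involving the upper protocol layer.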

  19. The fast tracker processor for hadron collider triggers

    CERN Document Server

    Annovi, A; Bardi, A; Carosi, R; Dell'Orso, Mauro; D'Onofrio, M; Giannetti, P; Iannaccone, G; Morsani, E; Pietri, M; Varotto, G

    2001-01-01

    Perspectives for precise and fast track reconstruction in future hadron collider experiments are addressed. We discuss the feasibility of a pipelined highly parallel processor dedicated to the implementation of a very fast tracking algorithm. The algorithm is based on the use of a large bank of pre-stored combinations of trajectory points, called patterns, for extremely complex tracking systems. The CMS experiment at LHC is used as a benchmark. Tracking data from the events selected by the level-1 trigger are sorted and filtered by the Fast Tracker processor at an input rate of 100 kHz. This data organization allows the level-2 trigger logic to reconstruct full resolution tracks with transverse momentum above a few GeV and search for secondary vertices within typical level-2 times. (15 refs).

  20. FastFlow: Efficient Parallel Streaming Applications on Multi-core

    CERN Document Server

    Aldinucci, Marco; Meneghin, Massimiliano

    2009-01-01

    Shared memory multiprocessors have come back to popularity thanks to the rapid spread of commodity multi-core architectures. As ever, shared memory programs are fairly easy to write and quite hard to optimise; providing multi-core programmers with optimising tools and programming frameworks is nowadays a challenge. Few efforts have been made to support effective streaming applications on these architectures. In this paper we introduce FastFlow, a low-level programming framework based on lock-free queues explicitly designed to support high-level languages for streaming applications. We compare FastFlow with state-of-the-art programming frameworks such as Cilk, OpenMP, and Intel TBB. We experimentally demonstrate that FastFlow is always more efficient than all of them in a set of micro-benchmarks and on a real-world application; the speedup edge of FastFlow over other solutions can be substantial for fine-grain tasks, for example +35% over OpenMP, +226% over Cilk, +96% over TBB for the alignment of protein P01111 against UniP...
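
FastFlow's basic building block is a lock-free single-producer/single-consumer (SPSC) queue: a bounded ring buffer in which the producer writes only the tail index and the consumer writes only the head index, so neither needs a lock. The sketch below illustrates that index discipline in sequential Python (actual lock-freedom requires the memory-ordering guarantees of the C++ original, which Python cannot express):

```python
class SpscRing:
    """Bounded SPSC ring buffer. The producer touches only `tail`, the
    consumer only `head`; one slot is kept empty so that `head == tail`
    unambiguously means 'empty' and `tail + 1 == head` means 'full'."""
    def __init__(self, capacity):
        self.buf = [None] * (capacity + 1)
        self.head = 0    # next slot to pop  (consumer-owned)
        self.tail = 0    # next slot to push (producer-owned)

    def push(self, item):
        nxt = (self.tail + 1) % len(self.buf)
        if nxt == self.head:          # queue full: fail rather than block
            return False
        self.buf[self.tail] = item
        self.tail = nxt               # publish only after the slot is written
        return True

    def pop(self):
        if self.head == self.tail:    # queue empty
            return None
        item = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        return item
```

Because each index has exactly one writer, a C++ version needs only an ordered store to publish, which is what makes this structure cheap enough for fine-grain streaming tasks.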

  1. Hardware design and implementation of fast DOA estimation method based on multicore DSP

    Science.gov (United States)

    Guo, Rui; Zhao, Yingxiao; Zhang, Yue; Lin, Qianqiang; Chen, Zengping

    2016-10-01

    In this paper, we present a high-speed real-time signal processing hardware platform based on a multicore digital signal processor (DSP). The platform shows several excellent characteristics, including high-performance computing, low power consumption, large-capacity data storage and high-speed data transmission, which enable it to meet the constraints of real-time direction-of-arrival (DOA) estimation. To reduce the high computational complexity of the DOA estimation algorithm, a novel real-valued MUSIC estimator is used. The algorithm is decomposed into several independent steps and the time consumption of each step is counted. Based on these statistics, we present a new parallel processing strategy that distributes the task of DOA estimation across the cores of the platform. Experimental results demonstrate that the processing capability of the platform meets the constraints of real-time DOA estimation.

  2. Overview on Power Optimization and Evaluation Technology of Multi-core Processors

    Institute of Scientific and Technical Information of China (English)

    邢立冬

    2014-01-01

    Multi-core processors have become the mainstream of current processor design, and their parallel processing capability significantly improves processor performance. At the same time, the high integration density of multi-core processors also makes their power consumption rise significantly, which to some extent limits multi-core processor development. This paper describes the basic theory of low-power design, commonly used low-power design techniques and power estimation techniques for multi-core processors, and summarizes the latest developments in low-power multi-core processor research. This article can be a useful reference for the design of multi-core processors.

  3. Program Performance Optimization Methods for Multi-core Processors

    Institute of Scientific and Technical Information of China (English)

    昌杰

    2012-01-01

    A multi-core processor integrates multiple processor cores on a single chip and can execute multiple threads simultaneously. Although the clock frequency of each core no longer increases, the parallel processing provided by multiple cores yields far more computing power than a single core, while also greatly increasing CPU design complexity. Based on modern processor design techniques and the implementation process of a program, this paper investigates several methods to optimize program performance, thereby improving programming quality and execution efficiency.

  4. Recovery Act: Integrated DC-DC Conversion for Energy-Efficient Multicore Processors

    Energy Technology Data Exchange (ETDEWEB)

    Shepard, Kenneth L

    2013-03-31

    devices. These new approaches to scaled voltage regulation for computing devices also promise significant impact on electricity consumption in the United States and abroad by improving the efficiency of all computational platforms. In 2006, servers and datacenters in the United States consumed an estimated 61 billion kWh or about 1.5% of the nation's total energy consumption. Federal Government servers and data centers alone accounted for about 10 billion kWh, for a total annual energy cost of about $450 million. Based upon market growth and efficiency trends, estimates place current server and datacenter power consumption at nearly 85 billion kWh in the US and at almost 280 billion kWh worldwide. Similar estimates place national desktop, mobile and portable computing at 80 billion kWh combined. While national electricity utilization for computation amounts to only 4% of current usage, it is growing at a rate of about 10% a year with volume servers representing one of the largest growth segments due to the increasing utilization of cloud-based services. The percentage of power that is consumed by the processor in a server varies but can be as much as 30% of the total power utilization, with an additional 50% associated with heat removal. The approaches considered here should allow energy efficiency gains as high as 30% in processors for all computing platforms, from high-end servers to smart phones, resulting in a direct annual energy savings of almost 15 billion kWh nationally, and 50 billion kWh globally. The work developed here is being commercialized by the start-up venture, Ferric Semiconductor, which has already secured two Phase I SBIR grants to bring these technologies to the marketplace.

  5. Fundamentals of multicore software development

    CERN Document Server

    Pankratius, Victor; Tichy, Walter F

    2011-01-01

    With multicore processors now in every computer, server, and embedded device, the need for cost-effective, reliable parallel software has never been greater. By explaining key aspects of multicore programming, Fundamentals of Multicore Software Development helps software engineers understand parallel programming and master the multicore challenge. Accessible to newcomers to the field, the book captures the state of the art of multicore programming in computer science. It covers the fundamentals of multicore hardware, parallel design patterns, and parallel programming in C++, .NET, and Java. It

  6. Fast Fourier Transform Co-Processor (FFTC)- Towards Embedded GFLOPs

    Science.gov (United States)

    Kuehl, Christopher; Liebstueckel, Uwe; Tejerina, Isaac; Uemminghaus, Michael; Wite, Felix; Kolb, Michael; Suess, Martin; Weigand, Roland

    2012-08-01

    Many signal processing applications and algorithms perform their operations on the data in the transform domain to gain efficiency. The Fourier Transform Co-Processor has been developed with the aim to offload general-purpose processors from performing these transformations and therefore to boost the overall performance of a processing module. The IP of the commercial PowerFFT processor has been selected and adapted to meet the constraints of the space environment. In the frame of the ESA activity “Fast Fourier Transform DSP Co-processor (FFTC)” (ESTEC/Contract No. 15314/07/NL/LvH/ma) the objectives were the following: production of prototypes of a space-qualified version of the commercial PowerFFT chip, called FFTC, based on the PowerFFT IP; and the development of a stand-alone FFTC Accelerator Board (FTAB) based on the FFTC, including the Controller FPGA and SpaceWire interfaces, to verify the FFTC function and performance. The FFTC chip performs its calculations with floating-point precision. Stand-alone, it is capable of computing FFTs of up to 1K complex samples in length in only 10 μs. This corresponds to an equivalent processing performance of 4.7 GFlops. In this mode the maximum sustained data throughput reaches 6.4 Gbit/s. When connected to up to 4 EDAC-protected SDRAM memory banks, the FFTC can perform long FFTs with up to 1M complex samples in length or multidimensional FFT-based processing tasks. A Controller FPGA on the FTAB takes care of the SDRAM addressing. The instructions commanded via the Controller FPGA are used to set up the data flow and generate the memory addresses. The presentation will give an overview of the project, including the results of the validation of the FFTC ASIC prototypes.

  7. Fast Fourier Transform Co-processor (FFTC), towards embedded GFLOPs

    Science.gov (United States)

    Kuehl, Christopher; Liebstueckel, Uwe; Tejerina, Isaac; Uemminghaus, Michael; Witte, Felix; Kolb, Michael; Suess, Martin; Weigand, Roland; Kopp, Nicholas

    2012-10-01

    Many signal processing applications and algorithms perform their operations on the data in the transform domain to gain efficiency. The Fourier Transform Co-Processor has been developed with the aim to offload general-purpose processors from performing these transformations and therefore to boost the overall performance of a processing module. The IP of the commercial PowerFFT processor has been selected and adapted to meet the constraints of the space environment. In the frame of the ESA activity "Fast Fourier Transform DSP Co-processor (FFTC)" (ESTEC/Contract No. 15314/07/NL/LvH/ma) the objectives were the following: • Production of prototypes of a space-qualified version of the commercial PowerFFT chip called FFTC based on the PowerFFT IP. • The development of a stand-alone FFTC Accelerator Board (FTAB) based on the FFTC including the Controller FPGA and SpaceWire Interfaces to verify the FFTC function and performance. The FFTC chip performs its calculations with floating-point precision. Stand-alone, it is capable of computing FFTs of up to 1K complex samples in length in only 10 μs. This corresponds to an equivalent processing performance of 4.7 GFlops. In this mode the maximum sustained data throughput reaches 6.4 Gbit/s. When connected to up to 4 EDAC-protected SDRAM memory banks the FFTC can perform long FFTs with up to 1M complex samples in length or multidimensional FFT-based processing tasks. A Controller FPGA on the FTAB takes care of the SDRAM addressing. The instructions commanded via the Controller FPGA are used to set up the data flow and generate the memory addresses. The paper will give an overview of the project, including the results of the validation of the FFTC ASIC prototypes.

  8. A Fast CT Reconstruction Scheme for a General Multi-Core PC

    Directory of Open Access Journals (Sweden)

    Kai Zeng

    2007-01-01

    Full Text Available Expensive computational cost is a severe limitation in CT reconstruction for clinical applications that need real-time feedback. A primary example is bolus-chasing computed tomography (CT) angiography (BCA) that we have been developing for the past several years. To accelerate the reconstruction process using the filtered backprojection (FBP) method, specialized hardware or graphics cards can be used. However, specialized hardware is expensive and not flexible. The graphics processing unit (GPU) in a current graphics card can only reconstruct images with reduced precision and is not easy to program. In this paper, an acceleration scheme is proposed based on a multi-core PC. In the proposed scheme, several techniques are integrated, including utilization of geometric symmetry, optimization of data structures, single-instruction multiple-data (SIMD) processing, multithreaded computation, and an Intel C++ compiler. Our scheme maintains the original precision and involves no data exchange between the GPU and CPU. The merits of our scheme are demonstrated in numerical experiments against the traditional implementation. Our scheme achieves a speedup of about 40, which can be further improved by several folds using the latest quad-core processors.
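
The multithreaded part of such a scheme amounts to partitioning the backprojection sum over projection angles: each core accumulates a partial image from its own subset of angles, and the partial images are added at the end, so no synchronization is needed inside the loops. A toy sketch with a nearest-neighbour parallel-beam backprojector; the geometric-symmetry and SIMD optimizations of the paper are omitted, and all names are illustrative:

```python
import math
from concurrent.futures import ThreadPoolExecutor

def backproject(proj, angles, size):
    """Accumulate a size x size image from the given subset of angle indices.
    proj[a][d] is the sampled value at angle index a, detector bin d."""
    half, ndet, na = size // 2, len(proj[0]), len(proj)
    img = [[0.0] * size for _ in range(size)]
    for a in angles:
        c, s = math.cos(a * math.pi / na), math.sin(a * math.pi / na)
        for y in range(size):
            for x in range(size):
                d = int(round((x - half) * c + (y - half) * s)) + ndet // 2
                if 0 <= d < ndet:
                    img[y][x] += proj[a][d]
    return img

def add_images(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def backproject_parallel(proj, size, workers=4):
    """Split the angle range across workers, then sum the partial images."""
    n = len(proj)
    chunks = [range(i, n, workers) for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as ex:
        partials = list(ex.map(lambda ch: backproject(proj, ch, size), chunks))
    img = partials[0]
    for p in partials[1:]:
        img = add_images(img, p)
    return img
```

Because the per-angle contributions are independent, the decomposition is embarrassingly parallel; only the final reduction touches shared data.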

  9. The Serial Link Processor for the Fast TracKer (FTK) processor at ATLAS

    CERN Document Server

    Biesuz, Nicolo Vladi; The ATLAS collaboration; Luciano, Pierluigi; Magalotti, Daniel; Rossi, Enrico

    2015-01-01

    The Associative Memory (AM) system of the Fast Tracker (FTK) processor has been designed to perform pattern matching using the hit information of the ATLAS experiment silicon tracker. The AM is the heart of FTK and is mainly based on the use of ASICs (AM chips) designed to execute pattern matching with a high degree of parallelism. The AM system finds track candidates at low resolution that are seeds for a full resolution track fitting. To solve the very challenging data traffic problems inside FTK, multiple board and chip designs have been performed. The currently proposed solution is named the “Serial Link Processor” and is based on an extremely powerful network of 828 2 Gbit/s serial links for a total in/out bandwidth of 56 Gb/s. This paper reports on the design of the Serial Link Processor consisting of two types of boards, the Local Associative Memory Board (LAMB), a mezzanine where the AM chips are mounted, and the Associative Memory Board (AMB), a 9U VME board which holds and exercises four LAMBs. ...

  10. The Serial Link Processor for the Fast TracKer (FTK) processor at ATLAS

    CERN Document Server

    Biesuz, Nicolo Vladi; The ATLAS collaboration; Luciano, Pierluigi; Magalotti, Daniel; Rossi, Enrico

    2015-01-01

    The Associative Memory (AM) system of the Fast Tracker (FTK) processor has been designed to perform pattern matching using the hit information of the ATLAS experiment silicon tracker. The AM is the heart of FTK and is mainly based on the use of ASICs (AM chips) designed on purpose to execute pattern matching with a high degree of parallelism. It finds track candidates at low resolution that are seeds for a full resolution track fitting. To solve the very challenging data traffic problems inside FTK, multiple board and chip designs have been performed. The currently proposed solution is named the “Serial Link Processor” and is based on an extremely powerful network of 2 Gb/s serial links. This paper reports on the design of the Serial Link Processor consisting of two types of boards, the Local Associative Memory Board (LAMB), a mezzanine where the AM chips are mounted, and the Associative Memory Board (AMB), a 9U VME board which holds and exercises four LAMBs. We report on the performance of the intermedia...

  11. Micro mirrors based coupling of light to multi-core fiber realizing in-fiber photonic neural network processor

    Science.gov (United States)

    Cohen, Eyal; Malka, Dror; Shemer, Amir; Shahmoon, Asaf; London, Michael; Zalevsky, Zeev

    2017-02-01

    Hardware implementation of artificial neural networks facilitates real-time parallel processing of massive data sets. Optical neural networks offer low-volume 3D connectivity together with large bandwidth and minimal heat production, in contrast to electronic implementations. Here, we present a DMD-based approach to realize energetically efficient light coupling into a multi-core fiber, yielding a unique design for in-fiber optical neural networks. Neurons and synapses are realized as individual cores in a multi-core fiber. Optical signals are transferred transversely between cores by means of optical coupling. Pump-driven amplification in Erbium-doped cores mimics synaptic interactions. In order to dynamically and efficiently couple light into the multi-core fiber, a DMD-based micro mirror device is used to perform the proper beam shaping operation. The beam shaping reshapes the light into a large set of points in space matching the positions of the required cores in the entrance plane of the multi-core fiber.

  12. Fast multi-core based multimodal registration of 2D cross-sections and 3D datasets

    Directory of Open Access Journals (Sweden)

    Pielot Rainer

    2010-01-01

    Full Text Available Abstract Background Solving bioinformatics tasks often requires extensive computational power. Recent trends in processor architecture combine multiple cores into a single chip to improve overall performance. The Cell Broadband Engine (CBE), a heterogeneous multi-core processor, provides power-efficient and cost-effective high-performance computing. One application area is image analysis and visualisation, in particular registration of 2D cross-sections into 3D image datasets. Such techniques can be used to put different image modalities into spatial correspondence, for example, 2D images of histological cuts into morphological 3D frameworks. Results We evaluate the CBE-driven PlayStation 3 as a high-performance, cost-effective computing platform by adapting a multimodal alignment procedure to several characteristic hardware properties. The optimisations are based on partitioning, vectorisation, branch reducing and loop unrolling techniques with special attention to 32-bit multiplies and limited local storage on the computing units. We show how a typical image analysis and visualisation problem, the multimodal registration of 2D cross-sections and 3D datasets, benefits from the multi-core based implementation of the alignment algorithm. We discuss several CBE-based optimisation methods and compare our results to standard solutions. More information and the source code are available from http://cbe.ipk-gatersleben.de. Conclusions The results demonstrate that the CBE processor in a PlayStation 3 accelerates computationally intensive multimodal registration, which is of great importance in biological/medical image processing. The PlayStation 3 as a low-cost CBE-based platform offers an efficient option to conventional hardware to solve computational problems in image processing and bioinformatics.

  13. Fast computation of myelin maps from MRI T₂ relaxation data using multicore CPU and graphics card parallelization.

    Science.gov (United States)

    Yoo, Youngjin; Prasloski, Thomas; Vavasour, Irene; MacKay, Alexander; Traboulsee, Anthony L; Li, David K B; Tam, Roger C

    2015-03-01

    To develop a fast algorithm for computing myelin maps from multiecho T2 relaxation data using parallel computation with multicore CPUs and graphics processing units (GPUs). Using an existing MATLAB (MathWorks, Natick, MA) implementation with basic (nonalgorithm-specific) parallelism as a guide, we developed a new version to perform the same computations but using C++ to optimize the hybrid utilization of multicore CPUs and GPUs, based on experimentation to determine which algorithmic components would benefit from CPU versus GPU parallelization. Using 32-echo T2 data of dimensions 256 × 256 × 7 from 17 multiple sclerosis patients and 18 healthy subjects, we compared the two methods in terms of speed, myelin values, and the ability to distinguish between the two patient groups using Student's t-tests. The new method was faster than the MATLAB implementation by 4.13 times for computing a single map and 14.36 times for batch-processing 10 scans. The two methods produced very similar myelin values, with small and explainable differences that did not impact the ability to distinguish the two patient groups. The proposed hybrid multicore approach represents a more efficient alternative to MATLAB, especially for large-scale batch processing. © 2014 Wiley Periodicals, Inc.

  14. Multithreading for Embedded Reconfigurable Multicore Systems

    NARCIS (Netherlands)

    Zaykov, P.G.

    2014-01-01

    In this dissertation, we address the problem of performance efficient multithreading execution on heterogeneous multicore embedded systems. By heterogeneous multicore embedded systems we refer to those, which have real-time requirements and consist of processor tiles with General Purpose Processor

  15. Compiler for Fast, Accurate Mathematical Computing on Integer Processors Project

    Data.gov (United States)

    National Aeronautics and Space Administration — The proposers will develop a computer language compiler to enable inexpensive, low-power, integer-only processors to carry out mathematically-intensive computations...

  16. Fast Parallel Computation of Polynomials Using Few Processors

    DEFF Research Database (Denmark)

    Valiant, Leslie G.; Skyum, Sven; Berkowitz, S.;

    1983-01-01

    It is shown that any multivariate polynomial of degree $d$ that can be computed sequentially in $C$ steps can be computed in parallel in $O((\log d)(\log C + \log d))$ steps using only $(Cd)^{O(1)}$ processors.

  17. Fast parallel computation of polynomials using few processors

    DEFF Research Database (Denmark)

    Valiant, Leslie; Skyum, Sven

    1981-01-01

    It is shown that any multivariate polynomial that can be computed sequentially in C steps and has degree d can be computed in parallel in O((log d)(log C + log d)) steps using only (Cd)^O(1) processors.

  18. Scalable and Flexible heterogeneous multi-core system

    Directory of Open Access Journals (Sweden)

    Rashmi A Jain

    2013-01-01

    Full Text Available Multi-core systems have wide utility in today's applications due to low power consumption and high performance. Many researchers aim at improving the performance of these systems by providing flexible multi-core architectures. Flexibility in multi-core processor systems provides high throughput for uniform parallel applications as well as high performance for more general workloads. This flexibility in the architecture can be achieved by a scalable, changeable-size window microarchitecture, which uses the concept of execution locality to provide large-window capabilities. Exploiting high memory-level parallelism (MLP) reduces the impact of the memory wall. The microarchitecture contains a set of small and fast cache processors which execute high-locality code, while a network of small in-order memory engines executes low-locality code to improve performance through instruction-level parallelism (ILP). A dynamic heterogeneous multi-core architecture is capable of reconfiguring itself to fit application requirements. A study of different scalable and flexible architectures for heterogeneous multi-core systems has been carried out and is presented here.

  19. Low Power Shared Cache Reconfigurable Method for Multicore Processor

    Institute of Scientific and Technical Information of China (English)

    方娟; 雷鼎

    2013-01-01

    To reduce overall processor power consumption, current multi-core cache low-power techniques are analysed and a low-power reconfiguration method for the shared cache of chip multi-processors (CMP) is proposed. First, the necessity of cache reconfiguration is analysed using static reconfiguration on the shared cache; then a reconfiguration strategy is added to the cache access process. Experimental results show that the method saves about 18% of energy on average, while the performance penalty is only 4%.

  20. Deterministic Replay for Parallel Programs in Multi-Core Processors

    Institute of Scientific and Technical Information of China (English)

    高岚; 王锐; 钱德沛

    2013-01-01

    Deterministic replay of parallel programs on multi-core processors is an effective means of debugging parallel programs and is important for parallel programming. However, because of unsynchronized shared-memory accesses in multi-core architectures, research on deterministic replay still faces many challenges; this makes debugging parallel programs very difficult and seriously hinders their adoption and development on multi-core architectures. This paper analyzes the key factors that make deterministic replay hard to achieve on multi-core processors and summarizes the metrics for evaluating deterministic replay schemes. After surveying recent research, it reviews the proposed deterministic replay schemes for parallel programs on multi-core processor systems, analyzes and compares software-only and hardware-assisted schemes according to the summarized metrics, and on this basis discusses future research trends and application prospects of deterministic replay for parallel programs on multi-core architectures.

  1. Fast normal random number generators on vector processors

    OpenAIRE

    Brent, Richard P.

    2010-01-01

    We consider pseudo-random number generators suitable for vector processors. In particular, we describe vectorised implementations of the Box-Muller and Polar methods, and show that they give good performance on the Fujitsu VP2200. We also consider some other popular methods, e.g. the Ratio method of Kinderman and Monahan (1977) (as improved by Leva (1992)), and the method of Von Neumann and Forsythe, and show why they are unlikely to be competitive with the Polar method on vector processors.
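
The Polar (Marsaglia) method mentioned above rejects uniform points outside the unit disc and converts each accepted pair into two independent normal deviates; it avoids the sine and cosine calls of Box-Muller, which is part of why it maps well onto vector hardware. A scalar sketch of the method:

```python
import math
import random

def polar_normals(n, rng=random.random):
    """Generate n standard normal deviates with the Polar (Marsaglia) method."""
    out = []
    while len(out) < n:
        u = 2.0 * rng() - 1.0
        v = 2.0 * rng() - 1.0
        s = u * u + v * v
        if s == 0.0 or s >= 1.0:
            continue                     # reject points outside the unit disc
        f = math.sqrt(-2.0 * math.log(s) / s)
        out += [u * f, v * f]            # two independent N(0,1) deviates
    return out[:n]
```

On a vector processor the rejection step is handled by generating a batch of candidate pairs, compressing out the rejected ones, and applying the square root and logarithm to whole vectors at once.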

  2. SWIFT: Fast algorithms for multi-resolution SPH on multi-core architectures

    CERN Document Server

    Gonnet, Pedro; Theuns, Tom; Chalk, Aidan B G

    2013-01-01

    This paper describes a novel approach to neighbour-finding in Smoothed Particle Hydrodynamics (SPH) simulations with large dynamic range in smoothing length. This approach is based on hierarchical cell decompositions, sorted interactions, and a task-based formulation. It is shown to be faster than traditional tree-based codes, and to scale better than domain decomposition-based approaches on shared-memory parallel architectures such as multi-cores.
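
The cell-decomposition idea underlying this approach can be sketched in its simplest, uniform-grid form: bin particles into cells of side h, then search only the 3x3 block of neighbouring cells around each particle instead of testing all pairs. SWIFT's hierarchical cells, sorted interactions and task-based scheduling build on this base case; the 2D sketch below is illustrative only:

```python
from collections import defaultdict

def neighbours_bruteforce(pts, h):
    """All index pairs (i, j), i < j, within distance h (reference)."""
    return {(i, j) for i in range(len(pts)) for j in range(len(pts))
            if i < j and
            (pts[i][0] - pts[j][0]) ** 2 + (pts[i][1] - pts[j][1]) ** 2 <= h * h}

def neighbours_celllist(pts, h):
    """Bin particles into h-sized cells; a pair within distance h can only
    span cells whose indices differ by at most one in each coordinate."""
    cells = defaultdict(list)
    for i, (x, y) in enumerate(pts):
        cells[(int(x // h), int(y // h))].append(i)
    found = set()
    for (cx, cy), members in cells.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for i in members:
                    for j in cells.get((cx + dx, cy + dy), ()):
                        if i < j and ((pts[i][0] - pts[j][0]) ** 2 +
                                      (pts[i][1] - pts[j][1]) ** 2) <= h * h:
                            found.add((i, j))
    return found
```

For roughly uniform particle distributions this reduces the cost from O(N²) to O(N); handling a large dynamic range in smoothing length is exactly where the flat grid breaks down and hierarchical cells become necessary.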

  3. The Serial Link Processor for the Fast TracKer (FTK) processor at ATLAS

    CERN Document Server

    Andreani, A; The ATLAS collaboration; Beccherle, R; Beretta, M; Cipriani, R; Citraro, S; Citterio, M; Colombo, A; Crescioli, F; Dimas, D; Donati, S; Giannetti, P; Kordas, K; Lanza, A; Liberali, V; Luciano, P; Magalotti, D; Neroutsos, P; Nikolaidis, S; Piendibene, M; Sakellariou, A; Shojaii, S; Sotiropoulou, C-L; Stabile, A

    2014-01-01

    The Associative Memory (AM) system of the FTK processor has been designed to perform pattern matching using the hit information of the ATLAS silicon tracker. The AM is the heart of the FTK and it finds track candidates at low resolution that are seeds for a full resolution track fitting. To solve the very challenging data traffic problems inside the FTK, multiple designs and tests have been performed. The currently proposed solution is named the “Serial Link Processor” and is based on an extremely powerful network of 2 Gb/s serial links. This paper reports on the design of the Serial Link Processor consisting of the AM chip, an ASIC designed and optimized to perform pattern matching, and two types of boards, the Local Associative Memory Board (LAMB), a mezzanine where the AM chips are mounted, and the Associative Memory Board (AMB), a 9U VME board which holds and exercises four LAMBs. Special relevance will be given to the AMchip design that includes two custom cells optimized for low consumption. We repo...

  4. Graphical programming harnesses multicore PC technology

    Institute of Scientific and Technical Information of China (English)

    National Instruments

    2007-01-01

    Multicore processing is generating considerable buzz within the PC industry, largely because both Intel and AMD have released initial versions of their own multicore processors. These first multicore processors contain two cores, or computing engines, located in one physical processor, hence the name dual-core processors. Processors with more than two cores are also on the horizon. Dual-core processors can simultaneously execute two computing tasks. This is advantageous in multitasking environments, such as Windows XP, in which you simultaneously run multiple applications. Two applications, National Instruments LabVIEW and Microsoft Excel for example, can each access a separate processor core at the same time, thus improving overall performance for applications such as data logging.

  5. Multicore Considerations for Legacy Flight Software Migration

    Science.gov (United States)

    Vines, Kenneth; Day, Len

    2013-01-01

    In this paper we will discuss potential benefits and pitfalls when considering a migration from an existing single core code base to a multicore processor implementation. The results of this study present options that should be considered before migrating fault managers, device handlers and tasks with time-constrained requirements to a multicore flight software environment. Possible future multicore test bed demonstrations are also discussed.

  7. XOP: a second generation fast processor for on-line use in high energy physics experiments

    CERN Document Server

    Lingjaerde, Tor

    1981-01-01

    Processors for trigger calculations and data compression in high energy physics are characterized by a high data input capability combined with fast execution of relatively simple routines. In order to achieve the required performance it is advantageous to replace the classical computer instruction-set by microcoded instructions, the various fields of which control the internal subunits in parallel. The fast processor called ESOP is based on such a principle: the different operations are handled step by step by dedicated optimized modules under control of a central instruction unit. Thus, the arithmetic operations, address calculations, conditional checking, loop counts and next instruction evaluation all overlap in time. Based upon the experience from ESOP the architecture of a new processor "XOP" is beginning to take shape which will be faster and easier to use. In this context the most important innovations are: easy handling of operands in the arithmetic unit by means of three data buses and large data fi...

  8. Multi-Core Cache Hierarchies

    CERN Document Server

    Balasubramonian, Rajeev

    2011-01-01

    A key determinant of overall system performance and power dissipation is the cache hierarchy since access to off-chip memory consumes many more cycles and energy than on-chip accesses. In addition, multi-core processors are expected to place ever higher bandwidth demands on the memory system. All these issues make it important to avoid off-chip memory access by improving the efficiency of the on-chip cache. Future multi-core processors will have many large cache banks connected by a network and shared by many cores. Hence, many important problems must be solved: cache resources must be allocat

  9. Associative Memory Design for the FastTrack Processor (FTK) at ATLAS

    CERN Document Server

    Annovi, A; The ATLAS collaboration; Volpi, G; Beccherle, R; Bossini, E; Crescioli, F; Dell'Orso, M; Giannetti, P; Amerio, S; Hoff, J; Liu, T; Sacco, I; Liberali, V; Stabile, A; Schoening, A; Soltveit, H; Tripiccione, R

    2011-01-01

    We propose a new generation of VLSI processor for pattern recognition based on Associative Memory architecture, optimized for on-line track finding in high-energy physics experiments. We describe the architecture, the technology studies and the prototype design of a new Associative Memory project: it maximizes the pattern density on ASICs, minimizes the power consumption and improves the functionality for the fast tracker processor proposed to upgrade the ATLAS trigger at LHC. Finally we focus on possible future applications inside and outside high energy physics.

  10. A Fast DCT Algorithm for Watermarking in Digital Signal Processor

    Directory of Open Access Journals (Sweden)

    S. E. Tsai

    2017-01-01

    Full Text Available Discrete cosine transform (DCT) has been an international standard in the Joint Photographic Experts Group (JPEG) format to reduce the blocking effect in digital image compression. This paper proposes a fast discrete cosine transform (FDCT) algorithm that utilizes the energy compactness and matrix sparseness properties in the frequency domain to achieve higher computation performance. For a JPEG image of 8×8 block size in the spatial domain, the algorithm decomposes the two-dimensional (2D) DCT into one pair of one-dimensional (1D) DCTs with transform computation in only 24 multiplications. The 2D spatial data is a linear combination of the base images obtained by the outer product of the column and row vectors of cosine functions, so that the inverse DCT is equally efficient. Implementation of the FDCT algorithm shows that embedding a watermark image of 32 × 32 block pixel size in a 256 × 256 digital image can be completed in only 0.24 seconds, and the extraction of the watermark by inverse transform is within 0.21 seconds. The proposed FDCT algorithm is shown to be more computationally efficient than many previous works.
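
The 2D-to-1D decomposition the abstract relies on can be sketched as follows: an orthonormal 1D DCT-II applied along the rows and then along the columns of an 8×8 block reproduces the direct 2D DCT. This shows only the separability property, not the paper's 24-multiplication fast variant; all function names here are illustrative.

```python
import math

N = 8  # JPEG block size

def scale(k):
    # orthonormal DCT-II scale factor
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

def dct1d(v):
    return [scale(k) * sum(v[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                           for n in range(N))
            for k in range(N)]

def dct2d_separable(block):
    # 1D DCT along each row, then along each column (row-column decomposition)
    rows = [dct1d(row) for row in block]
    cols = [dct1d(list(col)) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def dct2d_direct(block):
    # direct 2D definition, used only to check the separable version
    return [[scale(u) * scale(v) * sum(
                block[x][y]
                * math.cos(math.pi * (2 * x + 1) * u / (2 * N))
                * math.cos(math.pi * (2 * y + 1) * v / (2 * N))
                for x in range(N) for y in range(N))
             for v in range(N)] for u in range(N)]
```

Separability is what turns the 64-tap 2D transform into 16 short 1D transforms, the starting point for fast variants like the paper's.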

  11. Fast space-filling molecular graphics using dynamic partitioning among parallel processors.

    Science.gov (United States)

    Gertner, B J; Whitnell, R M; Wilson, K R

    1991-09-01

    We present a novel algorithm for the efficient generation of high-quality space-filling molecular graphics that is particularly appropriate for the creation of the large number of images needed in the animation of molecular dynamics. Each atom of the molecule is represented by a sphere of an appropriate radius, and the image of the sphere is constructed pixel-by-pixel using a generalization of the lighting model proposed by Porter (Comp. Graphics 1978, 12, 282). The edges of the spheres are antialiased, and intersections between spheres are handled through a simple blending algorithm that provides very smooth edges. We have implemented this algorithm on a multiprocessor computer using a procedure that dynamically repartitions the effort among the processors based on the CPU time used by each processor to create the previous image. This dynamic reallocation among processors automatically maximizes efficiency in the face of both the changing nature of the image from frame to frame and the shifting demands of the other programs running simultaneously on the same processors. We present data showing the efficiency of this multiprocessing algorithm as the number of processors is increased. The combination of the graphics and multiprocessor algorithms allows the fast generation of many high-quality images.
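
The dynamic repartitioning idea, allocating the next frame's scanlines in proportion to each processor's measured speed on the previous frame, can be sketched as follows. This is a simplified illustration under assumed names; the proportional-share policy here is not the authors' exact scheme.

```python
def repartition(total_rows, prev_rows, prev_times):
    """Assign image rows to workers in proportion to the speed (rows/second)
    each worker achieved on the previous frame."""
    speeds = [rows / t for rows, t in zip(prev_rows, prev_times)]
    total_speed = sum(speeds)
    shares = [int(total_rows * s / total_speed) for s in speeds]
    # rows lost to rounding down go to the fastest worker
    shares[speeds.index(max(speeds))] += total_rows - sum(shares)
    return shares
```

A worker slowed by competing jobs reports a longer time, so it automatically receives fewer rows on the next frame, which is the self-balancing behaviour the abstract describes.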

  12. Multicore technology architecture, reconfiguration, and modeling

    CERN Document Server

    Qadri, Muhammad Yasir

    2013-01-01

    The saturation of design complexity and clock frequencies for single-core processors has resulted in the emergence of multicore architectures as an alternative design paradigm. Nowadays, multicore/multithreaded computing systems are not only a de-facto standard for high-end applications, they are also gaining popularity in the field of embedded computing. The start of the multicore era has altered the concepts relating to almost all of the areas of computer architecture design, including core design, memory management, thread scheduling, application support, inter-processor communication, debu

  13. Inter-processor communication method of TMS320C6678 multicore DSP

    Institute of Scientific and Technical Information of China (English)

    吴灏; 肖吉阳; 范红旗; 付强

    2012-01-01

    Inter-processor communication is the main challenge in chip multi-processor systems for embedded applications. The inter-core communication mechanism of the KeyStone-architecture TMS320C6678 processor is studied, and communication between cores is designed and implemented using inter-processor interrupts and the inter-core communication registers. From a system perspective, two topological structures for multi-core communication are designed and simulated, and their performance is analyzed and compared. The results offer guidance for designing inter-core communication on multi-core DSPs.

  14. SAD PROCESSOR FOR MULTIPLE MACROBLOCK MATCHING IN FAST SEARCH VIDEO MOTION ESTIMATION

    Directory of Open Access Journals (Sweden)

    Nehal N. Shah

    2015-02-01

    Full Text Available Motion estimation is a very important but computationally complex task in video coding. The process of determining motion vectors based on the temporal correlation of consecutive frames is used for video compression. In order to reduce the computational complexity of motion estimation and maintain the quality of encoding during motion compensation, different fast search techniques are available. These block-based motion estimation algorithms use the sum of absolute differences (SAD) between the corresponding macroblock in the current frame and all the candidate macroblocks in the reference frame to identify the best match. Existing implementations can compute the SAD between two blocks using a sequential or pipelined approach, but performing a multi-operand SAD in a single clock cycle with optimized resources is state of the art. In this paper, various parallel architectures for computation of the fixed-block-size SAD are evaluated, and a fast parallel SAD architecture with optimized resources is proposed. Further, a SAD processor with 9 processing elements is described which can be configured for any existing fast-search block matching algorithm. The proposed SAD processor consumes 7% fewer adders per processing element compared to the existing implementation. Using nine processing elements it can process 84 HD frames per second in the worst case, which is a good outcome for real-time implementation; in the average case the architecture processes 325 HD frames per second.
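
The SAD-based block matching such architectures accelerate can be stated compactly in software. This is a scalar reference sketch with illustrative names; the paper's contribution is the parallel multi-operand hardware, not this algorithm.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equal-sized blocks."""
    return sum(abs(a - b) for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

def full_search(cur_block, ref_frame, top, left, radius):
    """Exhaustively test every candidate position within +/-radius of
    (top, left) in the reference frame; return (best_sad, dy, dx)."""
    n = len(cur_block)
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = top + dy, left + dx
            if 0 <= y and 0 <= x and y + n <= len(ref_frame) and x + n <= len(ref_frame[0]):
                cand = [row[x:x + n] for row in ref_frame[y:y + n]]
                cost = sad(cur_block, cand)
                if best is None or cost < best[0]:
                    best = (cost, dy, dx)
    return best
```

Every candidate SAD is independent of the others, which is precisely why the computation maps well onto parallel processing elements.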

  15. Task Allocation and Load Balance on Multi-core Processors

    Institute of Scientific and Technical Information of China (English)

    彭蔓蔓; 黄亮

    2011-01-01

    According to the characteristics of multi-core processor systems, this paper improves the task allocation and scheduling model to balance the load across the processor cores, and proposes a balanced population genetic algorithm (BPGA). Under the height constraints of the task nodes, the algorithm assigns task nodes to processing cores at random while keeping the number of nodes balanced on each core. Experiments with randomly generated DAG graphs show that, compared with other algorithms, BPGA achieves a shorter makespan and less execution time.

  16. Associative Memory Design for the Fast TracKer Processor (FTK) at ATLAS

    CERN Document Server

    Beretta, M; The ATLAS collaboration

    2011-01-01

    We describe a VLSI processor for pattern recognition based on Content Addressable Memory (CAM) architecture, optimized for on-line track finding in high-energy physics experiments. We have developed this device using 65 nm technology combining a full custom CAM cell with standard-cell control logic. The customized design maximizes the pattern density, minimizes the power consumption and implements the functionalities needed for the planned Fast Tracker (FTK) [2], an ATLAS trigger upgrade project at LHC. We introduce a new variable resolution pattern matching technique using “don’t care” bits to set the pattern-matching window independently for each pattern and each layer.

  17. Associative Memory design for the FastTrack processor (FTK) at ATLAS

    CERN Document Server

    Annovi, A; The ATLAS collaboration; Bossini, E; Crescioli, F; Dell'Orso, M; Giannetti, P; Piendibene, M; Sacco, I; Sartori, L; Tripiccione, R

    2010-01-01

    We propose a new generation of VLSI processor for pattern recognition based on Associative Memory architecture, optimized for on-line track finding in high-energy physics experiments. We describe the architecture, the technology studies and the prototype design of a new R&D Associative Memory project: it maximizes the pattern density on ASICs, minimizes the power consumption and improves the functionality for the Fast Tracker (FTK) proposed to upgrade the ATLAS trigger at LHC. Finally we will focus on possible future applications inside and outside High Energy Physics (HEP).

  18. Fast Optimal Replica Placement with Exhaustive Search Using Dynamically Reconfigurable Processor

    Directory of Open Access Journals (Sweden)

    Hidetoshi Takeshita

    2011-01-01

    Full Text Available This paper proposes a new replica placement algorithm that expands the exhaustive search limit with reasonable calculation time. It combines a new type of parallel data-flow processor with an architecture tuned for fast calculation. The replica placement problem is to find a replica-server set satisfying service constraints in a content delivery network (CDN). It is derived from the set cover problem, which is known to be NP-hard. It is impractical to use exhaustive search to obtain optimal replica placement in large-scale networks, because calculation time increases with the number of combinations. To reduce calculation time, heuristic algorithms have been proposed, but it is known that no heuristic algorithm is assured of finding the optimal solution. The proposed algorithm suits parallel processing and pipeline execution and is implemented on DAPDNA-2, a dynamically reconfigurable processor. Experiments show that the proposed algorithm expands the exhaustive search limit by a factor of 18.8 compared to the conventional algorithm running on a von Neumann-type processor.
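
The set-cover formulation, and why exhaustive search explodes, can be sketched with a toy example. This is not the DAPDNA-2 implementation; the data layout and names are assumptions.

```python
from itertools import combinations

def optimal_replicas(servers, clients):
    """Exhaustively search for the smallest set of replica servers that
    together cover all clients.

    servers: dict mapping server name -> set of clients it can serve.
    The search space grows as 2**len(servers), which is why software
    exhaustive search only works for small networks.
    """
    client_set = set(clients)
    for k in range(1, len(servers) + 1):
        for combo in combinations(sorted(servers), k):
            covered = set().union(*(servers[s] for s in combo))
            if client_set <= covered:
                return set(combo)  # first hit at size k is optimal
    return None
```

Iterating sizes k = 1, 2, ... guarantees the first covering set found is minimal, unlike greedy heuristics, which may return a larger set.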

  19. CMS multicore scheduling strategy

    Energy Technology Data Exchange (ETDEWEB)

    Perez-Calero Yzquierdo, Antonio [Madrid, CIEMAT; Hernandez, Jose [Madrid, CIEMAT; Holzman, Burt [Fermilab; Majewski, Krista [Fermilab; McCrea, Alison [UC, San Diego

    2014-01-01

    In the next years, processor architectures based on much larger numbers of cores will most likely be the model to continue 'Moore's Law' style throughput gains. This not only results in many more parallel jobs running the LHC Run 1 era monolithic applications, but the memory requirements of these processes also push worker-node architectures to the limit. One solution is parallelizing the application itself, through forking and memory sharing or through threaded frameworks. CMS is following all of these approaches and has a comprehensive strategy to schedule multicore jobs on the GRID based on the glideinWMS submission infrastructure. The main component of the scheduling strategy, a pilot-based model with dynamic partitioning of resources that allows the transition to multicore or whole-node scheduling without disallowing the use of single-core jobs, is described. This contribution also presents the experience gained with the proposed multicore scheduling schema and gives an outlook of further developments working towards the restart of the LHC in 2015.

  20. Fast structural design and analysis via hybrid domain decomposition on massively parallel processors

    Science.gov (United States)

    Farhat, Charbel

    1993-01-01

    A hybrid domain decomposition framework for static, transient and eigen finite element analyses of structural mechanics problems is presented. Its basic ingredients include physical substructuring and /or automatic mesh partitioning, mapping algorithms, 'gluing' approximations for fast design modifications and evaluations, and fast direct and preconditioned iterative solvers for local and interface subproblems. The overall methodology is illustrated with the structural design of a solar viewing payload that is scheduled to fly in March 1993. This payload has been entirely designed and validated by a group of undergraduate students at the University of Colorado using the proposed hybrid domain decomposition approach on a massively parallel processor. Performance results are reported on the CRAY Y-MP/8 and the iPSC-860/64 Touchstone systems, which represent both extreme parallel architectures. The hybrid domain decomposition methodology is shown to outperform leading solution algorithms and to exhibit an excellent parallel scalability.

  1. DEVELOPMENT OF A FAST MICRON-RESOLUTION BEAM POSITION MONITOR SIGNAL PROCESSOR FOR LINEAR COLLIDER BEAMBASED FEEDBACK SYSTEMS

    CERN Document Server

    Apsimon, R; Clarke, C; Constance, B; Dabiri Khah, H; Hartin, T; Perry, C; Resta Lopez, J; Swinson, C; Christian, G B; Kalinin, A

    2009-01-01

    We present the design of a prototype fast beam position monitor (BPM) signal processor for use in inter-bunch beam-based feedbacks for linear colliders and electron linacs. We describe the FONT4 intra-train beam-based digital position feedback system prototype deployed at the Accelerator test facility (ATF) extraction line at KEK, Japan. The system incorporates a fast analogue beam position monitor front-end signal processor, a digital feedback board, and a fast kicker-driver amplifier. The total feedback system latency is less than 150ns, of which less than 10ns is used for the BPM processor. We report preliminary results of beam tests using electron bunches separated by c. 150ns. Position resolution of order 1 micron was obtained.

  2. Scalable Algorithms for Parallel Discrete Event Simulation Systems in Multicore Environments

    Science.gov (United States)

    2013-05-01

    Traditionally, studies with PDES algorithms on Beowulf clusters are almost singularly focused on tolerating the impact of communication latency. This work instead targets multicore and manycore processors and clusters of multicores: a multithreaded version of the ROSS simulator (ROSS-MT) was designed and optimized, and evaluated on clusters of multicores.

  3. Associative Memory Design for the Fast TracKer Processor (FTK)at ATLAS

    CERN Document Server

    Annovi, A; The ATLAS collaboration; Beretta, M; Bossini, E; Crescioli, F; Dell'Orso, M; Giannetti, P; Hoff, J; Liu, T; Liberali, V; Sacco, I; Schoening, A; Soltveit, H K; Stabile, A; Tripiccione, R

    2011-01-01

    We describe a VLSI processor for pattern recognition based on Content Addressable Memory (CAM) architecture, optimized for on-line track finding in high-energy physics experiments. A large CAM bank stores all trajectories of interest and extracts the ones compatible with a given event. This task is naturally parallelized by a CAM architecture able to output identified trajectories, recognized among a huge amount of possible combinations, in just a few 100 MHz clock cycles. We have developed this device (called the AMchip03 processor), using 180 nm technology, for the Silicon Vertex Trigger (SVT) upgrade at CDF [1] using a standard-cell VLSI design methodology. We propose a new design that introduces a full custom CAM cell and takes advantage of 65 nm technology. The customized design maximizes the pattern density, minimizes the power consumption and implements the functionalities needed for the planned Fast Tracker (FTK) [2], an ATLAS trigger upgrade project at LHC. We introduce a new variable resolution patt...

  4. Low pain vs no pain multi-core Haskells

    DEFF Research Database (Denmark)

    Aswad, Mustafa; Trinder, Phil; Al Zain, Abyd

    2011-01-01

    Multicore and NUMA architectures are becoming the dominant processor technology and functional languages are theoretically well suited to exploit them. In practice, however, implementing effective high level parallel functional languages is extremely challenging. This paper is a systematic progra...

  5. CMS Multicore Scheduling Strategy

    CERN Document Server

    Perez-Calero Yzquierdo, Antonio

    2014-01-01

    In the next years, processor architectures based on much larger numbers of cores will most likely be the model to continue Moore's Law style throughput gains. This not only results in many more parallel jobs running the LHC Run 1 era monolithic applications; the memory requirements of these processes also push the worker-node architectures to the limit. One solution is parallelizing the application itself, through forking and memory sharing or through threaded frameworks. CMS is following all of these approaches and has a comprehensive strategy to schedule multi-core jobs on the GRID based on the glideIn WMS submission infrastructure. We will present the individual components of the strategy, from special site-specific queues used during provisioning of resources and their implications for scheduling, to dynamic partitioning within a single pilot that allows a transition to multi-core or whole-node scheduling at site level without disallowing single-core jobs. In this presentation, we will present the experiences mad...

  6. Frequent Graph Mining on Multi-Core Processor

    Institute of Scientific and Technical Information of China (English)

    栾华; 周明全; 付艳

    2015-01-01

    Multi-core processors have become the mainstream of modern processor architecture. Frequent graph mining is a popular problem with practical applications in many domains, so accelerating the mining process by taking full advantage of multi-core processors has both research significance and practical value. A parallel mining strategy based on depth-first search (DFS) is proposed, with a task pool used to maintain the workload. Compared with a method based on breadth-first search, temporal data locality is improved and a large amount of memory is saved. Cache-conscious node-edge arrays, in which the record data of a thread are arranged contiguously, are designed to reduce both the data size needed to represent the original graphs and the cache miss ratio; false sharing, which severely degrades performance, is mostly eliminated. To reduce lock contention, a flexible method is explored for finding work tasks, and memory-management queues are used to reduce the overhead of frequent memory allocation and free operations. A detailed performance study on both synthetic and real data sets shows that the proposed techniques efficiently lower memory usage and cache misses and achieve a 10-fold speedup on a 12-core machine.
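
The task-pool pattern the abstract describes, where worker threads repeatedly pull tasks and push newly generated subtasks back, can be sketched generically. This is a minimal illustration assuming a single shared queue; it does not reproduce the paper's lock-reduction or cache-layout techniques.

```python
import queue
import threading

def run_task_pool(initial_tasks, expand, n_threads=4):
    """Generic task-pool loop: each worker thread repeatedly pulls a task,
    processes it with expand(task) -> (result, subtasks), and pushes the
    subtasks back into the shared pool."""
    pool = queue.Queue()
    for task in initial_tasks:
        pool.put(task)
    results, lock = [], threading.Lock()

    def worker():
        while True:
            task = pool.get()              # blocks until a task is available
            result, subtasks = expand(task)
            with lock:
                results.append(result)
            for sub in subtasks:           # enqueue before task_done so the
                pool.put(sub)              # pending count never hits zero early
            pool.task_done()

    for _ in range(n_threads):
        threading.Thread(target=worker, daemon=True).start()
    pool.join()                            # returns once every task is done
    return results
```

In a DFS-style miner, each task would be a candidate subgraph and `expand` would return its frequent extensions; the pool keeps all cores busy even when the search tree is irregular.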

  7. A Coprocessor Design for the Computation of Transcendental Functions in Multi-Core Processor

    Institute of Scientific and Technical Information of China (English)

    黄小康; 杜慧敏; 李涛; 周佳佳

    2016-01-01

    SMT-PAAG is a multi-core processor for graphics, image, and digital signal processing. This article describes the design principle, features, implementation, and verification of a coprocessor for transcendental functions based on SMT-PAAG. The coprocessor implements a four-way arithmetic channel with a fully pipelined architecture based on a piecewise linear approximation algorithm, unifying various arithmetic operations including vector multiplication, division, square root, dot product, trigonometric, exponential, and arbitrary-base logarithmic computation. Finally, the design was simulated successfully on a SystemVerilog simulation platform and the statistical error of each operation was obtained.
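
The piecewise linear approximation underlying such function units can be sketched in software: precompute a slope/intercept pair per segment, then evaluate any point with one multiply and one add. This is a simplified single-function illustration; the segment count and names are assumptions, not the coprocessor's actual parameters.

```python
import math

def build_pwl_table(f, lo, hi, n):
    """Precompute (slope, intercept) for n equal-width linear segments of f."""
    step = (hi - lo) / n
    table = []
    for i in range(n):
        x0, x1 = lo + i * step, lo + (i + 1) * step
        slope = (f(x1) - f(x0)) / step
        table.append((slope, f(x0) - slope * x0))
    return table

def eval_pwl(table, lo, hi, x):
    """Evaluate the approximation: one table lookup, one multiply, one add."""
    n = len(table)
    i = min(int((x - lo) / (hi - lo) * n), n - 1)
    slope, intercept = table[i]
    return slope * x + intercept
```

In hardware, the segment index comes straight from the high bits of the operand, so every transcendental function reduces to the same lookup-multiply-add datapath, which is what lets one unit serve many operations.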

  8. Challenges of Algebraic Multigrid across Multicore Architectures

    Energy Technology Data Exchange (ETDEWEB)

    Baker, A H; Gamblin, T; Schulz, M; Yang, U M

    2010-04-12

    Algebraic multigrid (AMG) is a popular solver for large-scale scientific computing and an essential component of many simulation codes. AMG has been shown to be extremely efficient on distributed-memory architectures. However, when executed on modern multicore architectures, we face new challenges that can significantly deteriorate AMG's performance. We examine its performance and scalability on three disparate multicore architectures: a cluster with four AMD Opteron Quad-core processors per node (Hera), a Cray XT5 with two AMD Opteron Hex-core processors per node (Jaguar), and an IBM BlueGene/P system with a single Quad-core processor (Intrepid). We discuss our experiences on these platforms and present results using both an MPI-only and a hybrid MPI/OpenMP model. We also discuss a set of techniques that helped to overcome the associated problems, including thread and process pinning and correct memory associations.

  9. Analysis and Implementation of Particle-to-Particle (P2P) Graphics Processor Unit (GPU) Kernel for Black-Box Adaptive Fast Multipole Method

    Science.gov (United States)

    2015-06-01

    Analysis and Implementation of Particle-to-Particle (P2P) Graphics Processor Unit (GPU) Kernel for Black-Box Adaptive Fast Multipole Method, by Richard H Haney and Dale Shires. ARL-TR-7315, June 2015, US Army Research Laboratory.

  10. Neural simulations on multi-core architectures

    Directory of Open Access Journals (Sweden)

    Hubert Eichner

    2009-07-01

    Full Text Available Neuroscience is witnessing increasing knowledge about the anatomy and electrophysiological properties of neurons and their connectivity, leading to an ever increasing computational complexity of neural simulations. At the same time, a rather radical change in personal computer technology emerges with the establishment of multi-cores: high-density, explicitly parallel processor architectures for both high performance as well as standard desktop computers. This work introduces strategies for the parallelization of biophysically realistic neural simulations based on the compartmental modeling technique and results of such an implementation, with a strong focus on multi-core architectures and automation, i. e. user-transparent load balancing.

  11. Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures.

    Science.gov (United States)

    Sharma, Anuj; Manolakos, Elias S

    2015-01-01

    Fast increasing computational demand for all-to-all protein structures comparison (PSC) is a result of three confounding factors: rapidly expanding structural proteomics databases, high computational complexity of pairwise protein comparison algorithms, and the trend in the domain towards using multiple criteria for protein structures comparison (MCPSC) and combining results. We have developed a software framework that exploits many-core and multicore CPUs to implement efficient parallel MCPSC in modern processors based on three popular PSC methods, namely, TMalign, CE, and USM. We evaluate and compare the performance and efficiency of the two parallel MCPSC implementations using Intel's experimental many-core Single-Chip Cloud Computer (SCC) as well as Intel's Core i7 multicore processor. We show that the 48-core SCC is more efficient than the latest generation Core i7, achieving a speedup factor of 42 (efficiency of 0.9), making many-core processors an exciting emerging technology for large-scale structural proteomics. We compare and contrast the performance of the two processors on several datasets and also show that MCPSC outperforms its component methods in grouping related domains, achieving a high F-measure of 0.91 on the benchmark CK34 dataset. The software implementation for protein structure comparison using the three methods and combined MCPSC, along with the developed underlying rckskel algorithmic skeletons library, is available via GitHub.

  12. The EDRO board connected to the Associative Memory: a "Baby" FastTracKer processor for the ATLAS experiment

    CERN Document Server

    Annovi, A; Bevacqua, V; Cervigni, F; Crescioli, F; Fabbri, L; Giannetti, P; Giorgi, F; Magalotti, D; Negri, A; Piendibene, M; Roda, C; Sbarra, C; Villa, M; Vitillo, RA; Volpi, G

    2012-01-01

    The FastTracKer (FTK), a dedicated hardware processor, performs fast and precise online full track reconstruction at the ATLAS experiment, within an average latency of a few dozen microseconds. It is made of two pipelined processors, the Associative Memory finding low precision tracks, and the Track Fitter refining the track quality with high precision fits. FTK has to face the Large Hadron Collider (LHC) Phase I luminosity. So, while the new processor requires the best of the available technology for tracking in high occupancy conditions, we want to use already existing prototypes to exercise the FTK functions early in the new ATLAS environment. A few boards connected together constitute a "baby FTK" that will soon grow into the "vertical slice". The vertical slice will cover a small projective wedge in the detector, but it will be functionally complete. It will provide a full test of the entire FTK data chain, in the laboratory first and in beam-on conditions after. It will require early development a...

  13. Research and Design of Linked IPS Based on Octeon Multi-core Network Processor for IPv6

    Institute of Scientific and Technical Information of China (English)

    杨吉喆; 王玲玲; 陆建德

    2011-01-01

    This paper studies a new-generation linked intrusion prevention system (IPS) for high-speed IPv6 networks based on the Octeon multi-core network processor, and designs a new linked intrusion prevention prototype. The system builds on high-speed processing across the Octeon cores and accounts for the new characteristics of intrusions in IPv6 networks. On top of rule matching against an intrusion-detection rule library, it applies protocol analysis and flow-based detection techniques, distributes control-plane and data-plane execution across the Octeon cores, and uses a named-block mechanism for inter-core communication. Through feedback from the data-plane cores to the control-plane core, the flow-processing and protocol-analysis modules are linked with the control module at high speed, making the system competent for Gbps-level intrusion detection and linked prevention.

  14. Summary of multi-core hardware and programming model investigations

    Energy Technology Data Exchange (ETDEWEB)

    Kelly, Suzanne Marie; Pedretti, Kevin Thomas Tauke; Levenhagen, Michael J.

    2008-05-01

    This report summarizes our investigations into multi-core processors and programming models for parallel scientific applications. The motivation for this study was to better understand the landscape of multi-core hardware, future trends, and the implications on system software for capability supercomputers. The results of this study are being used as input into the design of a new open-source light-weight kernel operating system being targeted at future capability supercomputers made up of multi-core processors. A goal of this effort is to create an agile system that is able to adapt to and efficiently support whatever multi-core hardware and programming models gain acceptance by the community.

  15. Structure_threader: An improved method for automation and parallelization of programs structure, fastStructure and MavericK on multicore CPU systems.

    Science.gov (United States)

    Pina-Martins, Francisco; Silva, Diogo N; Fino, Joana; Paulo, Octávio S

    2017-08-04

    Structure_threader is a program to parallelize multiple runs of genetic clustering software that does not make use of multithreading technology (structure, fastStructure and MavericK) on multicore computers. Our approach was benchmarked across multiple systems and displayed great speed improvements relative to the single-threaded implementation, scaling very close to linearly with the number of physical cores used. Structure_threader was compared to previous software written for the same task (ParallelStructure and StrAuto) and proved to be the fastest wrapper (up to 25% faster) under all tested scenarios. Furthermore, Structure_threader can perform several automatic and convenient operations, assisting the user in assessing the most biologically likely value of 'K' via implementations such as the "Evanno" or "Thermodynamic Integration" tests, and automatically draws the "meanQ" plots (static or interactive) for each value of K (or even combined plots). Structure_threader is written in Python 3 and licensed under the GPLv3. It can be downloaded free of charge at https://github.com/StuntsPT/Structure_threader. © 2017 John Wiley & Sons Ltd.
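The wrapper pattern described above, farming independent single-threaded runs across the physical cores, can be sketched in Python. `run_structure` is a hypothetical stand-in for one clustering invocation (a real wrapper would launch the external binary via `subprocess`); only the parallelization structure is illustrated.

```python
from concurrent.futures import ProcessPoolExecutor
import os

def run_structure(k, replicate):
    """Hypothetical stand-in for one single-threaded clustering run
    (e.g. one STRUCTURE invocation for a given K); it returns a fake
    log-likelihood so the example is self-contained."""
    return {"K": k, "rep": replicate, "lnL": -1000.0 * k - replicate}

def parallel_runs(k_values, reps, workers=None):
    """Launch every (K, replicate) run concurrently, one per process."""
    workers = workers or os.cpu_count()
    jobs = [(k, r) for k in k_values for r in reps]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_structure, *zip(*jobs)))

if __name__ == "__main__":
    results = parallel_runs(k_values=[1, 2, 3], reps=range(2), workers=2)
    print(len(results))  # 6 independent runs completed
```

Because the runs share no state, near-linear scaling with physical cores (as reported above) is the expected behavior.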

  16. Examination of Multi-Core Architectures

    Science.gov (United States)

    2010-11-01

    Report excerpt (table of contents and references only): the examined architectures include the Cell Broadband Engine, the Tera-Op Reliable Intelligently Adaptive Processing System (TRIPS), Intel (including the Core i7 architecture), and Tilera's multicore platform.

  17. T-CREST: Time-predictable multi-core architecture for embedded systems

    DEFF Research Database (Denmark)

    Schoeberl, Martin; Abbaspourseyedi, Sahar; Jordan, Alexander

    2015-01-01

    Real-time systems need time-predictable platforms to allow static analysis of the worst-case execution time (WCET). Standard multi-core processors are optimized for the average case and are hardly analyzable. Within the T-CREST project we propose novel solutions for time-predictable multi-core ar...

  18. DIALIGN P: Fast pair-wise and multiple sequence alignment using parallel processors

    OpenAIRE

    Kaufmann Michael; Nieselt Kay; Schmollinger Martin; Morgenstern Burkhard

    2004-01-01

    Abstract Background Parallel computing is frequently used to speed up computationally expensive tasks in Bioinformatics. Results Herein, a parallel version of the multi-alignment program DIALIGN is introduced. We propose two ways of dividing the program into independent sub-routines that can be run on different processors: (a) pair-wise sequence alignments that are used as a first step to multiple alignment account for most of the CPU time in DIALIGN. Since alignments of different sequence pa...

  19. Fast Track Pattern Recognition in High Energy Physics Experiments with the Automata Processor

    CERN Document Server

    Wang, Michael H L S; Green, Christopher; Guo, Deyuan; Wang, Ke; Zmuda, Ted

    2016-01-01

    We explore the Micron Automata Processor (AP) as a suitable commodity technology that can address the growing computational needs of track pattern recognition in High Energy Physics experiments. A toy detector model is developed for which a track trigger based on the Micron AP is used to demonstrate a proof-of-principle. Although primarily meant for high speed text-based searches, we demonstrate that the Micron AP is ideally suited to track finding applications.

  20. The EDRO board connected to the Associative Memory: a "Baby" FastTracKer processor for the ATLAS experiment

    CERN Document Server

    Annovi, A; The ATLAS collaboration; Villa, M; Bevacqua, V; Vitillo, R A; Giorgi, F; Magalotti, D; Roda, C; Cervigni, F; Giannetti, P; Negri, A; Piendibene, M; Volpi, G; Fabbri, L; Sbarra, C

    2011-01-01

    The FastTracKer (FTK) is a dedicated hardware system able to perform online fast and precise track reconstruction of the full events at the Atlas experiment, within an average latency of a few dozen microseconds. It is made of two pipelined processors: the Associative Memory (AM), finding low-precision tracks called "roads", and the Track Fitter (TF), refining the track quality with high-precision fits. The FTK design [1] that works well at the Large Hadron Collider (LHC) Phase I luminosity requires the best of the available technology for tracking in high-occupancy conditions. While the new processor is designed for the most demanding LHC conditions, we want to use already existing prototypes, some of them developed for the SLIM5 collaboration [2], to exercise the FTK functions in the new Atlas environment. During laboratory tests, the EDRO board (Event Dispatch and Read-Out) receives on a clustering mezzanine (able to calculate the pixel and SCT cluster centroids) "fake" detector raw data on S-links from ...

  1. DIALIGN P: Fast pair-wise and multiple sequence alignment using parallel processors

    Directory of Open Access Journals (Sweden)

    Kaufmann Michael

    2004-09-01

    Full Text Available Abstract Background Parallel computing is frequently used to speed up computationally expensive tasks in Bioinformatics. Results Herein, a parallel version of the multi-alignment program DIALIGN is introduced. We propose two ways of dividing the program into independent sub-routines that can be run on different processors: (a) pair-wise sequence alignments that are used as a first step to multiple alignment account for most of the CPU time in DIALIGN. Since alignments of different sequence pairs are completely independent of each other, they can be distributed to multiple processors without any effect on the resulting output alignments. (b) For alignments of large genomic sequences, we use a heuristic by splitting up sequences into sub-sequences based on a previously introduced anchored alignment procedure. For our test sequences, this combined approach reduces the program running time of DIALIGN by up to 97%. Conclusions By distributing sub-routines to multiple processors, the running time of DIALIGN can be crucially improved. With these improvements, it is possible to apply the program in large-scale genomics and proteomics projects that were previously beyond its scope.
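Scheme (a) above works because the pairwise alignments share no state. The independence structure can be sketched as follows; `pairwise_score` is a toy identity count, not DIALIGN's actual segment-based alignment.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import combinations

def pairwise_score(pair):
    """Toy stand-in for one pairwise alignment: counts matching
    positions. The real DIALIGN step would compute a segment-based
    alignment here; only the independence structure matters."""
    a, b = pair
    return (a, b), sum(x == y for x, y in zip(a, b))

def all_pairs_parallel(seqs, workers=2):
    """Scheme (a): distribute the n*(n-1)/2 independent pairwise
    alignments over multiple processes; the results are unaffected
    by the distribution because the tasks share no state."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(pairwise_score, combinations(seqs, 2)))

if __name__ == "__main__":
    print(all_pairs_parallel(["ACGT", "ACGA", "TCGT"]))
```

Since the pairwise step dominates DIALIGN's CPU time, distributing exactly this loop is what yields the large running-time reductions reported above.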

  2. Associative Memory design for the Fast TracK processor (FTK) at Atlas

    CERN Document Server

    Annovi, A; The ATLAS collaboration; Bossini, E; Crescioli, F; Dell'Orso, M; Giannetti, P; Piendibene, M; Sacco, I; Sartori, L; Tripiccione, R

    2010-01-01

    We describe a VLSI processor for pattern recognition based on Content Addressable Memory (CAM) architecture, optimized for on-line track finding in high-energy physics experiments. A large CAM bank stores all trajectories of interest and extracts the ones compatible with a given event. This task is naturally parallelized by a CAM architecture able to output identified trajectories, recognized among a huge number of possible combinations, in just a few 100 MHz clock cycles. We have developed this device (called the AMchip03 processor), using 180 nm technology, for the Silicon Vertex Trigger (SVT) upgrade at CDF using a standard-cell VLSI design methodology. We now propose a new design (90 nm technology) in which we introduce a full-custom standard cell, a customized design that allows us to maximize the pattern density and minimize the power consumption. We also discuss possible future extensions based on 3-D technology. This processor has a flexible and easily configurable structure that makes it suitabl...

  3. A highly efficient multi-core algorithm for clustering extremely large datasets

    Directory of Open Access Journals (Sweden)

    Kraus Johann M

    2010-04-01

    Full Text Available Abstract Background In recent years, the demand for computational power in computational biology has increased due to rapidly growing data sets from microarray and other high-throughput technologies. This demand is likely to increase. Standard algorithms for analyzing data, such as cluster algorithms, need to be parallelized for fast processing. Unfortunately, most approaches for parallelizing algorithms largely rely on network communication protocols connecting and requiring multiple computers. One answer to this problem is to utilize the intrinsic capabilities in current multi-core hardware to distribute the tasks among the different cores of one computer. Results We introduce a multi-core parallelization of the k-means and k-modes cluster algorithms based on the design principles of transactional memory for clustering gene expression microarray type data and categorical SNP data. Our new shared-memory parallel algorithms prove to be highly efficient. We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis employing repeated runs with slightly changed parameters. Computation speed of our Java-based algorithm was increased by a factor of 10 for large data sets while preserving computational accuracy compared to single-core implementations and a recently published network-based parallelization. Conclusions Most desktop computers and even notebooks provide at least dual-core processors. Our multi-core algorithms show that using modern algorithmic concepts, parallelization makes it possible to perform even such laborious tasks as cluster sensitivity and cluster number estimation on the laboratory computer.
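A shared-memory parallel k-means iteration can be sketched as a plain fork-join over data chunks: the assignment step is parallelized, then the partial sums are reduced into new centroids. This is a generic sketch; the record's transactional-memory design is not reproduced.

```python
from concurrent.futures import ProcessPoolExecutor

def assign_chunk(args):
    """Assignment step for one chunk: label each point with its nearest
    centroid and return partial sums/counts for the update step."""
    points, centroids = args
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in points:
        j = min(range(k),
                key=lambda c: sum((p[d] - centroids[c][d]) ** 2
                                  for d in range(dim)))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts

def kmeans_step(points, centroids, workers=2):
    """One parallel k-means iteration: map chunks, reduce the partials."""
    chunk = max(1, len(points) // workers)
    chunks = [points[i:i + chunk] for i in range(0, len(points), chunk)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(assign_chunk,
                                 [(c, centroids) for c in chunks]))
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for s, c in partials:
        for j in range(k):
            counts[j] += c[j]
            for d in range(dim):
                sums[j][d] += s[j][d]
    return [[sums[j][d] / counts[j] if counts[j] else centroids[j][d]
             for d in range(dim)] for j in range(k)]
```

Repeated runs with perturbed parameters, as used above for cluster stability analysis, are then just repeated calls to this step with different seeds.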

  4. QuickProbs--a fast multiple sequence alignment algorithm designed for graphics processors.

    Science.gov (United States)

    Gudyś, Adam; Deorowicz, Sebastian

    2014-01-01

    Multiple sequence alignment is a crucial task in a number of biological analyses like secondary structure prediction, domain searching, phylogeny, etc. MSAProbs is currently the most accurate alignment algorithm, but its effectiveness is obtained at the expense of computational time. In the paper we present QuickProbs, a variant of MSAProbs customised for graphics processors. We selected the two most time-consuming stages of MSAProbs to be redesigned for GPU execution: the posterior matrices calculation and the consistency transformation. Experiments on three popular benchmarks (BAliBASE, PREFAB, OXBench-X) on a quad-core PC equipped with a high-end graphics card show QuickProbs to be 5.7 to 9.7 times faster than the original CPU-parallel MSAProbs. Additional tests performed on several protein families from the Pfam database give an overall speed-up of 6.7. Compared to other algorithms like MAFFT, MUSCLE, or ClustalW, QuickProbs proved to be much more accurate at similar speed. Additionally we introduce a tuned variant of QuickProbs which is significantly more accurate on sets of distantly related sequences than MSAProbs without exceeding its computation time. The GPU part of QuickProbs was implemented in OpenCL, thus the package is suitable for graphics processors produced by all major vendors.

  5. QuickProbs—A Fast Multiple Sequence Alignment Algorithm Designed for Graphics Processors

    Science.gov (United States)

    Gudyś, Adam; Deorowicz, Sebastian

    2014-01-01

    Multiple sequence alignment is a crucial task in a number of biological analyses like secondary structure prediction, domain searching, phylogeny, etc. MSAProbs is currently the most accurate alignment algorithm, but its effectiveness is obtained at the expense of computational time. In the paper we present QuickProbs, a variant of MSAProbs customised for graphics processors. We selected the two most time-consuming stages of MSAProbs to be redesigned for GPU execution: the posterior matrices calculation and the consistency transformation. Experiments on three popular benchmarks (BAliBASE, PREFAB, OXBench-X) on a quad-core PC equipped with a high-end graphics card show QuickProbs to be 5.7 to 9.7 times faster than the original CPU-parallel MSAProbs. Additional tests performed on several protein families from the Pfam database give an overall speed-up of 6.7. Compared to other algorithms like MAFFT, MUSCLE, or ClustalW, QuickProbs proved to be much more accurate at similar speed. Additionally we introduce a tuned variant of QuickProbs which is significantly more accurate on sets of distantly related sequences than MSAProbs without exceeding its computation time. The GPU part of QuickProbs was implemented in OpenCL, thus the package is suitable for graphics processors produced by all major vendors. PMID:24586435

  6. QuickProbs--a fast multiple sequence alignment algorithm designed for graphics processors.

    Directory of Open Access Journals (Sweden)

    Adam Gudyś

    Full Text Available Multiple sequence alignment is a crucial task in a number of biological analyses like secondary structure prediction, domain searching, phylogeny, etc. MSAProbs is currently the most accurate alignment algorithm, but its effectiveness is obtained at the expense of computational time. In the paper we present QuickProbs, a variant of MSAProbs customised for graphics processors. We selected the two most time-consuming stages of MSAProbs to be redesigned for GPU execution: the posterior matrices calculation and the consistency transformation. Experiments on three popular benchmarks (BAliBASE, PREFAB, OXBench-X) on a quad-core PC equipped with a high-end graphics card show QuickProbs to be 5.7 to 9.7 times faster than the original CPU-parallel MSAProbs. Additional tests performed on several protein families from the Pfam database give an overall speed-up of 6.7. Compared to other algorithms like MAFFT, MUSCLE, or ClustalW, QuickProbs proved to be much more accurate at similar speed. Additionally we introduce a tuned variant of QuickProbs which is significantly more accurate on sets of distantly related sequences than MSAProbs without exceeding its computation time. The GPU part of QuickProbs was implemented in OpenCL, thus the package is suitable for graphics processors produced by all major vendors.

  7. The Serial Link Processor for the Fast TracKer (FTK) at ATLAS

    CERN Document Server

    Annovi, A; The ATLAS collaboration; Beccherle, R; Beretta, M; Biesuz, N; Billereau, W; Combe, J M; Citterio, M; Citraro, S; Crescioli, F; Dimas, D; Donati, S; Gentsos, C; Giannetti, P; Kordas, K; Lanza, A; Liberali, V; Luciano, P; Magalotti, D; Neroutsos, P; Nikolaidis, S; Piendibene, M; Rossi, E; Sakellariou, A; Shojaii, S; Sotiropoulou, C-L; Stabile, A; Vulliez, P

    2014-01-01

    The Associative Memory (AM) system of the FTK processor has been designed to perform pattern matching using the hit information of the ATLAS silicon tracker. The AM is the heart of the FTK: it finds track candidates at low resolution that serve as seeds for full-resolution track fitting. To solve the very challenging data traffic problem inside the FTK, multiple designs and tests have been performed. The currently proposed solution is named the "Serial Link Processor" and is based on an extremely powerful network of 2 Gb/s serial links. This paper reports on the design of the Serial Link Processor, consisting of the AM chip, an ASIC designed and optimized to perform pattern matching, and two types of boards: the Local Associative Memory Board (LAMB), a mezzanine where the AM chips are mounted, and the Associative Memory Board (AMB), a 9U VME board which holds and exercises four LAMBs. We also report on the performance of a first prototype based on the use of a min@sic AM chip, a small but complete version ...

  8. A Parallel FPGA Implementation for Real-Time 2D Pixel Clustering for the ATLAS Fast TracKer Processor

    CERN Document Server

    Sotiropoulou, C-L; The ATLAS collaboration; Annovi, A; Beretta, M; Kordas, K; Nikolaidis, S; Petridou, C; Volpi, G

    2014-01-01

    The parallel 2D pixel clustering FPGA implementation used for the input system of the ATLAS Fast TracKer (FTK) processor is presented. The input system for the FTK processor will receive data from the Pixel and micro-strip detectors from inner ATLAS read-out drivers (RODs) at full rate, for a total of 760 Gb/s, as sent by the RODs after level-1 triggers. Clustering serves two purposes: the first is to reduce the high rate of the received data before further processing; the second is to determine the cluster centroid to obtain the best spatial measurement. For the pixel detectors the clustering is implemented by using a 2D-clustering algorithm that takes advantage of a moving window technique to minimize the logic required for cluster identification. The cluster detection window size can be adjusted for optimizing the cluster identification process. Additionally, the implementation can be parallelized by instantiating multiple cores to identify different clusters independently, thus exploiting more FPGA resources. ...
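The two purposes named above, grouping adjacent hit pixels and computing each cluster's centroid, can be mirrored in a small software analogue. This connected-component sketch is an illustration only; the FPGA design uses a sliding detection window rather than breadth-first search.

```python
from collections import deque

def cluster_hits(hits):
    """Software analogue of the clustering step: group adjacent hit
    pixels (8-connectivity) and return one centroid per cluster,
    reducing many raw hits to a few spatial measurements."""
    remaining = set(hits)
    centroids = []
    while remaining:
        seed = remaining.pop()
        cluster, queue = [seed], deque([seed])
        while queue:                      # flood-fill one cluster
            x, y = queue.popleft()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    n = (x + dx, y + dy)
                    if n in remaining:
                        remaining.remove(n)
                        cluster.append(n)
                        queue.append(n)
        cx = sum(p[0] for p in cluster) / len(cluster)
        cy = sum(p[1] for p in cluster) / len(cluster)
        centroids.append((cx, cy))
    return sorted(centroids)

# Two separated groups of pixel hits reduce to two centroids:
print(cluster_hits([(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]))
```

Parallelizing this by region, one worker per detector window, corresponds loosely to the multiple independent clustering cores instantiated on the FPGA.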

  9. A Parallel FPGA Implementation for Real-Time 2D Pixel Clustering for the ATLAS Fast TracKer Processor

    CERN Document Server

    Sotiropoulou, C-L; The ATLAS collaboration; Annovi, A; Beretta, M; Kordas, K; Nikolaidis, S; Petridou, C; Volpi, G

    2014-01-01

    The parallel 2D pixel clustering FPGA implementation used for the input system of the ATLAS Fast TracKer (FTK) processor is presented. The input system for the FTK processor will receive data from the Pixel and micro-strip detectors from inner ATLAS read-out drivers (RODs) at full rate, for a total of 760 Gb/s, as sent by the RODs after level-1 triggers. Clustering serves two purposes: the first is to reduce the high rate of the received data before further processing; the second is to determine the cluster centroid to obtain the best spatial measurement. For the pixel detectors the clustering is implemented by using a 2D-clustering algorithm that takes advantage of a moving window technique to minimize the logic required for cluster identification. The cluster detection window size can be adjusted for optimizing the cluster identification process. Additionally, the implementation can be parallelized by instantiating multiple cores to identify different clusters independently, thus exploiting more FPGA resources. ...

  10. Fast start-up of microchannel fuel processor integrated with an igniter for hydrogen combustion

    Science.gov (United States)

    Ryi, Shin Kun; Park, Jong Soo; Cho, Song Ho; Kim, Sung Hyun

    A Pt-Zr catalyst-coated FeCrAlY mesh is introduced into the combustion outlet conduit of a newly designed microchannel reactor (MCR) as an igniter of hydrogen combustion to decrease the start-up time. The catalyst is coated using a wash-coating method. After installing the Pt-Zr/FeCrAlY mesh, the reactor is heated to its running temperature within 1 min by hydrogen combustion. Two plate-type heat-exchangers are introduced at the combustion outlet and reforming outlet conduits of the microchannel reactor in order to recover the heat of the combustion gas and reformed gas, respectively. Using these heat-exchangers, methane steam reforming is carried out with hydrogen combustion, and the reforming capacity and energy efficiency are enhanced by up to 3.4 and 1.7 times, respectively. A compact fuel processor and fuel-cell system using this reactor concept is expected to show considerable advancement.

  11. A fast adaptive convex hull algorithm on two-dimensional processor arrays with a reconfigurable BUS system

    Science.gov (United States)

    Olariu, S.; Schwing, J.; Zhang, J.

    1991-01-01

    A bus system that can change dynamically to suit computational needs is referred to as reconfigurable. We present a fast adaptive convex hull algorithm on a two-dimensional processor array with a reconfigurable bus system (2-D PARBS, for short). Specifically, we show that computing the convex hull of a planar set of n points takes O(log n/log m) time on a 2-D PARBS of size mn × n with 3 ≤ m ≤ n. Our result implies that the convex hull of n points in the plane can be computed in O(1) time on a 2-D PARBS of size n^1.5 × n.
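The sub-logarithmic bound above relies on broadcasting over dynamically formed buses, which cannot be reproduced in software. As a point of reference for the task being solved, here is the standard sequential O(n log n) baseline (Andrew's monotone chain), not the PARBS algorithm itself:

```python
def convex_hull(points):
    """Andrew's monotone chain: a sequential O(n log n) baseline for
    the planar convex hull task that the 2-D PARBS algorithm solves
    in O(log n / log m) time using reconfigurable buses."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):  # z-component of (a-o) x (b-o)
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])

    lower, upper = [], []
    for p in pts:                       # build lower hull
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):             # build upper hull
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]      # counter-clockwise hull

print(convex_hull([(0, 0), (2, 0), (1, 1), (2, 2), (0, 2)]))
# → [(0, 0), (2, 0), (2, 2), (0, 2)]  (interior point dropped)
```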

  12. DYNAMIC TASK SCHEDULING ON MULTICORE AUTOMOTIVE ECUS

    Directory of Open Access Journals (Sweden)

    Geetishree Mishra

    2014-12-01

    Full Text Available Automobile manufacturers are controlled by stringent government regulations for safety and fuel emissions and motivated towards adding more advanced features and sophisticated applications to the existing electronic system. Ever increasing customers' demands for a high level of comfort also necessitate providing even more sophistication in the vehicle electronics system. All these directly make the vehicle software system more complex and computationally more intensive. In turn, this demands very high computational capability of the microprocessor used in the electronic control unit (ECU). In this regard, multicore processors have already been implemented in some of the computationally intensive ECUs like power train, image processing and infotainment. To achieve greater performance from these multicore processors, parallelized ECU software needs to be efficiently scheduled by the underlying operating system for execution to utilize all the computational cores to the maximum extent possible and meet the real-time constraints. In this paper, we propose a dynamic task scheduler for a multicore engine control ECU that provides maximum CPU utilization, minimized preemption overhead, minimum average waiting time, and all tasks meeting their real-time deadlines when compared to the static priority scheduling suggested by the Automotive Open Systems Architecture (AUTOSAR).
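Dynamic-priority scheduling of the kind contrasted here with AUTOSAR's static priorities can be illustrated with earliest-deadline-first (EDF). This is a minimal single-core, non-preemptive sketch, not the paper's actual multicore ECU scheduler:

```python
import heapq

def edf_schedule(tasks):
    """Minimal earliest-deadline-first sketch (single core,
    non-preemptive). Each task is (release, wcet, deadline); returns
    the execution order and whether every deadline was met."""
    tasks = sorted(tasks)                     # by release time
    ready, order, t, i, ok = [], [], 0, 0, True
    while i < len(tasks) or ready:
        if not ready and t < tasks[i][0]:
            t = tasks[i][0]                   # idle until next release
        while i < len(tasks) and tasks[i][0] <= t:
            r, c, d = tasks[i]
            heapq.heappush(ready, (d, r, c))  # priority = deadline
            i += 1
        d, r, c = heapq.heappop(ready)        # earliest deadline first
        t += c
        order.append((r, c, d))
        ok = ok and t <= d
    return order, ok

jobs = [(0, 2, 10), (0, 3, 5), (1, 1, 4)]   # (release, wcet, deadline)
print(edf_schedule(jobs))
```

Unlike a fixed static priority, the priority here is re-evaluated at every dispatch from the current deadlines, which is what allows such schedulers to improve utilization and deadline compliance.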

  13. Dynamic Task Scheduling on Multicore Automotive ECUs

    Directory of Open Access Journals (Sweden)

    Geetishree Mishra

    2014-12-01

    Full Text Available Automobile manufacturers are controlled by stringent government regulations for safety and fuel emissions and motivated towards adding more advanced features and sophisticated applications to the existing electronic system. Ever increasing customers' demands for a high level of comfort also necessitate providing even more sophistication in the vehicle electronics system. All these directly make the vehicle software system more complex and computationally more intensive. In turn, this demands very high computational capability of the microprocessor used in the electronic control unit (ECU). In this regard, multicore processors have already been implemented in some of the task-rigorous ECUs like power train, image processing and infotainment. To achieve greater performance from these multicore processors, parallelized ECU software needs to be efficiently scheduled by the underlying operating system for execution to utilize all the computational cores to the maximum extent possible and meet the real-time constraints. In this paper, we propose a dynamic task scheduler for a multicore engine control ECU that provides maximum CPU utilization, minimized preemption overhead, minimum average waiting time, and all tasks meeting their real-time deadlines when compared to the static priority scheduling suggested by the Automotive Open Systems Architecture (AUTOSAR).

  14. Algorithms for Optimally Arranging Multicore Memory Structures

    Directory of Open Access Journals (Sweden)

    Wei-Che Tseng

    2010-01-01

    Full Text Available As more processing cores are added to embedded systems processors, the relationships between cores and memories have more influence on the energy consumption of the processor. In this paper, we conduct fundamental research to explore the effects of memory sharing on energy in a multicore processor. We study the Memory Arrangement (MA) problem. We prove that the general case of MA is NP-complete. We present an optimal algorithm for solving linear MA and optimal and heuristic algorithms for solving rectangular MA. On average, we can produce arrangements that consume 49% less energy than an all-shared memory arrangement and 14% less energy than an all-private memory arrangement for randomly generated instances. For DSP benchmarks, we can produce arrangements that, on average, consume 20% less energy than an all-shared memory arrangement and 27% less energy than an all-private memory arrangement.

  15. MonetDB/XQuery: A fast XQuery processor powered by a relational engine

    NARCIS (Netherlands)

    Boncz, P.A.; Grust, T.; Keulen, M. van; Manegold, S.; Rittinger, J.; Teubner, J.

    2006-01-01

    Relational XQuery systems try to re-use mature relational data management infrastructures to create fast and scalable XML database technology. This paper describes the main features, key contributions, and lessons learned while implementing such a system. Its architecture consists of (i) a range-bas

  16. MonetDB/XQuery: a fast XQuery processor powered by a relational engine

    NARCIS (Netherlands)

    Boncz, P.; Grust, T.; Keulen, van M.; Manegold, S.; Rittinger, J.; Teubner, J.

    2006-01-01

    Relational XQuery systems try to re-use mature relational data management infrastructures to create fast and scalable XML database technology. This paper describes the main features, key contributions, and lessons learned while implementing such a system. Its architecture consists of (i) a range-bas

  17. An image sensor with fast extraction of objects' positions - Rough vision processor

    OpenAIRE

    Watanabe, Akira; Miyama, Masayuki; Tooyama, Osamu; Yoshimoto, Masahiko; Akita, Junichi

    2001-01-01

    An integration of the signal processing circuits with the image-acquiring device, called the vision chip, which can process information in parallel, is proposed for fast image processing. In applications for robot vision, not only the detailed information, such as shape or texture, but also the rough information, such as 'something is around here', are important and useful. We consider the detection of centroids of objects in the focal plane as the rough vision processing, which is usef...

  18. The Associative Memory Serial Link Processor for the Fast TracKer (FTK) at ATLAS

    CERN Document Server

    Andreani, A; The ATLAS collaboration; Beccherle, R; Beretta, M; Biesuz, N; Billereau, W; Cipriani, R; Citraro, S; Citterio, M; Colombo, A; Combe, J M; Crescioli, F; Donati, S; Gentsos, C; Giannetti, P; Kordas, K; Lanza, A; Liberali, V; Luciano, P; Magalotti, D; Neroutsos, P; Nikolaidis, S; Piendibene, M; Rossi, E; Shojaii, S; Sotiropoulou, C-L; Stabile, A; Vulliez, P

    2014-01-01

    The Fast TracKer (FTK) is an extremely powerful and very compact processing unit, essential for efficient Level 2 trigger selection in future high-energy physics experiments at LHC. FTK employs Associative Memories (AM) to perform pattern recognition; input and output data are transmitted over serial links at 2 Gbit/s, to reduce routing congestion at board level. Prototypes of the AM chip and of the AM board have been manufactured and tested, in view of the imminent design of the final version.

  19. Energy consumption model over parallel programs implemented on multicore architectures

    Directory of Open Access Journals (Sweden)

    Ricardo Isidro-Ramirez

    2015-06-01

    Full Text Available In High Performance Computing, energy consumption is becoming an important aspect to consider. Because of the high cost of energy production in all countries, saving energy plays an important role, reflected in efforts to reduce the energy requirements of hardware components and applications. Several options have appeared for scaling down energy use and, consequently, scaling up energy efficiency. One of these strategies is the multithread programming paradigm, whose purpose is to produce parallel programs able to use the full amount of computing resources available in a microprocessor. That energy-saving strategy focuses on efficient use of the multicore processors found in various computing devices, such as mobile devices. In fact, as a growing trend, multicore processors have been part of various special-purpose computers since 2003, from High Performance Computing servers to mobile devices. However, it is not clear how multiprogramming affects energy efficiency. This paper presents an analysis of different types of multicore-based architectures used in computing, and then a valid model is presented. Based on Amdahl's Law, a model that considers different scenarios of energy use in multicore architectures is proposed. Some interesting results were found from experiments with the developed algorithm, which was executed in both parallel and sequential ways. A lower limit of energy consumption was found for one type of multicore architecture, and this behavior was observed experimentally.
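An Amdahl-style energy estimate of this kind can be sketched as follows. The record does not give the model's details, so the active/idle power split below is an assumption, not the paper's formulation:

```python
def speedup(p, n):
    """Amdahl's law: p = parallel fraction of the work, n = cores."""
    return 1.0 / ((1.0 - p) + p / n)

def energy(p, n, t1=1.0, p_active=1.0, p_idle=0.3):
    """Assumed Amdahl-style energy sketch (not the paper's model):
    during the sequential phase one core is active and n-1 idle;
    during the parallel phase all n cores are active."""
    t_seq = (1.0 - p) * t1          # sequential phase duration
    t_par = p * t1 / n              # parallel phase duration
    return (t_seq * (p_active + (n - 1) * p_idle)
            + t_par * n * p_active)

if __name__ == "__main__":
    p = 0.9
    for n in (1, 2, 4, 8, 16):
        print(n, round(speedup(p, n), 2), round(energy(p, n), 3))
```

Under these assumed parameters the idle power of waiting cores makes total energy grow with the core count even as runtime shrinks, which is the kind of speed/energy trade-off such models are built to expose.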

  20. Evaluating Multicore Algorithms on the Unified Memory Model

    Directory of Open Access Journals (Sweden)

    John E. Savage

    2009-01-01

    Full Text Available One of the challenges to achieving good performance on multicore architectures is the effective utilization of the underlying memory hierarchy. While this is an issue for single-core architectures, it is a critical problem for multicore chips. In this paper, we formulate the unified multicore model (UMM) to help understand the fundamental limits on cache performance on these architectures. The UMM seamlessly handles different types of multiple-core processors with varying degrees of cache sharing at different levels. We demonstrate that our model can be used to study a variety of multicore architectures on a variety of applications. In particular, we use it to analyze an option pricing problem using the trinomial model and develop an algorithm for it that has near-optimal memory traffic between cache levels. We have implemented the algorithm on two quad-core Intel Xeon 5310 1.6 GHz processors (8 cores). It achieves a peak performance of 19.5 GFLOPs, which is 38% of the theoretical peak of the multicore system. We demonstrate that our algorithm outperforms compiler-optimized and auto-parallelized code by a factor of up to 7.5.
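The trinomial option-pricing kernel analysed above can be sketched in plain form. This uses standard Boyle-style tree parameters and none of the paper's cache-aware blocking; it only shows the backward-induction traffic pattern that such blocking optimizes:

```python
import math

def trinomial_call(s0, k, r, sigma, t, steps):
    """European call on a recombining trinomial tree (standard
    Boyle-style parameters; sketch of the kernel only, without the
    paper's cache-optimal blocking between memory levels)."""
    dt = t / steps
    u = math.exp(sigma * math.sqrt(2.0 * dt))     # up factor per step
    su = math.exp(sigma * math.sqrt(dt / 2.0))
    sd = 1.0 / su
    er = math.exp(r * dt / 2.0)
    pu = ((er - sd) / (su - sd)) ** 2             # up probability
    pd = ((su - er) / (su - sd)) ** 2             # down probability
    pm = 1.0 - pu - pd                            # middle probability
    disc = math.exp(-r * dt)
    # payoffs on the 2*steps + 1 terminal nodes
    vals = [max(s0 * u ** (j - steps) - k, 0.0)
            for j in range(2 * steps + 1)]
    # backward induction, one tree level at a time
    for level in range(steps - 1, -1, -1):
        vals = [disc * (pu * vals[j + 2] + pm * vals[j + 1] + pd * vals[j])
                for j in range(2 * level + 1)]
    return vals[0]

price = trinomial_call(s0=100, k=100, r=0.05, sigma=0.2, t=1.0, steps=200)
print(round(price, 2))  # converges toward the Black-Scholes value ≈ 10.45
```

Each level reads three adjacent values per node, so the sweep is memory-bandwidth bound; tiling several levels into cache is what yields the near-optimal memory traffic reported above.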

  1. Improving the scalability of multicore systems with a focus on H.264 video decoding

    NARCIS (Netherlands)

    Meenderinck, C.H.

    2010-01-01

    In pursuit of ever increasing performance, more and more processor architectures have become multicore processors. As clock frequency was no longer increasing rapidly and ILP techniques showed diminishing results, increasing the number of cores per chip was the natural choice. The transistor budget

  2. Evolution of CMS Workload Management Towards Multicore Job Support

    Energy Technology Data Exchange (ETDEWEB)

    Perez-Calero Yzquierdo, A. [Madrid, CIEMAT; Hernández, J. M. [Madrid, CIEMAT; Khan, F. A. [Quaid-i-Azam U.; Letts, J. [UC, San Diego; Majewski, K. [Fermilab; Rodrigues, A. M. [Fermilab; McCrea, A. [UC, San Diego; Vaandering, E. [Fermilab

    2015-12-23

    The successful exploitation of multicore processor architectures is a key element of the LHC distributed computing system in the coming era of the LHC Run 2. High-pileup complex-collision events represent a challenge for the traditional sequential programming in terms of memory and processing time budget. The CMS data production and processing framework is introducing the parallel execution of the reconstruction and simulation algorithms to overcome these limitations. CMS plans to execute multicore jobs while still supporting single-core processing for other tasks difficult to parallelize, such as user analysis. The CMS strategy for job management thus aims at integrating single and multicore job scheduling across the Grid. This is accomplished by employing multicore pilots with internal dynamic partitioning of the allocated resources, capable of running payloads of various core counts simultaneously. An extensive test programme has been conducted to enable multicore scheduling with the various local batch systems available at CMS sites, with the focus on the Tier-0 and Tier-1s, responsible during 2015 for the prompt data reconstruction. Scale tests have been run to analyse the performance of this scheduling strategy and ensure an efficient use of the distributed resources. This paper presents the evolution of the CMS job management and resource provisioning systems in order to support this hybrid scheduling model, as well as its deployment and performance tests, which will enable CMS to transition to a multicore production model for the second LHC run.

  3. Evolution of CMS workload management towards multicore job support

    Science.gov (United States)

    Pérez-Calero Yzquierdo, A.; Hernández, J. M.; Khan, F. A.; Letts, J.; Majewski, K.; Rodrigues, A. M.; McCrea, A.; Vaandering, E.

    2015-12-01

    The successful exploitation of multicore processor architectures is a key element of the LHC distributed computing system in the coming era of the LHC Run 2. High-pileup complex-collision events represent a challenge for the traditional sequential programming in terms of memory and processing time budget. The CMS data production and processing framework is introducing the parallel execution of the reconstruction and simulation algorithms to overcome these limitations. CMS plans to execute multicore jobs while still supporting single-core processing for other tasks difficult to parallelize, such as user analysis. The CMS strategy for job management thus aims at integrating single and multicore job scheduling across the Grid. This is accomplished by employing multicore pilots with internal dynamic partitioning of the allocated resources, capable of running payloads of various core counts simultaneously. An extensive test programme has been conducted to enable multicore scheduling with the various local batch systems available at CMS sites, with the focus on the Tier-0 and Tier-1s, responsible during 2015 for the prompt data reconstruction. Scale tests have been run to analyse the performance of this scheduling strategy and ensure an efficient use of the distributed resources. This paper presents the evolution of the CMS job management and resource provisioning systems in order to support this hybrid scheduling model, as well as its deployment and performance tests, which will enable CMS to transition to a multicore production model for the second LHC run.

  4. Using Multi-Core Systems for Rover Autonomy

    Science.gov (United States)

    Clement, Brad; Estlin, Tara; Bornstein, Benjamin; Springer, Paul; Anderson, Robert C.

    2010-01-01

    Task Objectives are: (1) Develop and demonstrate key capabilities for rover long-range science operations using multi-core computing, (a) Adapt three rover technologies to execute on SOA multi-core processor (b) Illustrate performance improvements achieved (c) Demonstrate adapted capabilities with rover hardware, (2) Targeting three high-level autonomy technologies (a) Two for onboard data analysis (b) One for onboard command sequencing/planning, (3) Technologies identified as enabling for future missions, (4)Benefits will be measured along several metrics: (a) Execution time / Power requirements (b) Number of data products processed per unit time (c) Solution quality

  5. Multicore systems on-chip practical software/hardware design

    CERN Document Server

    Abdallah, Abderazek Ben

    2013-01-01

    System-on-chip designs have evolved from fairly simple unicore, single-memory designs to complex heterogeneous multicore SoC architectures consisting of a large number of IP blocks on the same silicon. To meet the high computational demands posed by the latest consumer electronic devices, most current systems are based on this paradigm, which represents a real revolution in many aspects of computing. The attraction of multicore processing for power reduction is compelling. By splitting a set of tasks among multiple processor cores, the operating frequency necessary for each core can be reduced, allowing...
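The frequency-splitting argument above can be put in numbers with the classic CMOS dynamic-power model P = C·V²·f. This is an illustrative sketch, not from the book, and it assumes supply voltage can be scaled roughly linearly with frequency, which real silicon only approximates.

```python
def dynamic_power(cap, volt, freq):
    # classic CMOS dynamic-power model: P = C * V^2 * f
    return cap * volt**2 * freq

# hypothetical single-core baseline (illustrative numbers)
C, V, F = 1.0, 1.2, 2.0e9
single = dynamic_power(C, V, F)

# same aggregate throughput on n cores, each at F/n; assume voltage
# scales linearly with frequency (a simplifying assumption)
n = 4
multi = n * dynamic_power(C, V / n, F / n)

print(multi / single)  # about 1/n**2 = 0.0625 of the single-core power
```

Under these assumptions, quartering the clock on four cores cuts dynamic power to roughly one sixteenth, which is why the "attraction for power reduction is compelling".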

  6. A pipeline VLSI design of fast singular value decomposition processor for real-time EEG system based on on-line recursive independent component analysis.

    Science.gov (United States)

    Huang, Kuan-Ju; Shih, Wei-Yeh; Chang, Jui Chung; Feng, Chih Wei; Fang, Wai-Chi

    2013-01-01

    This paper presents a pipelined VLSI design of a fast singular value decomposition (SVD) processor for a real-time electroencephalography (EEG) system based on on-line recursive independent component analysis (ORICA). Since SVD is used frequently in computations of the real-time EEG system, a low-latency and high-accuracy SVD processor is essential. During the EEG system process, the proposed SVD processor aims to solve the diagonal, inverse and inverse square root matrices of the target matrices in real time. Generally, SVD requires a huge amount of computation in hardware implementation. Therefore, this work proposes a novel design concept of data-flow updating to assist the pipelined VLSI implementation. The SVD processor can greatly improve the feasibility of real-time EEG system applications such as brain-computer interfaces (BCIs). The proposed architecture is implemented using TSMC 90 nm CMOS technology. The sample rate of the raw EEG data is 128 Hz. The core size of the SVD processor is 580×580 μm², and the operating frequency is 20 MHz. It consumes 0.774 mW of power during each execution of the 8-channel EEG system.

  7. Intel Cilk Plus for Complex Parallel Algorithms: "Enormous Fast Fourier Transform" (EFFT) Library

    OpenAIRE

    Asai, Ryo; Vladimirov, Andrey

    2014-01-01

    In this paper we demonstrate the methodology for parallelizing the computation of large one-dimensional discrete fast Fourier transforms (DFFTs) on multi-core Intel Xeon processors. DFFTs based on the recursive Cooley-Tukey method have to control cache utilization, memory bandwidth and vector hardware usage, and at the same time scale across multiple threads or compute nodes. Our method builds on single-threaded Intel Math Kernel Library (MKL) implementation of DFFT, and uses the Intel Cilk P...
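As a minimal illustration of the recursive Cooley-Tukey decomposition on which such DFFT libraries are built (a sketch only; the paper's implementation wraps Intel MKL and parallelizes with Cilk Plus, neither of which appears here):

```python
import cmath

def fft(x):
    # radix-2 recursive Cooley-Tukey DFT; len(x) must be a power of two
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])       # DFT of even-indexed samples
    odd = fft(x[1::2])        # DFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # combine with the twiddle factor e^(-2*pi*i*k/n)
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out
```

The recursion exposes the cache and parallelism structure the paper exploits: each half-size subtransform touches a contiguous working set and can be scheduled on a separate core.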

  8. A GPU-based real time high performance computing service in a fast plant system controller prototype for ITER

    Energy Technology Data Exchange (ETDEWEB)

    Nieto, J., E-mail: jnieto@sec.upm.es [Grupo de Investigacion en Instrumentacion y Acustica Aplicada. Universidad Politecnica de Madrid, Crta. Valencia Km-7, Madrid 28031 Spain (Spain); Arcas, G. de; Ruiz, M. [Grupo de Investigacion en Instrumentacion y Acustica Aplicada. Universidad Politecnica de Madrid, Crta. Valencia Km-7, Madrid 28031 Spain (Spain); Vega, J. [Asociacion EURATOM/CIEMAT para Fusion, Madrid (Spain); Lopez, J.M.; Barrera, E. [Grupo de Investigacion en Instrumentacion y Acustica Aplicada. Universidad Politecnica de Madrid, Crta. Valencia Km-7, Madrid 28031 Spain (Spain); Castro, R. [Asociacion EURATOM/CIEMAT para Fusion, Madrid (Spain); Sanz, D. [Grupo de Investigacion en Instrumentacion y Acustica Aplicada. Universidad Politecnica de Madrid, Crta. Valencia Km-7, Madrid 28031 Spain (Spain); Utzel, N.; Makijarvi, P.; Zabeo, L. [ITER Organization, CS 90 046, 13067 St. Paul lez Durance Cedex (France)

    2012-12-15

    Highlights: ► Implementation of a fast plant system controller (FPSC) for ITER CODAC. ► GPU-based real-time high-performance computing service. ► Performance evaluation with respect to other solutions based on multi-core processors. - Abstract: EURATOM/CIEMAT and the Technical University of Madrid (UPM) are involved in the development of a FPSC (fast plant system controller) prototype for ITER based on the PXIe form factor. The FPSC architecture includes a GPU-based real-time high-performance computing service which has been integrated under EPICS (Experimental Physics and Industrial Control System). In this work we present the design of this service and its performance evaluation with respect to other solutions based on multi-core processors. Plasma pre-processing algorithms, illustrative of the type of tasks that could be required for both control and diagnostics, are used during the performance evaluation.

  9. Improved Thermal Reconstruction Method Based on Dynamic Voronoi Diagram with Non-uniform Sampling on Multicore Processors

    Institute of Scientific and Technical Information of China (English)

    李鑫; 戎蒙恬; 刘涛; 周亮

    2013-01-01

    The low accuracy of hot-spot temperature estimation on microprocessors can lead to a higher probability of false alarms and unnecessary responses, which reduces the reliability of computer systems. In this paper, an improved thermal reconstruction method based on a dynamic Voronoi diagram with non-uniform sampling on multicore processors is proposed. Experimental results indicate that the proposed method significantly outperforms spectral analysis techniques in both average thermal reconstruction error and hot-spot temperature error. It can therefore be better applied in dynamic thermal management techniques to achieve accurate global and local thermal monitoring.
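The Voronoi idea behind the method can be sketched as nearest-sensor interpolation: each die location inherits the reading of the sensor whose Voronoi cell contains it. A toy Python version follows (illustrative names only; the paper's dynamic diagram updates and non-uniform sample placement are omitted):

```python
def voronoi_reconstruct(sensors, grid_w, grid_h):
    # sensors: dict mapping sensor position (x, y) -> temperature reading
    # assign every grid cell the reading of its nearest sensor, i.e. the
    # sensor whose Voronoi cell the grid point falls into
    field = [[0.0] * grid_w for _ in range(grid_h)]
    for y in range(grid_h):
        for x in range(grid_w):
            nearest = min(sensors,
                          key=lambda s: (s[0] - x) ** 2 + (s[1] - y) ** 2)
            field[y][x] = sensors[nearest]
    return field
```

For example, with sensors at two corners of a 4×4 grid, every cell takes on the reading of whichever corner sensor is closer.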

  10. High-Efficient Parallel CAVLC Encoders on Heterogeneous Multicore Architectures

    Directory of Open Access Journals (Sweden)

    H. Y. Su

    2012-04-01

    Full Text Available This article presents two high-efficiency parallel realizations of context-based adaptive variable length coding (CAVLC) on heterogeneous multicore processors. By optimizing the architecture of the CAVLC encoder, three kinds of dependences are eliminated or weakened: the context-based data dependence, the memory-access dependence and the control dependence. The CAVLC pipeline is divided into three stages (two scans, coding, and lag packing) and is implemented on two typical heterogeneous multicore architectures. One is a block-based SIMD parallel CAVLC encoder on the multicore stream processor STORM. The other is a component-oriented SIMT parallel encoder on the massively parallel GPU architecture. Both of them exploit rich data-level parallelism. Experimental results show that, compared with the CPU version, a speedup of more than 70 times is obtained on STORM and over 50 times on the GPU. The STORM implementation achieves real-time processing for 1080p@30fps, and the GPU-based version satisfies the requirements for 720p real-time encoding. The throughput of the presented CAVLC encoders is more than 10 times higher than that of published software encoders on DSP and multicore platforms.

  11. High-throughput Bayesian Network Learning using Heterogeneous Multicore Computers.

    Science.gov (United States)

    Linderman, Michael D; Athalye, Vivek; Meng, Teresa H; Asadi, Narges Bani; Bruggner, Robert; Nolan, Garry P

    2010-06-01

    Aberrant intracellular signaling plays an important role in many diseases. The causal structure of signal transduction networks can be modeled as Bayesian Networks (BNs), and computationally learned from experimental data. However, learning the structure of Bayesian Networks (BNs) is an NP-hard problem that, even with fast heuristics, is too time consuming for large, clinically important networks (20-50 nodes). In this paper, we present a novel graphics processing unit (GPU)-accelerated implementation of a Monte Carlo Markov Chain-based algorithm for learning BNs that is up to 7.5-fold faster than current general-purpose processor (GPP)-based implementations. The GPU-based implementation is just one of several implementations within the larger application, each optimized for a different input or machine configuration. We describe the methodology we use to build an extensible application, assembled from these variants, that can target a broad range of heterogeneous systems, e.g., GPUs, multicore GPPs. Specifically we show how we use the Merge programming model to efficiently integrate, test and intelligently select among the different potential implementations.

  12. Multi-Core Technology for Fault-Tolerant High-Performance Spacecraft Computer Systems

    Science.gov (United States)

    Behr, Peter M.; Haulsen, Ivo; Van Kampenhout, J. Reinier; Pletner, Samuel

    2012-08-01

    The current architectural trends in the field of multi-core processors can provide an enormous increase in processing power by exploiting the parallelism available in many applications. In particular because of their high energy efficiency, it is obvious that multi-core processor-based systems will also be used in future space missions. In this paper we present the system architecture of a powerful optical sensor system based on the eight core multi-core processor P4080 from Freescale. The fault tolerant structure and the highly effective FDIR concepts implemented on different hardware and software levels of the system are described in detail. The space application scenario and thus the main requirements for the sensor system have been defined by a complex tracking sensor application for autonomous landing or docking manoeuvres.

  13. Multicore Challenges and Benefits for High Performance Scientific Computing

    Directory of Open Access Journals (Sweden)

    Ida M.B. Nielsen

    2008-01-01

    Full Text Available Until recently, performance gains in processors were achieved largely by improvements in clock speeds and instruction level parallelism. Thus, applications could obtain performance increases with relatively minor changes by upgrading to the latest generation of computing hardware. Currently, however, processor performance improvements are realized by using multicore technology and hardware support for multiple threads within each core, and taking full advantage of this technology to improve the performance of applications requires exposure of extreme levels of software parallelism. We will here discuss the architecture of parallel computers constructed from many multicore chips as well as techniques for managing the complexity of programming such computers, including the hybrid message-passing/multi-threading programming model. We will illustrate these ideas with a hybrid distributed memory matrix multiply and a quantum chemistry algorithm for energy computation using Møller–Plesset perturbation theory.
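A toy sketch of the block decomposition behind the hybrid distributed-memory matrix multiply mentioned above. For portability this sketch uses a thread pool to stand in for both levels of the hybrid model; in the paper's setting the outer level would be MPI ranks and the inner level per-node threads (all names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def matmul_block(args):
    # one worker computes a contiguous block of C's rows; in the hybrid
    # model this would be an MPI rank, itself multithreading the block
    a_block, b = args
    return [[sum(a_row[k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for a_row in a_block]

def hybrid_matmul(a, b, workers=2):
    # split A's rows into one contiguous block per worker
    chunk = (len(a) + workers - 1) // workers
    blocks = [(a[i:i + chunk], b) for i in range(0, len(a), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as ex:
        parts = list(ex.map(matmul_block, blocks))
    # concatenate the row blocks back into the full result
    return [row for part in parts for row in part]
```

The row-block partitioning is what makes the message-passing layer simple: each rank needs only its slice of A plus a copy of B, and results concatenate without communication between workers.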

  14. A Highly Parallel FPGA Implementation of a 2D-Clustering Algorithm for the ATLAS Fast TracKer (FTK) Processor

    CERN Document Server

    Kimura, N; The ATLAS collaboration; Beretta, M; Gatta, M; Gkaitatzis, S; Iizawa, T; Kordas, K; Korikawa, T; Nikolaidis, N; Petridou, P; Sotiropoulou, C-L; Yorita, K; Volpi, G

    2014-01-01

    The highly parallel 2D-clustering FPGA implementation used for the input system of the ATLAS Fast TracKer (FTK) processor is presented. The input system for the FTK processor will receive data from the Pixel and micro-strip detector read-out drivers (RODs) at 760 Gbps, the full rate of level-1 triggers. Clustering serves two purposes. The first is to reduce the high rate of the received data before further processing. The second is to determine the cluster centroid to obtain the best spatial measurement. For the pixel detectors the clustering is implemented by using a 2D-clustering algorithm that takes advantage of a moving window technique to minimize the logic required for cluster identification. The implementation is fully generic, therefore the detection window size can be optimized for the cluster identification process. Additionally, the implementation can be parallelized by instantiating multiple cores to identify different clusters independently thus exploiting more FPGA resources. This flexibility ma...
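The clustering goal described, grouping adjacent pixel hits and computing each cluster's centroid, can be sketched in software as a connected-component search. This is a toy stand-in for the FPGA moving-window implementation, assuming 8-connectivity between hit pixels (the function name and interface are illustrative):

```python
def cluster_hits(hits):
    # group adjacent (8-connected) pixel hits into clusters and return
    # each cluster's centroid, the spatial measurement the FTK input
    # system needs
    hits = set(hits)
    clusters = []
    while hits:
        stack = [hits.pop()]
        comp = []
        while stack:                      # flood-fill one cluster
            x, y = stack.pop()
            comp.append((x, y))
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    n = (x + dx, y + dy)
                    if n in hits:
                        hits.remove(n)
                        stack.append(n)
        cx = sum(p[0] for p in comp) / len(comp)
        cy = sum(p[1] for p in comp) / len(comp)
        clusters.append((cx, cy))
    return clusters
```

The hardware version achieves the same grouping with a sliding detection window rather than an explicit flood fill, trading generality for bounded logic per clock cycle.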

  15. Multiple core computer processor with globally-accessible local memories

    Energy Technology Data Exchange (ETDEWEB)

    Shalf, John; Donofrio, David; Oliker, Leonid

    2016-09-20

    A multi-core computer processor including a plurality of processor cores interconnected in a Network-on-Chip (NoC) architecture, a plurality of caches, each of the plurality of caches being associated with one and only one of the plurality of processor cores, and a plurality of memories, each of the plurality of memories being associated with a different set of at least one of the plurality of processor cores and each of the plurality of memories being configured to be visible in a global memory address space such that the plurality of memories are visible to two or more of the plurality of processor cores.

  16. High-density multicore fibers

    DEFF Research Database (Denmark)

    Takenaga, K.; Matsuo, S.; Saitoh, K.;

    2016-01-01

    High-density single-mode multicore fibers were designed and fabricated. A heterogeneous 30-core fiber realized a low crosstalk of −55 dB. A quasi-single-mode homogeneous 31-core fiber attained the highest core count as a single-mode multicore fiber....

  17. Fast Data Reconstruction Method for a Fourier Transform Imaging Spectrometer Based on Multi-core CPU

    Institute of Scientific and Technical Information of China (English)

    杨智雄; 余春超; 严敏; 郑为建; 雷正刚; 粟宇路

    2014-01-01

    An imaging spectrometer can acquire a two-dimensional spatial image and a one-dimensional spectrum at the same time, which makes it highly useful in color and spectral measurement, true-color image synthesis, military reconnaissance and other fields. To achieve fast reconstruction of Fourier transform imaging spectrometer data, this paper designs an optimized reconstruction algorithm using OpenMP parallel computing, which was applied to the data processing of the Hyper Spectral Imager on the Chinese 'HJ-1' satellite. The results show that the multi-core parallel computing approach can effectively exploit the hardware resources of a multi-core CPU and significantly improve the efficiency of the spectrum reconstruction process. If the technique is applied to a workstation with more cores, real-time processing of Fourier transform imaging spectrometer data on a single computer becomes feasible.
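The per-pixel work being parallelized is, at its core, a Fourier transform of each pixel's interferogram. A toy sketch follows, with a thread pool standing in for the OpenMP worker threads (illustrative names; a magnitude DFT is used here for simplicity, whereas a production pipeline would also apply apodization and phase correction):

```python
import cmath
from concurrent.futures import ThreadPoolExecutor

def spectrum(interferogram):
    # magnitude DFT of one pixel's interferogram -> its reconstructed spectrum
    n = len(interferogram)
    return [abs(sum(interferogram[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]

def reconstruct_cube(pixels, workers=4):
    # reconstruct many pixels in parallel; each pixel is independent, the
    # same data parallelism OpenMP exploits on a multi-core CPU
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(spectrum, pixels))
```

Because every pixel's transform is independent, the speedup scales with core count until memory bandwidth saturates, which matches the workstation scaling the abstract anticipates.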

  18. A Highly Parallel FPGA Implementation of a 2D-Clustering Algorithm for the ATLAS Fast TracKer (FTK) Processor

    CERN Document Server

    Kimura, N; The ATLAS collaboration; Beretta, M; Gatta, M; Gkaitatzis, S; Iizawa, T; Kordas, K; Korikawa, T; Nikolaidis, S; Petridou, C; Sotiropoulou, C-L; Yorita, K; Volpi, G

    2014-01-01

    The highly parallel 2D-clustering FPGA implementation used for the input system of the Fast TracKer (FTK) processor for the ATLAS experiment at the Large Hadron Collider (LHC) at CERN is presented. After the 2013-2014 shutdown period the LHC is expected to increase its luminosity, which will make efficient online selection of rare events more difficult due to the increase in overlapping collisions. FTK is a highly parallelized hardware system that improves online selection by real-time track finding using silicon inner detector information. The FTK system requires fast and robust clustering of hit positions from the silicon detector on FPGAs. We show the development of the original input boards and the implemented clustering algorithm. For the complicated 2D clustering, a moving-window technique is used to economize the limited FPGA resources. The developed boards and the implementation of the clustering algorithm have sufficient processing power to meet the specification for the silicon inner detector of ATLAS for the maximum LH...

  19. Using the automata processor for fast pattern recognition in high energy physics experiments-A proof of concept

    Science.gov (United States)

    Wang, Michael H. L. S.; Cancelo, Gustavo; Green, Christopher; Guo, Deyuan; Wang, Ke; Zmuda, Ted

    2016-10-01

    We explore the Micron Automata Processor (AP) as a suitable commodity technology that can address the growing computational needs of pattern recognition in High Energy Physics (HEP) experiments. A toy detector model is developed for which an electron track confirmation trigger based on the Micron AP serves as a test case. Although primarily meant for high speed text-based searches, we demonstrate a proof of concept for the use of the Micron AP in a HEP trigger application.

  20. Using the automata processor for fast pattern recognition in high energy physics experiments—A proof of concept

    Energy Technology Data Exchange (ETDEWEB)

    Wang, Michael H.L.S., E-mail: mwang@fnal.gov [Fermi National Accelerator Laboratory, Batavia, IL 60510 (United States); Cancelo, Gustavo; Green, Christopher [Fermi National Accelerator Laboratory, Batavia, IL 60510 (United States); Guo, Deyuan; Wang, Ke [University of Virginia, Charlottesville, VA 22904 (United States); Zmuda, Ted [Fermi National Accelerator Laboratory, Batavia, IL 60510 (United States)

    2016-10-01

    We explore the Micron Automata Processor (AP) as a suitable commodity technology that can address the growing computational needs of pattern recognition in High Energy Physics (HEP) experiments. A toy detector model is developed for which an electron track confirmation trigger based on the Micron AP serves as a test case. Although primarily meant for high speed text-based searches, we demonstrate a proof of concept for the use of the Micron AP in a HEP trigger application.

  1. The EDRO board connected to the Associative Memory: a “Baby” FastTracKer processor for the ATLAS experiment

    CERN Document Server

    Annovi, A; The ATLAS collaboration; Bevacqua, V; Cervigni, F; Crescioli, F; Fabbri, L; Giannetti, P; Giorgi, F; Magalotti, D; Negri, A; Piendibene, M; Sbarra, C; Roda, C; Villa, M; Vitillo, R A; Volpi, G

    2011-01-01

    The FastTracKer (FTK) is a dedicated hardware system able to perform online fast and precise track reconstruction of full events in the Atlas experiment within an average latency of a few dozen microseconds. It consists of two pipelined processors: the Associative Memory (AM), which finds low precision tracks called “roads”, and the Track Fitter (TF), which refines the track quality with high precision fits. The FTK design [1] that works well at the Large Hadron Collider (LHC) Phase I upgrade luminosity requires the best of the available technology for tracking in a high occupancy environment. While the new processor is designed for the most demanding LHC conditions, we will begin with existing prototypes, some developed for the SLIM5 collaboration [2], to exercise the FTK functions in the new Atlas environment. The goal is to learn early about the FTK integration in the Atlas TDAQ. The EDRO board (Event Dispatch and Read-Out) receives on a clustering mezzanine (able to calculate the pixel and SCT cluster...

  2. VERTAF/Multi-Core: A SysML-Based Application Framework for Multi-Core Embedded Software Development

    Institute of Scientific and Technical Information of China (English)

    Chao-Sheng Lin; Chun-Hsien Lu; Shang-Wei Lin; Yean-Ru Chen; Pao-Ann Hsiung

    2011-01-01

    Multi-core processors are rapidly becoming prevalent in personal computing and embedded systems. Nevertheless, the programming environment for multi-core processor-based systems is still quite immature and lacks efficient tools. In this work, we present a new VERTAF/Multi-Core framework and show how software code can be automatically generated from SysML models of multi-core embedded systems. We illustrate how model-driven design based on SysML can be seamlessly integrated with Intel's threading building blocks (TBB) and the quantum framework (QF) middleware. We use a digital video recording system to illustrate the benefits of the framework. Our experiments show how SysML/QF/TBB help in making multi-core embedded system programming model-driven, easy, and efficient.

  3. Silicon photonics for multicore fiber communication

    DEFF Research Database (Denmark)

    Ding, Yunhong; Kamchevska, Valerija; Dalgaard, Kjeld

    2016-01-01

    We review our recent work on silicon photonics for multicore fiber communication, including multicore fiber fan-in/fan-out, multicore fiber switches towards reconfigurable optical add/drop multiplexers. We also present multicore fiber based quantum communication using silicon devices....

  4. Fuzzy logic based power-efficient real-time multi-core system

    CERN Document Server

    Ahmed, Jameel; Najam, Shaheryar; Najam, Zohaib

    2017-01-01

    This book focuses on identifying the performance challenges involved in computer architectures, optimal configuration settings and analysing their impact on the performance of multi-core architectures. Proposing a power- and throughput-aware fuzzy-logic-based reconfiguration for Multi-Processor Systems on Chip (MPSoCs) in both simulation and real-time environments, it is divided into two major parts. The first part deals with the simulation-based power- and throughput-aware fuzzy logic reconfiguration for multi-core architectures, presenting the results of a detailed analysis on the factors impacting the power consumption and performance of MPSoCs. In turn, the second part highlights the real-time implementation of fuzzy-logic-based power-efficient reconfigurable multi-core architectures for Intel and Leon3 processors.

  5. Hybrid Parallelism for Volume Rendering on Large, Multi-core Systems

    Science.gov (United States)

    Howison, M.; Bethel, E. W.; Childs, H.

    2011-10-01

    This work studies the performance and scalability characteristics of "hybrid" parallel programming and execution as applied to raycasting volume rendering - a staple visualization algorithm - on a large, multi-core platform. Historically, the Message Passing Interface (MPI) has become the de-facto standard for parallel programming and execution on modern parallel systems. As the computing industry trends towards multi-core processors, with four- and six-core chips common today, as well as processors capable of running hundreds of concurrent threads (GPUs), we wish to better understand how algorithmic and parallel programming choices impact performance and scalability on large, distributed-memory multi-core systems. Our findings indicate that the hybrid-parallel implementation, at levels of concurrency ranging from 1,728 to 216,000, performs better, uses a smaller absolute memory footprint, and consumes less communication bandwidth than the traditional, MPI-only implementation.

  6. Energy Efficient Multi-Core Processing

    Directory of Open Access Journals (Sweden)

    Charles Leech

    2014-06-01

    Full Text Available This paper evaluates the present state of the art of energy-efficient embedded processor design techniques and demonstrates how small, variable-architecture embedded processors may exploit a run-time minimal architectural synthesis technique to achieve greater energy and area efficiency whilst maintaining performance. The picoMIPS architecture is presented, inspired by the MIPS, as an example of a minimal and energy-efficient processor. The picoMIPS is a variable-architecture RISC microprocessor with an application-specific minimised instruction set. Each implementation contains only the necessary datapath elements in order to maximise area efficiency. Due to the relationship between logic gate count and power consumption, energy efficiency is also maximised; the system is therefore designed to perform a specific task in the most efficient processor-based form. The principles of the picoMIPS processor are illustrated with an example of the discrete cosine transform (DCT) and inverse DCT (IDCT) algorithms implemented in a multi-core context to demonstrate the concept of minimal architecture synthesis and how it can be used to produce an application-specific, energy-efficient processor.
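The DCT/IDCT pair used as the example application can be sketched directly from its definition. This is a naive floating-point reference version, not the picoMIPS hardware mapping; scaling conventions follow the common DCT-II/DCT-III pairing:

```python
import math

def dct2(x):
    # forward DCT-II: X_k = sum_t x_t * cos(pi*k*(2t+1) / 2N)
    n = len(x)
    return [sum(x[t] * math.cos(math.pi * k * (2 * t + 1) / (2 * n))
                for t in range(n))
            for k in range(n)]

def idct2(X):
    # matching inverse (DCT-III with the usual 2/N scaling):
    # x_t = (X_0/2 + sum_{k>=1} X_k * cos(pi*k*(2t+1) / 2N)) * 2/N
    n = len(X)
    return [(X[0] / 2 + sum(X[k] * math.cos(math.pi * k * (2 * t + 1) / (2 * n))
                            for k in range(1, n))) * 2 / n
            for t in range(n)]
```

In the minimal-architecture setting, each of these cosine-weighted sums becomes a fixed sequence of shifts and adds per implementation, which is precisely where trimming the datapath to the application pays off.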

  7. Concurrent Programming in Mac OS X and iOS Unleash Multicore Performance with Grand Central Dispatch

    CERN Document Server

    Nahavandipoor, Vandad

    2011-01-01

    Now that multicore processors are coming to mobile devices, wouldn't it be great to take advantage of all those cores without having to manage threads? This concise book shows you how to use Apple's Grand Central Dispatch (GCD) to simplify programming on multicore iOS devices and Mac OS X. Managing your application's resources on more than one core isn't easy, but it's vital. Apps that use only one core in a multicore environment will slow to a crawl. If you know how to program with Cocoa or Cocoa Touch, this guide will get you started with GCD right away, with many examples to help you write...
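GCD itself is used from Objective-C or Swift; as a language-neutral illustration of its core abstraction, a serial dispatch queue (submitted blocks run one at a time, in submission order, off the caller's own thread) can be mimicked like this. The class and method names are illustrative, not Apple's API:

```python
import queue
import threading

class SerialQueue:
    # rough analogue of a GCD serial dispatch queue: tasks execute one
    # at a time, in FIFO order, on a single background worker thread
    def __init__(self):
        self._q = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            task, done = self._q.get()
            task()          # tasks never overlap: one worker, FIFO order
            done.set()

    def dispatch_sync(self, task):
        # like dispatch_sync: enqueue the block and wait for completion
        done = threading.Event()
        self._q.put((task, done))
        done.wait()
```

The point of the abstraction is that callers reason about queues and ordering rather than locks and threads; a concurrent queue would simply hand tasks to a pool instead of a single worker.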

  8. Multi-core: Adding a New Dimension to Computing

    CERN Document Server

    Amin, Md Tanvir Al

    2010-01-01

    Invention of Transistors in 1948 started a new era in technology, called Solid State Electronics. Since then, sustaining development and advancement in electronics and fabrication techniques has caused the devices to shrink in size and become smaller, paving the quest for increasing density and clock speed. That quest has suddenly come to a halt due to fundamental bounds applied by physical laws. But, demand for more and more computational power is still prevalent in the computing world. As a result, the microprocessor industry has started exploring the technology along a different dimension. Speed of a single work unit (CPU) is no longer the concern, rather increasing the number of independent processor cores packed in a single package has become the new concern. Such processors are commonly known as multi-core processors. Scaling the performance by using multiple cores has gained so much attention from the academia and the industry, that not only desktops, but also laptops, PDAs, cell phones and even embedd...

  9. Spaceborne Processor Array

    Science.gov (United States)

    Chow, Edward T.; Schatzel, Donald V.; Whitaker, William D.; Sterling, Thomas

    2008-01-01

    A Spaceborne Processor Array in Multifunctional Structure (SPAMS) can lower the total mass of the electronic and structural overhead of spacecraft, resulting in reduced launch costs, while increasing the science return through dynamic onboard computing. SPAMS integrates the multifunctional structure (MFS) and the Gilgamesh Memory, Intelligence, and Network Device (MIND) multi-core in-memory computer architecture into a single-system super-architecture. This transforms every inch of a spacecraft into a sharable, interconnected, smart computing element to increase computing performance while simultaneously reducing mass. The MIND in-memory architecture provides a foundation for high-performance, low-power, and fault-tolerant computing. The MIND chip has an internal structure that includes memory, processing, and communication functionality. The Gilgamesh is a scalable system comprising multiple MIND chips interconnected to operate as a single, tightly coupled, parallel computer. The array of MIND components shares a global, virtual name space for program variables and tasks that are allocated at run time to the distributed physical memory and processing resources. Individual processor-memory nodes can be activated or powered down at run time to provide active power management and to configure around faults. A SPAMS system comprises a distributed Gilgamesh array built into MFS, interfaces into instrument and communication subsystems, a mass storage interface, and a radiation-hardened flight computer.

  10. Multicore Programming Challenges

    Science.gov (United States)

    Perrone, Michael

    The computer industry is facing fundamental challenges that are driving a major change in the design of computer processors. Due to restrictions imposed by quantum physics, one historical path to higher computer processor performance - by increased clock frequency - has come to an end. Increasing clock frequency now leads to power consumption costs that are too high to justify. As a result, we have seen in recent years that the processor frequencies have peaked and are receding from their high point. At the same time, competitive market conditions are giving business advantage to those companies that can field new streaming applications, handle larger data sets, and update their models to market conditions faster. The desire for newer, faster and larger is driving continued demand for higher computer performance.

  11. Hardware Implementation of a Genetic Algorithm Based Canonical Signed Digit Multiplierless Fast Fourier Transform Processor for Multiband Orthogonal Frequency Division Multiplexing Ultra Wideband Applications

    Directory of Open Access Journals (Sweden)

    Mahmud Benhamid

    2009-01-01

    Full Text Available Problem statement: Ultra Wide Band (UWB) technology has attracted many researchers' attention due to its advantages and its great potential for future applications. The physical layer standard of the Multi-band Orthogonal Frequency Division Multiplexing (MB-OFDM) UWB system is defined by ECMA International. In this standard, the data sampling rate from the analog-to-digital converter to the physical layer is up to 528 Msamples/s. Therefore, it is a challenge to realize the physical layer, especially the components with high computational complexity, in a Very Large Scale Integration (VLSI) implementation. The Fast Fourier Transform (FFT) block, which plays an important role in the MB-OFDM system, is one of these components. Furthermore, the execution time of this module is only 312.5 ns. Therefore, with the traditional approach, high power consumption and hardware cost would be needed to meet the strict specifications of the UWB system. The objective of this study was to design an Application Specific Integrated Circuit (ASIC) FFT processor for this system. The specification was defined from system analysis and literature research. Approach: Based on algorithm and architecture analysis, a novel Genetic Algorithm (GA) based Canonical Signed Digit (CSD) multiplierless 128-point FFT processor and its inverse (IFFT) for MB-OFDM UWB systems was proposed. The proposed pipelined architecture was based on a modified Radix-2² algorithm that has the same number of multipliers as the conventional Radix-2². However, the multiplication complexity and the ROM memory needed for storing twiddle factor coefficients could be eliminated by replacing the conventional complex multipliers with newly proposed GA-optimized CSD constant multipliers. The design was coded in Verilog HDL and targeted the Xilinx Virtex-II FPGA series. It was fully implemented and tested on real hardware using a Virtex-II FG456 prototype board and logic analyzer
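    The shift-and-add idea behind CSD constant multipliers can be sketched in a few lines. The routine below is a generic CSD recoding (an illustration of the number representation only, not the paper's GA-optimized hardware): a constant is rewritten with digits in {-1, 0, 1} such that no two adjacent digits are non-zero, minimizing the add/subtract terms a multiplierless implementation needs.

```python
def to_csd(n):
    """Recode a positive integer into Canonical Signed Digit form:
    digits in {-1, 0, 1}, least-significant first, with no two
    adjacent non-zero digits."""
    digits = []
    while n != 0:
        if n % 2 == 0:
            digits.append(0)
        else:
            d = 2 - (n % 4)      # n % 4 == 1 -> +1, n % 4 == 3 -> -1
            digits.append(d)
            n -= d               # n - d is divisible by 4, forcing a 0 next
        n //= 2
    return digits

def from_csd(digits):
    """Evaluate a CSD digit list back to an integer."""
    return sum(d * (1 << i) for i, d in enumerate(digits))
```

    For example, to_csd(7) yields [-1, 0, 0, 1], i.e. 7 = 8 - 1, so x*7 becomes (x << 3) - x: one subtraction instead of the two additions that plain binary 111 would require.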

  12. Smart Multicore Embedded Systems

    DEFF Research Database (Denmark)

    This book provides a single-source reference to the state-of-the-art of high-level programming models and compilation tool-chains for embedded system platforms. The authors address challenges faced by programmers developing software to implement parallel applications in embedded systems, where very...... specificities of various embedded systems from different industries. Parallel programming tool-chains are described that take as input parameters both the application and the platform model, then determine relevant transformations and mapping decisions on the concrete platform, minimizing user intervention...... and hiding the difficulties related to the correct and efficient use of memory hierarchy and low level code generation. Describes tools and programming models for multicore embedded systems Emphasizes throughout performance per watt scalability Discusses realistic limits of software parallelization Enables...

  13. Multicore Education through Simulation

    Science.gov (United States)

    Ozturk, O.

    2011-01-01

    A project-oriented course for advanced undergraduate and graduate students is described for simulating multiple processor cores. Simics, a free simulator for academia, was utilized to enable students to explore computer architecture, operating systems, and hardware/software cosimulation. Motivation for including this course in the curriculum is…

  15. Multicore Based Open Loop Motor Controller Embedded System for Permanent Magnet Direct Current Motor

    Directory of Open Access Journals (Sweden)

    K. Baskaran

    2012-01-01

    Full Text Available Problem statement: In the modern electronics world, most applications are developed as microcontroller-based embedded systems. Approach: A multicore processor based motor controller was presented to improve the processing speed of the controller and improve the efficiency of the motor by maintaining constant speed. It was based on the combination of a Cortex processor (software core) and a Field Programmable Gate Array (FPGA, hardware core). This multicore combination helped to design an efficient, low-power motor controller. Results: A functional design of the Cortex processor and FPGA in this system was completed using the Actel Libero IDE and IAR embedded IDE software; the PWM signal to control the motor driver circuit was generated by the proposed processor. All function modules were programmed in Very-High-Speed Integrated Circuit Hardware Description Language (VHDL). The advantages of the proposed system were optimized operational performance and low power use. The multicore processor was used to improve the speed of execution and optimize the performance of the controller. Conclusion: A motor can be controlled by this method without knowing its architectural details. This is a low-cost, low-power controller that is easy to use. The simulation and experiment results verified its validity.

  16. FAST: framework for heterogeneous medical image computing and visualization.

    Science.gov (United States)

    Smistad, Erik; Bozorgi, Mohammadmehdi; Lindseth, Frank

    2015-11-01

    Computer systems are becoming increasingly heterogeneous in the sense that they consist of different processors, such as multi-core CPUs and graphic processing units. As the amount of medical image data increases, it is crucial to exploit the computational power of these processors. However, this is currently difficult due to several factors, such as driver errors, processor differences, and the need for low-level memory handling. This paper presents a novel FrAmework for heterogeneouS medical image compuTing and visualization (FAST). The framework aims to make it easier to simultaneously process and visualize medical images efficiently on heterogeneous systems. FAST uses common image processing programming paradigms and hides the details of memory handling from the user, while enabling the use of all processors and cores on a system. The framework is open-source, cross-platform and available online. Code examples and performance measurements are presented to show the simplicity and efficiency of FAST. The results are compared to the Insight Toolkit (ITK) and the Visualization Toolkit (VTK) and show that the presented framework is faster with up to 20 times speedup on several common medical imaging algorithms. FAST enables efficient medical image computing and visualization on heterogeneous systems. Code examples and performance evaluations have demonstrated that the toolkit is both easy to use and performs better than existing frameworks, such as ITK and VTK.

  17. Development of a Fast, Single-pass, Micron-resolution Beam Position Monitor Signal Processor: Beam Test Results from ATF2

    CERN Document Server

    Apsimon, Robert; Burrows, Philip; Christian, Glenn; Constance, Ben; Dabiri Khah, Hamid; Perry, Colin; Resta Lopez, Javier; Swinson, Christina

    2010-01-01

    We present the design of a stripline beam position monitor (BPM) signal processor with low latency (c. 10ns) and micron-level spatial resolution in single-pass mode. Such a BPM processor has applications in single-pass beamlines such as those at linear colliders and FELs. The processor was deployed and tested at the Accelerator Test Facility (ATF2) extraction line at KEK, Japan. We report the beam test results and processor performance, including response, linearity, spatial resolution and latency.

  18. Modern multicore and manycore architectures: Modelling, optimisation and benchmarking a multiblock CFD code

    Science.gov (United States)

    Hadade, Ioan; di Mare, Luca

    2016-08-01

    Modern multicore and manycore processors exhibit multiple levels of parallelism through a wide range of architectural features such as SIMD for data parallel execution or threads for core parallelism. The exploitation of multi-level parallelism is therefore crucial for achieving superior performance on current and future processors. This paper presents the performance tuning of a multiblock CFD solver on Intel SandyBridge and Haswell multicore CPUs and the Intel Xeon Phi Knights Corner coprocessor. Code optimisations have been applied on two computational kernels exhibiting different computational patterns: the update of flow variables and the evaluation of the Roe numerical fluxes. We discuss at great length the code transformations required for achieving efficient SIMD computations for both kernels across the selected devices including SIMD shuffles and transpositions for flux stencil computations and global memory transformations. Core parallelism is expressed through threading based on a number of domain decomposition techniques together with optimisations pertaining to alleviating NUMA effects found in multi-socket compute nodes. Results are correlated with the Roofline performance model in order to assert their efficiency for each distinct architecture. We report significant speedups for single thread execution across both kernels: 2-5X on the multicore CPUs and 14-23X on the Xeon Phi coprocessor. Computations at full node and chip concurrency deliver a factor of three speedup on the multicore processors and up to 24X on the Xeon Phi manycore coprocessor.

  19. Thread mapping using system-level model for shared memory multicores

    Science.gov (United States)

    Mitra, Reshmi

    Exploring thread-to-core mapping options for a parallel application on a multicore architecture is computationally very expensive. For the same algorithm, the mapping strategy (MS) with the best response time may change with data size and thread counts. The primary challenge is to design a fast, accurate and automatic framework for exploring these MSs for large data-intensive applications. This is to ensure that the users can explore the design space within reasonable machine hours, without thorough understanding on how the code interacts with the platform. Response time is related to the cycles per instructions retired (CPI), taking into account both active and sleep states of the pipeline. This work establishes a hybrid approach, based on Markov Chain Model (MCM) and Model Tree (MT) for system-level steady state CPI prediction. It is designed for shared memory multicore processors with coarse-grained multithreading. The thread status is represented by the MCM states. The program characteristics are modeled as the transition probabilities, representing the system moving between active and suspended thread states. The MT model extrapolates these probabilities for the actual application size (AS) from the smaller AS performance. This aspect of the framework, along with, the use of mathematical expressions for the actual AS performance information, results in a tremendous reduction in the CPI prediction time. The framework is validated using an electromagnetics application. The average performance prediction error for steady state CPI results with 12 different MSs is less than 1%. The total run time of model is of the order of minutes, whereas the actual application execution time is in terms of days.

  20. Power Dissipation Challenges in Multicore Floating-Point Units

    DEFF Research Database (Denmark)

    Liu, Wei; Nannarelli, Alberto

    2010-01-01

    With increased densities on chips and the growing popularity of multicore processors and general-purpose graphics processing units (GPGPUs) power dissipation and energy consumption pose a serious challenge in the design of system-on-chips (SoCs) and a rise in costs for heat removal. In this work......, we analyze the impact of power dissipation in floating-point (FP) units and we consider different alternatives in the implementation of FP-division that lead to substantial energy savings. We compare the implementation of division in a Fused Multiply-Add (FMA) unit based on the Newton...

  1. The Multi-Core Era - Trends and Challenges

    CERN Document Server

    Tröger, Peter

    2008-01-01

    Since the very beginning of hardware development, computer processors were invented with ever-increasing clock frequencies and sophisticated in-build optimization strategies. Due to physical limitations, this 'free lunch' of speedup has come to an end. The following article gives a summary and bibliography for recent trends and challenges in CMP architectures. It discusses how 40 years of parallel computing research need to be considered in the upcoming multi-core era. We argue that future research must be driven from two sides - a better expression of hardware structures, and a domain-specific understanding of software parallelism.

  2. Synthetic Aperture Sequential Beamforming implemented on multi-core platforms

    DEFF Research Database (Denmark)

    Kjeldsen, Thomas; Lassen, Lee; Hemmsen, Martin Christian

    2014-01-01

    This paper compares several computational ap- proaches to Synthetic Aperture Sequential Beamforming (SASB) targeting consumer level parallel processors such as multi-core CPUs and GPUs. The proposed implementations demonstrate that ultrasound imaging using SASB can be executed in real- time...... with a significant headroom for post-processing. The CPU implementations are optimized using Single Instruction Multiple Data (SIMD) instruction extensions and multithreading, and the GPU computations are performed using the APIs, OpenCL and OpenGL. The implementations include refocusing (dynamic focusing) of a set...

  3. Note: High resolution ultra fast high-power pulse generator for inductive load using digital signal processor.

    Science.gov (United States)

    Flaxer, Eli

    2014-08-01

    We present a new design of a compact, ultra-fast, high-resolution, high-power pulse generator for inductive loads, using a power MOSFET, a dedicated gate driver and a digital signal controller. This design is an improved circuit of our previous controller. We demonstrate the performance of this pulse generator as a driver for a new generation of high-pressure supersonic pulsed valves.

  4. Decimal Engine for Energy-Efficient Multicore Processors

    DEFF Research Database (Denmark)

    Nannarelli, Alberto

    2014-01-01

    propose a hybrid BFP/DFP engine to perform BFP division and DFP addition, multiplication and division. The main purpose of this engine is to offload the binary floating-point units for this type of operations and reduce the latency for decimal operations, and power and temperature for the whole die....

  5. A full parallel radix sorting algorithm for multicore processors

    OpenAIRE

    Maus, Arne

    2011-01-01

    The problem addressed in this paper is that we want to sort an integer array a [] of length n on a multi core machine with k cores. Amdahl’s law tells us that the inherent sequential part of any algorithm will in the end dominate and limit the speedup we get from parallelisation of that algorithm. This paper introduces PARL, a parallel left radix sorting algorithm for use on ordinary shared memory multi core machines, that has just one simple statement in its sequential part. It can be seen a...
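    The structure of such an algorithm can be sketched in a few lines. This is a generic sketch, not the paper's PARL (and CPython threads will not give true core parallelism for CPU-bound work), but it shows the decomposition: a single sequential scatter by the leftmost digit, then fully independent per-bucket sorts.

```python
from concurrent.futures import ThreadPoolExecutor

def radix_sort(a, digit_bits=8):
    """Sequential LSD radix sort for non-negative integers."""
    if not a:
        return a
    mask = (1 << digit_bits) - 1
    shift, max_val = 0, max(a)
    while max_val >> shift:
        buckets = [[] for _ in range(1 << digit_bits)]
        for x in a:
            buckets[(x >> shift) & mask].append(x)
        a = [x for b in buckets for x in b]
        shift += digit_bits
    return a

def parallel_radix_sort(a, workers=4, digit_bits=8):
    """Scatter by most-significant digit (the one sequential pass),
    then sort the resulting buckets concurrently and concatenate."""
    if not a:
        return a
    top_shift = max(max(a).bit_length() - digit_bits, 0)
    buckets = [[] for _ in range(1 << digit_bits)]
    for x in a:
        buckets[x >> top_shift].append(x)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return [x for b in pool.map(radix_sort, buckets) for x in b]
```

    Because the buckets are already ordered by their most-significant digit, concatenating the independently sorted buckets yields a globally sorted array.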

  6. Scalable Parallelization of Skyline Computation for Multi-core Processors

    DEFF Research Database (Denmark)

    Chester, Sean; Sidlauskas, Darius; Assent, Ira;

    2015-01-01

    , which is used to minimize dominance tests while maintaining high throughput. The algorithm uses an efficiently-updatable data structure over the shared, global skyline, based on point-based partitioning. Also, we release a large benchmark of optimized skyline algorithms, with which we demonstrate...

  7. Multi-core Architectures and Streaming Applications

    NARCIS (Netherlands)

    Smit, Gerard J.M.; Kokkeler, André B.J.; Wolkotte, Pascal T.; Burgwal, van de Marcel D.; Mandoiu, I.; Kennings, A.

    2008-01-01

    In this paper we focus on algorithms and reconfigurable multi-core architectures for streaming digital signal processing (DSP) applications. The multi-core concept has a number of advantages: (1) depending on the requirements more or fewer cores can be switched on/off, (2) the multi-core structure f

  8. Multicore-Optimized Wavefront Diamond Blocking for Optimizing Stencil Updates

    KAUST Repository

    Malas, T.

    2015-07-02

    The importance of stencil-based algorithms in computational science has focused attention on optimized parallel implementations for multilevel cache-based processors. Temporal blocking schemes leverage the large bandwidth and low latency of caches to accelerate stencil updates and approach theoretical peak performance. A key ingredient is the reduction of data traffic across slow data paths, especially the main memory interface. In this work we combine the ideas of multicore wavefront temporal blocking and diamond tiling to arrive at stencil update schemes that show large reductions in memory pressure compared to existing approaches. The resulting schemes show performance advantages in bandwidth-starved situations, which are exacerbated by the high bytes per lattice update case of variable coefficients. Our thread groups concept provides a controllable trade-off between concurrency and memory usage, shifting the pressure between the memory interface and the CPU. We present performance results on a contemporary Intel processor.
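    The temporal-blocking idea (several time steps per sweep through memory) can be shown with a deliberately simple 1D sketch. This uses overlapped tiles with redundant halo computation rather than the wavefront diamond scheme described above, but it demonstrates the invariant that matters: after t local updates, a tile's interior matches t global sweeps exactly, while each element is streamed only once per t steps.

```python
def step(a):
    # one sweep of a 3-point averaging stencil; boundary cells held fixed
    return ([a[0]]
            + [(a[i - 1] + a[i] + a[i + 1]) / 3.0 for i in range(1, len(a) - 1)]
            + [a[-1]])

def naive(a, t):
    # t full sweeps: reads and writes the whole array every time step
    for _ in range(t):
        a = step(a)
    return a

def blocked(a, t, tile=16):
    # overlapped temporal blocking: load each tile plus a halo of width t,
    # advance it t steps locally, then keep only the uncorrupted interior
    n = len(a)
    out = [0.0] * n
    for s in range(0, n, tile):
        lo, hi = max(s - t, 0), min(s + tile + t, n)
        local = a[lo:hi]
        for _ in range(t):
            local = step(local)
        e = min(s + tile, n)
        out[s:e] = local[s - lo:e - lo]
    return out
```

    The halo width equals t because a boundary error propagates inward one cell per step; the diamond/wavefront schemes in the paper avoid this redundant halo work while keeping the same traffic reduction.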

  9. Understanding and Mitigating Multicore Performance Issues on theAMD Opteron Architecture

    Energy Technology Data Exchange (ETDEWEB)

    Levesque, John; Larkin, Jeff; Foster, Martyn; Glenski, Joe; Geissler, Garry; Whalen, Stephen; Waldecker, Brian; Carter, Jonathan; Skinner, David; He, Helen; Wasserman, Harvey; Shalf, John; Shan,Hongzhang; Strohmaier, Erich

    2007-03-07

    Over the past 15 years, microprocessor performance has doubled approximately every 18 months through increased clock rates and processing efficiency. In the past few years, clock frequency growth has stalled, and microprocessor manufacturers such as AMD have moved towards doubling the number of cores every 18 months in order to maintain historical growth rates in chip performance. This document investigates the ramifications of multicore processor technology on the new Cray XT4 systems based on AMD processor technology. We begin by walking through the AMD single-core, dual-core and upcoming quad-core processor architectures. This is followed by a discussion of methods for collecting performance counter data to understand code performance on the Cray XT3 and XT4 systems. We then use the performance counter data to analyze the impact of multicore processors on the performance of microbenchmarks such as STREAM, application kernels such as the NAS Parallel Benchmarks, and full application codes that comprise the NERSC-5 SSP benchmark suite. We explore compiler options and software optimization techniques that can mitigate the memory bandwidth contention that can reduce computing efficiency on multicore processors. The last section provides a case study of applying the dual-core optimizations to the NAS Parallel Benchmarks to dramatically improve their performance.

  10. Using Multicore Technologies to Speed Up Complex Simulations of Population Evolution

    Directory of Open Access Journals (Sweden)

    Mauricio Guevara-Souza

    2013-01-01

    Full Text Available We explore the use of multicore processing technologies for conducting simulations of population replacement of disease vectors. In our model, a native population of simulated vectors is inoculated with a small exogenous population of vectors that have been infected with the Wolbachia bacteria, which confers immunity to the disease. We conducted a series of computational simulations to study the conditions required by the invading population to take over the native population. Given the computational burden of this study, we decided to take advantage of modern multicore processor technologies to reduce the time required for the simulations. Overall, the results seem promising both in terms of the application and the use of multicore technologies.

  11. Optimization of automatically generated multi-core code for the LTE RACH-PD algorithm

    CERN Document Server

    Pelcat, Maxime; Nezan, Jean François

    2008-01-01

    Embedded real-time applications in communication systems require high processing power. Manual scheduling developed for single-processor applications is not suited to multi-core architectures. The Algorithm Architecture Matching (AAM) methodology optimizes static application implementation on multi-core architectures. The Random Access Channel Preamble Detection (RACH-PD) is an algorithm for non-synchronized access of Long Term Evolution (LTE) wireless networks. LTE aims to improve the spectral efficiency of the next generation cellular system. This paper describes a complete methodology for implementing the RACH-PD. AAM prototyping is applied to the RACH-PD, which is modelled as a Synchronous DataFlow graph (SDF). An efficient implementation of the algorithm onto a multi-core DSP, the TI C6487, is then explained. Benchmarks for the solution are given.

  12. A Heterogeneous Multi-core Architecture with a Hardware Kernel for Control Systems

    DEFF Research Database (Denmark)

    Li, Gang; Guan, Wei; Sierszecki, Krzysztof

    2012-01-01

    . This paper presents a multi-core architecture incorporating a hardware kernel on FPGAs, intended for high performance applications in control engineering domain. First, the hardware kernel is investigated on the basis of a component-based real-time kernel HARTEX (Hard Real-Time Executive for Control Systems......). Second, a heterogeneous multi-core architecture is investigated, focusing on its performance in relation to hard real-time constraints and predictable behavior. Third, the hardware implementation of HARTEX is designated to support the heterogeneous multi-core architecture. This hardware kernel has......Rapid industrialisation has resulted in a demand for improved embedded control systems with features such as predictability, high processing performance and low power consumption. Software kernel implementation on a single processor is becoming more difficult to satisfy those constraints...

  13. Dealing with BIG Data - Exploiting the Potential of Multicore Parallelism and Auto-Tuning

    CERN Document Server

    CERN. Geneva

    2012-01-01

    Physics experiments nowadays produce tremendous amounts of data that require sophisticated analyses in order to gain new insights. At such large scale, scientists are facing non-trivial software engineering problems in addition to the physics problems. Ubiquitous multicore processors and GPGPUs have turned almost any computer into a parallel machine and have pushed compute clusters and clouds to become multicore-based and more heterogenous. These developments complicate the exploitation of various types of parallelism within different layers of hardware and software. As a consequence, manual performance tuning is non-intuitive and tedious due to the large search space spanned by numerous inter-related tuning parameters. This talk addresses these challenges at CERN and discusses how to leverage multicore parallelization techniques in this context. It presents recent advances in automatic performance tuning to algorithmically find sweet spots with good performance. The talk also presents results from empiri...

  14. MPI-hybrid Parallelism for Volume Rendering on Large, Multi-core Systems

    Energy Technology Data Exchange (ETDEWEB)

    Howison, Mark; Bethel, E. Wes; Childs, Hank

    2010-03-20

    This work studies the performance and scalability characteristics of "hybrid" parallel programming and execution as applied to raycasting volume rendering -- a staple visualization algorithm -- on a large, multi-core platform. Historically, the Message Passing Interface (MPI) has become the de-facto standard for parallel programming and execution on modern parallel systems. As the computing industry trends towards multi-core processors, with four- and six-core chips common today and 128-core chips coming soon, we wish to better understand how algorithmic and parallel programming choices impact performance and scalability on large, distributed-memory multi-core systems. Our findings indicate that the hybrid-parallel implementation, at levels of concurrency ranging from 1,728 to 216,000, performs better, uses a smaller absolute memory footprint, and consumes less communication bandwidth than the traditional, MPI-only implementation.
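    The two-level decomposition behind hybrid parallelism can be sketched structurally. Since MPI ranks cannot be spawned in a short snippet, the outer thread pool below merely stands in for the distributed-memory level; in a real hybrid code the outer partitioning would map to MPI ranks, and only the inner level would use shared-memory threads.

```python
from concurrent.futures import ThreadPoolExecutor

def thread_partial(bounds):
    # innermost unit of work: one thread's slice of the reduction
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

def rank_work(bounds, n_threads=2):
    # inner level: shared-memory threads within one "rank"
    lo, hi = bounds
    step = max((hi - lo) // n_threads, 1)
    chunks = [(i, min(i + step, hi)) for i in range(lo, hi, step)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return sum(pool.map(thread_partial, chunks))

def hybrid_sum_squares(n, n_ranks=2):
    # outer level: "ranks" (threads standing in for MPI processes here)
    step = n // n_ranks
    slices = [(r * step, n if r == n_ranks - 1 else (r + 1) * step)
              for r in range(n_ranks)]
    with ThreadPoolExecutor(max_workers=n_ranks) as pool:
        return sum(pool.map(rank_work, slices))
```

    The memory-footprint advantage reported above comes from the inner level sharing one address space per node instead of replicating per-rank state.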

  15. On Designing Multicore-aware Simulators for Biological Systems

    CERN Document Server

    Aldinucci, Marco; Damiani, Ferruccio; Drocco, Maurizio; Torquati, Massimo; Troina, Angelo

    2010-01-01

    The stochastic simulation of biological systems is an increasingly popular technique in bioinformatics. It is often an enlightening technique, which may however be computationally expensive. We discuss the main opportunities to speed it up on multi-core platforms, which pose new challenges for parallelisation techniques. These opportunities are developed in two general families of solutions involving both the single simulation and a bulk of independent simulations (either replicas or runs derived from a parameter sweep). Proposed solutions are tested on the parallelisation of the CWC simulator (Calculus of Wrapped Compartments), which is carried out according to the proposed solutions by way of the FastFlow programming framework, making fast development and efficient execution on multi-cores possible.
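    The "bulk of independent simulations" case is embarrassingly parallel, as this sketch shows with a stand-in stochastic model (a toy birth-death walk, not CWC/FastFlow): each replica gets its own seeded generator, so results are reproducible regardless of scheduling.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def simulate(seed, steps=1000):
    """Stand-in stochastic simulation: a toy birth-death walk."""
    rng = random.Random(seed)          # private generator per replica
    pop = 100
    for _ in range(steps):
        pop += 1 if rng.random() < 0.5 else -1
        if pop <= 0:                   # extinction ends the replica early
            break
    return pop

def run_replicas(n_replicas, workers=4):
    # one independent task per replica; no shared state between them
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(simulate, range(n_replicas)))
```

    A parameter sweep follows the same pattern, mapping over (seed, parameter) pairs instead of seeds alone.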

  16. Fast decision algorithms in low-power embedded processors for quality-of-service based connectivity of mobile sensors in heterogeneous wireless sensor networks.

    Science.gov (United States)

    Jaraíz-Simón, María D; Gómez-Pulido, Juan A; Vega-Rodríguez, Miguel A; Sánchez-Pérez, Juan M

    2012-01-01

    When a mobile wireless sensor is moving along heterogeneous wireless sensor networks, it can be under the coverage of more than one network many times. In these situations, the Vertical Handoff process can happen, where the mobile sensor decides to change its connection from a network to the best network among the available ones according to their quality of service characteristics. A fitness function is used for the handoff decision, being desirable to minimize it. This is an optimization problem which consists of the adjustment of a set of weights for the quality of service. Solving this problem efficiently is relevant to heterogeneous wireless sensor networks in many advanced applications. Numerous works can be found in the literature dealing with the vertical handoff decision, although they all suffer from the same shortfall: a non-comparable efficiency. Therefore, the aim of this work is twofold: first, to develop a fast decision algorithm that explores the entire space of possible combinations of weights, searching for the one that minimizes the fitness function; and second, to design and implement a system on chip architecture based on reconfigurable hardware and embedded processors to achieve several goals necessary for competitive mobile terminals: good performance, low power consumption, low economic cost, and small area integration.
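    The exhaustive weight scan can be sketched as follows. The fitness function here is a hypothetical weighted sum of per-network QoS costs (the paper defines its own function and attributes, not reproduced here); the search simply enumerates every grid combination of weights that sums to one and keeps the minimizer.

```python
from itertools import product

def fitness(weights, qos_costs):
    # hypothetical fitness: weighted sum of normalized QoS attribute costs
    return sum(w * q for w, q in zip(weights, qos_costs))

def best_weights(qos_costs, step=0.1):
    """Scan every grid combination of non-negative weights summing to 1
    and return (best_fitness, best_weight_vector)."""
    n = len(qos_costs)
    levels = [round(i * step, 10) for i in range(int(round(1 / step)) + 1)]
    best = None
    for combo in product(levels, repeat=n):
        if abs(sum(combo) - 1.0) > 1e-9:
            continue                   # keep only valid weight vectors
        f = fitness(combo, qos_costs)
        if best is None or f < best[0]:
            best = (f, combo)
    return best
```

    With three attributes and a 0.1 grid the space has only 11³ points, which is why an exhaustive scan is feasible in embedded hardware.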

  17. Fast Decision Algorithms in Low-Power Embedded Processors for Quality-of-Service Based Connectivity of Mobile Sensors in Heterogeneous Wireless Sensor Networks

    Directory of Open Access Journals (Sweden)

    Juan M. Sánchez-Pérez

    2012-02-01

    Full Text Available When a mobile wireless sensor is moving along heterogeneous wireless sensor networks, it can be under the coverage of more than one network many times. In these situations, the Vertical Handoff process can happen, where the mobile sensor decides to change its connection from a network to the best network among the available ones according to their quality of service characteristics. A fitness function is used for the handoff decision, being desirable to minimize it. This is an optimization problem which consists of the adjustment of a set of weights for the quality of service. Solving this problem efficiently is relevant to heterogeneous wireless sensor networks in many advanced applications. Numerous works can be found in the literature dealing with the vertical handoff decision, although they all suffer from the same shortfall: a non-comparable efficiency. Therefore, the aim of this work is twofold: first, to develop a fast decision algorithm that explores the entire space of possible combinations of weights, searching for the one that minimizes the fitness function; and second, to design and implement a system on chip architecture based on reconfigurable hardware and embedded processors to achieve several goals necessary for competitive mobile terminals: good performance, low power consumption, low economic cost, and small area integration.

  18. Fast and Accurate Processor Hybrid Model Based on ESL

    Institute of Scientific and Technical Information of China (English)

    鲁超; 魏继增; 常轶松

    2012-01-01

    Because an RTL design cannot meet the simulation-speed requirements of System-on-Chip (SoC) development, this paper presents a fast and accurate processor hybrid model based on the Electronic System Level (ESL). Using the ESL design methodology, the proprietary 32-bit embedded microprocessor C*CORE340 is taken as the target, and the Instruction Set Simulator (ISS) and the cache are constructed at different abstraction layers. Experimental results show that the simulation speed of the hybrid model is at least 10 times faster than that of the RTL model, with a simulation accuracy error rate below 10%.

  19. Electronic Structure Calculations and Adaptation Scheme in Multi-core Computing Environments

    Energy Technology Data Exchange (ETDEWEB)

    Seshagiri, Lakshminarasimhan; Sosonkina, Masha; Zhang, Zhao

    2009-05-20

    Multi-core processing environments have become the norm in generic computing and are being considered for adding an extra dimension to the execution of any application. The T2 Niagara processor is a unique environment, consisting of eight cores each capable of running eight threads simultaneously. Applications like the General Atomic and Molecular Electronic Structure System (GAMESS), used for ab initio molecular quantum chemistry calculations, can be good indicators of the performance of such machines and a guideline for both hardware designers and application programmers. In this paper we benchmark GAMESS performance on a T2 Niagara processor for a couple of molecules. We also show the suitability of using a middleware-based adaptation algorithm with GAMESS in such a multi-core environment.

  20. LDRD final report : a lightweight operating system for multi-core capability class supercomputers.

    Energy Technology Data Exchange (ETDEWEB)

    Kelly, Suzanne Marie; Hudson, Trammell B. (OS Research); Ferreira, Kurt Brian; Bridges, Patrick G. (University of New Mexico); Pedretti, Kevin Thomas Tauke; Levenhagen, Michael J.; Brightwell, Ronald Brian

    2010-09-01

    The two primary objectives of this LDRD project were to create a lightweight kernel (LWK) operating system(OS) designed to take maximum advantage of multi-core processors, and to leverage the virtualization capabilities in modern multi-core processors to create a more flexible and adaptable LWK environment. The most significant technical accomplishments of this project were the development of the Kitten lightweight kernel, the co-development of the SMARTMAP intra-node memory mapping technique, and the development and demonstration of a scalable virtualization environment for HPC. Each of these topics is presented in this report by the inclusion of a published or submitted research paper. The results of this project are being leveraged by several ongoing and new research projects.

  1. PERFORMANCE EVALUATION OF DIRECT PROCESSOR ACCESS FOR NON DEDICATED SERVER

    Directory of Open Access Journals (Sweden)

    P. S. BALAMURUGAN

    2010-10-01

    Full Text Available The objective of this paper is to design a coprocessor for a desktop machine that enables the machine to act as a non-dedicated server: the coprocessor acts as the server processor while the multi-core processor serves as the desktop processor. With this methodology, a client machine can act simultaneously as a non-dedicated server and a client. Such machines can be used in autonomous networks. This design leads to a cost-effective machine that can act in parallel as a non-dedicated server and a client, or be switched to act as either client or server.

  2. High performance parallelism pearls 2 multicore and many-core programming approaches

    CERN Document Server

    Jeffers, Jim

    2015-01-01

    High Performance Parallelism Pearls Volume 2 offers another set of examples that demonstrate how to leverage parallelism. Similar to Volume 1, the techniques included here explain how to use processors and coprocessors with the same programming - illustrating the most effective ways to combine Xeon Phi coprocessors with Xeon and other multicore processors. The book includes examples of successful programming efforts, drawn from across industries and domains such as biomed, genetics, finance, manufacturing, imaging, and more. Each chapter in this edited work includes detailed explanations of t

  3. Pseudo-Random Number Generators for Vector Processors and Multicore Processors

    DEFF Research Database (Denmark)

    Fog, Agner

    2015-01-01

    Large scale Monte Carlo applications need a good pseudo-random number generator capable of utilizing both the vector processing capabilities and multiprocessing capabilities of modern computers in order to get the maximum performance. The requirements for such a generator are discussed. New ways...

  4. A multi-core CPU pipeline architecture for virtual environments.

    Science.gov (United States)

    Acosta, Eric; Liu, Alan; Sieck, Jennifer; Muniz, Gilbert; Bowyer, Mark; Armonda, Rocco

    2009-01-01

    Physically-based virtual environments (VEs) provide realistic interactions and behaviors for computer-based medical simulations. Limited CPU resources have traditionally forced VEs to be simplified for real-time performance. Multi-core processors greatly increase the computational capacity of computers and are quickly becoming standard. However, developing non-application specific methods to fully utilize all available CPU cores for processing VEs is difficult. The paper describes a pipeline VE architecture designed for multi-core CPU systems. The architecture enables development of VEs that leverage the computational resources of all CPU cores for VE simulation. A VE's workload is dynamically distributed across the available CPU cores. A VE can be developed once and scale efficiently with the number of cores. The described pipeline architecture makes it possible to develop complex physically-based VEs for medical simulations. Initial results for a craniotomy simulator being developed have shown super-linear and near-linear speedups when tested with up to four cores.
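The craniotomy simulator's code is not public; as a generic illustration of the dynamic-distribution idea — worker threads pulling a simulation step's work items from a shared queue so load balances itself across however many cores the pool is sized for — a sketch with illustrative names might look like:

```python
# Hypothetical sketch of dynamically distributing one VE step's workload.
import threading
import queue

def update_element(x):
    # Stand-in for one physically-based update on a VE element.
    return x * 0.5 + 1.0

def simulate_step(state, n_workers=4):
    """Run one simulation step, dynamically distributing elements."""
    work = queue.Queue()
    for idx, val in enumerate(state):
        work.put((idx, val))
    out = [None] * len(state)

    def worker():
        while True:
            try:
                idx, val = work.get_nowait()  # pull next available work item
            except queue.Empty:
                return
            out[idx] = update_element(val)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```

In CPython, true multi-core speedup would require processes or a GIL-releasing compute kernel; the sketch only shows the dynamic-distribution structure, which is what lets a VE "be developed once and scale with the number of cores."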

  5. Efficient Aho-Corasick String Matching on Emerging Multicore Architectures

    Energy Technology Data Exchange (ETDEWEB)

    Tumeo, Antonino; Villa, Oreste; Secchi, Simone; Chavarría-Miranda, Daniel

    2013-12-12

    String matching algorithms are critical to several scientific fields. Besides text processing and databases, emerging applications such as DNA and protein sequence analysis, data mining, information security software, antivirus, and machine learning all exploit string matching algorithms [3]. All these applications usually process large quantities of textual data and require high performance and/or predictable execution times. Among all the string matching algorithms, one of the most studied, especially for text processing and security applications, is the Aho-Corasick algorithm. Aho-Corasick is an exact, multi-pattern string matching algorithm which performs the search in time linearly proportional to the length of the input text, independently of pattern set size. However, depending on the implementation, when the number of patterns increases, memory occupation may rise drastically. In turn, this can lead to significant variability in performance, due to memory access times and caching effects. This is a significant concern for many mission-critical applications and modern high performance architectures. For example, security applications such as Network Intrusion Detection Systems (NIDS) must be able to scan network traffic against very large dictionaries in real time. Modern Ethernet links reach up to 10 Gbps, and malicious threats already number well over 1 million, growing exponentially [28]. When performing the search, a NIDS should not slow down the network, or let network packets pass unchecked. Nevertheless, on current state-of-the-art cache-based processors, there may be large performance variability when dealing with big dictionaries and inputs that have different frequencies of matching patterns. In particular, when few patterns are matched and they are all in the cache, the procedure is fast. Instead, when they are not in the cache, often because many patterns are matched and the caches are
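To make the algorithm's properties concrete — search time linear in the text length, memory that grows with the pattern set — here is a minimal textbook Aho-Corasick automaton (a standard formulation, not the chapter's optimized implementation):

```python
# Minimal Aho-Corasick: trie + failure links built by BFS, then a single
# linear pass over the text that reports every (position, pattern) match.
from collections import deque

def build_automaton(patterns):
    goto = [{}]   # goto[state] maps a character to the next state
    fail = [0]    # failure link per state
    out = [[]]    # patterns recognized when reaching each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({}); fail.append(0); out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    bfs = deque(goto[0].values())        # depth-1 states keep fail = 0
    while bfs:
        s = bfs.popleft()
        for ch, nxt in goto[s].items():
            bfs.append(nxt)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] += out[fail[nxt]]   # inherit matches ending here
    return goto, fail, out

def search(text, automaton):
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]          # follow failure links on mismatch
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits
```

Each text character advances the automaton exactly once (failure-link hops amortize out), which is the linear-time guarantee; the per-state dictionaries are also where the memory footprint — and hence the cache behavior the chapter analyzes — comes from.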

  6. Hybrid programming model for implicit PDE simulations on multicore architectures

    KAUST Repository

    Kaushik, Dinesh K.

    2011-01-01

    The complexity of programming modern multicore processor based clusters is rapidly rising, with GPUs adding further demand for fine-grained parallelism. This paper analyzes the performance of the hybrid (MPI+OpenMP) programming model in the context of an implicit unstructured mesh CFD code. At the implementation level, the effects of cache locality, update management, work division, and synchronization frequency are studied. The hybrid model presents interesting algorithmic opportunities as well: the linear system solver converges more quickly than in the pure MPI case, since the parallel preconditioner stays stronger when the hybrid model is used. This implies significant savings in the cost of communication and synchronization (explicit and implicit). Even though OpenMP-based parallelism is easier to implement (within a subdomain assigned to one MPI process for simplicity), getting good performance requires attention to data partitioning issues similar to those in the message-passing case. © 2011 Springer-Verlag.

  7. Developing Parallel Application on Multi-core Mobile Phone

    Directory of Open Access Journals (Sweden)

    Dhuha Basheer Abdullah

    2013-12-01

    Full Text Available One cannot imagine daily life today without mobile devices such as mobile phones or PDAs. They tend to become your mobile computer, offering all the features one might need on the way. As a result, devices are less expensive and include a huge number of high-end technological components, which also makes them attractive for scientific research. Today, multi-core mobile phones are taking all the attention. Relying on the principles of task and data parallelism, we propose in this paper a real-time mobile lane departure warning system (M-LDWS) based on a carefully designed parallel programming framework on a quad-core mobile phone, and show how to increase processor utilization to improve the system's runtime.

  8. Energy-aware Thread and Data Management in Heterogeneous Multi-core, Multi-memory Systems

    Energy Technology Data Exchange (ETDEWEB)

    Su, Chun-Yi [Virginia Polytechnic Inst. and State Univ. (Virginia Tech), Blacksburg, VA (United States)

    2014-12-16

    By 2004, microprocessor design focused on multicore scaling—increasing the number of cores per die in each generation—as the primary strategy for improving performance. These multicore processors typically equip multiple memory subsystems to improve data throughput. In addition, these systems employ heterogeneous processors such as GPUs and heterogeneous memories like non-volatile memory to improve performance, capacity, and energy efficiency. With the increasing volume of hardware resources and system complexity caused by heterogeneity, future systems will require intelligent ways to manage hardware resources. Early research to improve performance and energy efficiency on heterogeneous, multi-core, multi-memory systems focused on tuning a single primitive or at best a few primitives in the systems. The key limitation of past efforts is their lack of a holistic approach to resource management that balances the tradeoff between performance and energy consumption. In addition, the shift from simple, homogeneous systems to these heterogeneous, multicore, multi-memory systems requires in-depth understanding of efficient resource management for scalable execution, including new models that capture the interchange between performance and energy, smarter resource management strategies, and novel low-level performance/energy tuning primitives and runtime systems. Tuning an application to control available resources efficiently has become a daunting challenge; managing resources in automation is still a dark art since the tradeoffs among programming, energy, and performance remain insufficiently understood. In this dissertation, I have developed theories, models, and resource management techniques to enable energy-efficient execution of parallel applications through thread and data management in these heterogeneous multi-core, multi-memory systems. I study the effect of dynamic concurrent throttling on the performance and energy of multi-core, non-uniform memory access

  9. A Fault Detection Mechanism in a Data-flow Scheduled Multithreaded Processor

    NARCIS (Netherlands)

    Fu, J.; Yang, Q.; Poss, R.; Jesshope, C.R.; Zhang, C.

    2014-01-01

    This paper designs and implements Redundant Multi-Threading (RMT) in a Data-flow scheduled MultiThreaded (DMT) multicore processor, called Data-flow scheduled Redundant Multi-Threading (DRMT). It also presents Asynchronous Output Comparison (AOC), which allows RMT techniques to avoid fault detection
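DRMT is a hardware design, but the redundant-execution-plus-asynchronous-comparison idea can be sketched in software: two copies of a computation run concurrently, and a comparator consumes both output streams, flagging any mismatch as a detected fault. This is an illustrative analogy (all names invented), not the paper's mechanism:

```python
# Software analogy of redundant multithreading with output comparison:
# two threads compute the same stream; the comparator pairs their outputs.
import threading
import queue

def redundant_run(fn, inputs, fault=None):
    """Run fn twice over inputs; return indices where the copies disagree."""
    q0, q1 = queue.Queue(), queue.Queue()

    def worker(out_q, inject):
        for i, x in enumerate(inputs):
            y = fn(x)
            if inject is not None and i == inject:
                y = y + 1  # simulated transient fault in one copy
            out_q.put(y)

    t0 = threading.Thread(target=worker, args=(q0, fault))
    t1 = threading.Thread(target=worker, args=(q1, None))
    t0.start(); t1.start()

    mismatches = []
    for i in range(len(inputs)):
        a, b = q0.get(), q1.get()  # comparator blocks until both copies emit
        if a != b:
            mismatches.append(i)
    t0.join(); t1.join()
    return mismatches
```

The comparison is asynchronous in the sense that neither copy stalls for the other at each step; outputs are buffered in queues and checked as they arrive, which is the latency-hiding benefit AOC targets.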

  10. Multicore in production: advantages and limits of the multiprocess approach in the ATLAS experiment

    Science.gov (United States)

    Binet, S.; Calafiura, P.; Jha, M. K.; Lavrijsen, W.; Leggett, C.; Lesny, D.; Severini, H.; Smith, D.; Snyder, S.; Tatarkhanov, M.; Tsulaia, V.; VanGemmeren, P.; Washbrook, A.

    2012-06-01

    The shared memory architecture of multicore CPUs provides HEP developers with the opportunity to reduce the memory footprint of their applications by sharing memory pages between the cores in a processor. ATLAS pioneered the multi-process approach to parallelize HEP applications. Using Linux fork() and the Copy On Write mechanism we implemented a simple event task farm, which allowed us to achieve sharing of almost 80% of memory pages among event worker processes for certain types of reconstruction jobs with negligible CPU overhead. By leaving the task of managing shared memory pages to the operating system, we have been able to parallelize large reconstruction and simulation applications originally written to be run in a single thread of execution with little to no change to the application code. The process of validating AthenaMP for production took ten months of concentrated effort and is expected to continue for several more months. Besides validating the software itself, an important and time-consuming aspect of running multicore applications in production was to configure the ATLAS distributed production system to handle multicore jobs. This entailed defining multicore batch queues, where the unit resource is not a core, but a whole computing node; monitoring the output of many event workers; and adapting the job definition layer to handle computing resources with different event throughputs. We will present scalability and memory usage studies, based on data gathered both on dedicated hardware and at the CERN Computer Center.
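Setting AthenaMP's actual sources aside, the fork()-and-Copy-On-Write task-farm pattern the abstract describes can be sketched in a few lines (POSIX-only; `run_task_farm` and the round-robin event split are illustrative, not ATLAS code):

```python
# Sketch of a fork()-based event task farm: the parent initializes large
# read-only state, then forks workers that share its pages copy-on-write,
# so only pages a worker modifies cost additional memory.
import os

def run_task_farm(n_workers, events, process_event):
    """Fork n_workers; worker w handles events[w::n_workers]."""
    pids = []
    for w in range(n_workers):
        pid = os.fork()
        if pid == 0:
            # Child: inherited pages (e.g. geometry, conditions data)
            # remain shared with the parent until written.
            for ev in events[w::n_workers]:
                process_event(ev)
            os._exit(0)
        pids.append(pid)
    for pid in pids:
        os.waitpid(pid, 0)  # reap all event workers
```

Because the operating system manages page sharing, a single-threaded application can be parallelized this way with little to no change to its code — which is the point the abstract makes about the ~80% page sharing achieved in reconstruction jobs.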

  11. StochKit-FF: Efficient Systems Biology on Multicore Architectures

    CERN Document Server

    Aldinucci, Marco; Liò, Pietro; Sorathiya, Anil; Torquati, Massimo

    2010-01-01

    The stochastic modelling of biological systems is an informative and, in some cases, very adequate technique, which may however be more expensive than other modelling approaches, such as differential equations. We present StochKit-FF, a parallel version of StochKit, a reference toolkit for stochastic simulations. StochKit-FF is based on the FastFlow programming toolkit for multicores and exploits the novel concept of selective memory. We evaluate StochKit-FF on a model of HIV infection dynamics, with the aim of extracting information from efficiently run experiments, here in terms of average and variance and, in the longer term, of more structured data.
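StochKit-FF itself is C++ on FastFlow; as a language-neutral sketch of what "many stochastic trajectories reduced to average and variance" means, here is a Gillespie simulation of a simple birth-death process with the per-run results combined the way a parallel reduction would combine them. The model and its parameters are illustrative, not the HIV model from the paper:

```python
# Gillespie stochastic simulation of a birth-death process, with an
# ensemble reduction to mean and variance of the final population.
import random

def gillespie_birth_death(x0, birth, death, t_end, rng):
    """Simulate X(t): birth at rate `birth`, death at rate `death * x`."""
    x, t = x0, 0.0
    while t < t_end:
        a_birth = birth
        a_death = death * x
        a_total = a_birth + a_death
        if a_total == 0:
            break
        t += rng.expovariate(a_total)         # time to the next reaction
        if t >= t_end:
            break
        if rng.random() * a_total < a_birth:  # choose which reaction fires
            x += 1
        else:
            x -= 1
    return x

def ensemble_stats(n_runs, seed=0):
    """Run n_runs independent trajectories and reduce to (mean, variance)."""
    rng = random.Random(seed)
    finals = [gillespie_birth_death(10, 5.0, 0.5, 2.0, rng)
              for _ in range(n_runs)]
    mean = sum(finals) / n_runs
    var = sum((f - mean) ** 2 for f in finals) / n_runs
    return mean, var
```

In a multicore setting each trajectory is independent, so runs can be farmed out to workers and only the small (mean, variance) accumulators need to be merged — the kind of reduction StochKit-FF's selective memory supports.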

  12. Robust motion estimation on a low-power multi-core DSP

    Science.gov (United States)

    Igual, Francisco D.; Botella, Guillermo; García, Carlos; Prieto, Manuel; Tirado, Francisco

    2013-12-01

    This paper addresses the efficient implementation of a robust gradient-based optical flow model in a low-power platform based on a multi-core digital signal processor (DSP). The aim of this work was to carry out a feasibility study on the use of these devices in autonomous systems such as robot navigation, biomedical assistance, or tracking, with not only power restrictions but also real-time requirements. We consider the C6678 DSP from Texas Instruments (Dallas, TX, USA) as the target platform of our implementation. The interest of this research is particularly relevant in optical flow scope because this system can be considered as an alternative solution for mid-range video resolutions when a combination of in-processor parallelism with optimizations such as efficient memory-hierarchy exploitation and multi-processor parallelization are applied.

  13. Cluster Algorithm Special Purpose Processor

    Science.gov (United States)

    Talapov, A. L.; Shchur, L. N.; Andreichenko, V. B.; Dotsenko, Vl. S.

    We describe a Special Purpose Processor, realizing the Wolff algorithm in hardware, which is fast enough to study the critical behaviour of 2D Ising-like systems containing more than one million spins. The processor has been checked to produce correct results for a pure Ising model and for Ising model with random bonds. Its data also agree with the Nishimori exact results for spin glass. Only minor changes of the SPP design are necessary to increase the dimensionality and to take into account more complex systems such as Potts models.
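The SPP realizes the Wolff algorithm in hardware; for reference, here is a compact software version of the same single-cluster update for a 2D Ising model with periodic boundaries (a standard textbook formulation, not the SPP design):

```python
# One Wolff single-cluster update for the 2D ferromagnetic Ising model.
import math
import random

def wolff_step(spins, L, beta, rng):
    """Grow one Wolff cluster from a random seed and flip it.

    spins: dict mapping (i, j) -> +1/-1 on an L x L periodic lattice.
    Returns the cluster size.
    """
    p_add = 1.0 - math.exp(-2.0 * beta)   # bond-activation probability
    seed = (rng.randrange(L), rng.randrange(L))
    s = spins[seed]
    cluster = {seed}
    stack = [seed]
    while stack:
        i, j = stack.pop()
        for n in (((i + 1) % L, j), ((i - 1) % L, j),
                  (i, (j + 1) % L), (i, (j - 1) % L)):
            # Add aligned neighbors to the cluster with probability p_add.
            if n not in cluster and spins[n] == s and rng.random() < p_add:
                cluster.add(n)
                stack.append(n)
    for site in cluster:
        spins[site] = -s                  # flip the whole cluster at once
    return len(cluster)
```

Flipping whole clusters rather than single spins is what defeats critical slowing down, which is why a hardware realization of this update can probe the critical behaviour of million-spin systems efficiently.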

  14. Cluster algorithm special purpose processor

    Energy Technology Data Exchange (ETDEWEB)

    Talapov, A.L.; Shchur, L.N.; Andreichenko, V.B.; Dotsenko, V.S. (Landau Inst. for Theoretical Physics, GSP-1 117940 Moscow V-334 (USSR))

    1992-08-10

    In this paper, the authors describe a Special Purpose Processor, realizing the Wolff algorithm in hardware, which is fast enough to study the critical behaviour of 2D Ising-like systems containing more than one million spins. The processor has been checked to produce correct results for a pure Ising model and for Ising model with random bonds. Its data also agree with the Nishimori exact results for spin glass. Only minor changes of the SPP design are necessary to increase the dimensionality and to take into account more complex systems such as Potts models.

  15. Accelerating 3D Elastic Wave Equations on Knights Landing based Intel Xeon Phi processors

    Science.gov (United States)

    Sourouri, Mohammed; Birger Raknes, Espen

    2017-04-01

    In advanced imaging methods like reverse-time migration (RTM) and full waveform inversion (FWI), the elastic wave equation (EWE) is numerically solved many times to create the seismic image or the elastic parameter model update. Thus, it is essential to optimize the solution time for solving the EWE, as this has a major impact on the total computational cost of running RTM or FWI. From a computational point of view, applications implementing EWEs face two major challenges. The first is the amount of memory-bound computation involved; the second is the execution of such computations over very large datasets. So far, multi-core processors have not been able to tackle these two challenges, which eventually led to the adoption of accelerators such as Graphics Processing Units (GPUs). Compared to conventional CPUs, GPUs are densely populated with many floating-point units and fast memory, a type of architecture that has proven to map well to many scientific computations. Despite their architectural advantages, full-scale adoption of accelerators has yet to materialize. First, accelerators require a significant programming effort imposed by programming models such as CUDA or OpenCL. Second, accelerators come with a limited amount of memory and require explicit data transfers between the CPU and the accelerator over the slow PCI bus. The second generation of the Xeon Phi processor, based on the Knights Landing (KNL) architecture, promises the computational capabilities of an accelerator but requires the same programming effort as traditional multi-core processors. The high computational performance is realized through many integrated cores (the number of cores, tiles, and memory varies with the model) organized in tiles that are connected via a 2D mesh-based interconnect. Contrary to accelerators, KNL is a self-hosted system, meaning explicit data transfers over the PCI bus are no longer required. However, like most

  16. Cache Energy Optimization Techniques For Modern Processors

    Energy Technology Data Exchange (ETDEWEB)

    Mittal, Sparsh [ORNL

    2013-01-01

    Modern multicore processors employ large last-level caches; for example, Intel's E7-8800 processor uses a 24 MB L3 cache. Further, with each CMOS technology generation, leakage energy has been increasing dramatically, and hence leakage is expected to become a major source of energy dissipation, especially in last-level caches (LLCs). Conventional cache energy saving schemes either aim at saving dynamic energy or are based on properties specific to first-level caches, and thus have limited utility for last-level caches. Further, several other techniques require offline profiling or per-application tuning and hence are not suitable for production systems. In this book, we present novel cache leakage energy saving schemes for single-core and multicore systems; desktop, QoS, real-time, and server systems. We also present cache energy saving techniques for caches designed with both conventional SRAM devices and emerging non-volatile devices such as STT-RAM (spin-transfer torque RAM). We present software-controlled, hardware-assisted techniques that use dynamic cache reconfiguration to configure the cache to the most energy-efficient configuration while keeping the performance loss bounded. To profile and test a large number of potential configurations, we utilize low-overhead micro-architectural components which can be easily integrated into modern processor chips. We adopt a system-wide approach to saving energy, ensuring that cache reconfiguration does not increase the energy consumption of other components of the processor. We have compared our techniques with the state of the art and found that ours outperform it in terms of energy efficiency and other relevant metrics. The techniques presented in this book have important applications in improving the energy efficiency of higher-end embedded, desktop, QoS, real-time, and server processors and multitasking systems. This book is intended to be a valuable guide for both

  17. Research of real-time wavefront reconstruction and control based on multi-core DSP

    Science.gov (United States)

    Wang, Zhui; Zhou, Luchun; Li, Mei; Zhang, Haotian

    2014-09-01

    In an Adaptive Optics system, the Real-Time Processor is as important as the human brain. Processing latency is a key index of Real-Time Processors. In this paper, we propose a new processing method that significantly reduces processing latency by combining multi-core parallel processing in both space and time. In addition, by comparing the operating speed of the CPU with the I/O speed of memory, we propose a reasonable memory allocation scheme. Experimental results show a processing latency of 59.7 us per frame using the multi-core DSP TMS320C6678 as the processing platform. The experiment was conducted on a system with 968 sub-apertures and 913 actuators.

  18. Streamline Integration using MPI-Hybrid Parallelism on a Large Multi-Core Architecture

    Energy Technology Data Exchange (ETDEWEB)

    Camp, David; Garth, Christoph; Childs, Hank; Pugmire, Dave; Joy, Kenneth I.

    2010-11-01

    Streamline computation in a very large vector field data set represents a significant challenge due to the non-local and data-dependent nature of streamline integration. In this paper, we conduct a study of the performance characteristics of hybrid parallel programming and execution as applied to streamline integration on a large, multi-core platform. With multi-core processors now prevalent in clusters and supercomputers, there is a need to understand the impact of these hybrid systems in order to make the best implementation choice. We use two MPI-based distribution approaches based on established parallelization paradigms, parallelize-over-seeds and parallelize-over-blocks, and present a novel MPI-hybrid algorithm for each approach to compute streamlines. Our findings indicate that the work sharing between cores in the proposed MPI-hybrid parallel implementation results in much improved performance and consumes less communication and I/O bandwidth than a traditional, non-hybrid distributed implementation.

  19. Streamline integration using MPI-hybrid parallelism on a large multicore architecture.

    Science.gov (United States)

    Camp, David; Garth, Christoph; Childs, Hank; Pugmire, Dave; Joy, Kenneth I

    2011-11-01

    Streamline computation in a very large vector field data set represents a significant challenge due to the nonlocal and data-dependent nature of streamline integration. In this paper, we conduct a study of the performance characteristics of hybrid parallel programming and execution as applied to streamline integration on a large, multicore platform. With multicore processors now prevalent in clusters and supercomputers, there is a need to understand the impact of these hybrid systems in order to make the best implementation choice. We use two MPI-based distribution approaches based on established parallelization paradigms, parallelize over seeds and parallelize over blocks, and present a novel MPI-hybrid algorithm for each approach to compute streamlines. Our findings indicate that the work sharing between cores in the proposed MPI-hybrid parallel implementation results in much improved performance and consumes less communication and I/O bandwidth than a traditional, nonhybrid distributed implementation.
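The parallelize-over-seeds approach can be illustrated with a toy tracer: each task integrates one complete streamline through a shared vector field, so seeds are embarrassingly parallel as long as the field is globally accessible. Here an analytic rotation field stands in for the large data set, and a thread pool stands in for MPI ranks; all names are illustrative:

```python
# Toy parallelize-over-seeds streamline tracer: RK4 integration of each
# seed's full streamline, with seeds distributed across a worker pool.
import math
from concurrent.futures import ThreadPoolExecutor

def velocity(p):
    # Analytic stand-in for the vector field: rigid rotation about the origin.
    x, y = p
    return (-y, x)

def rk4_step(p, h):
    def add(a, b, s):
        return (a[0] + s * b[0], a[1] + s * b[1])
    k1 = velocity(p)
    k2 = velocity(add(p, k1, h / 2))
    k3 = velocity(add(p, k2, h / 2))
    k4 = velocity(add(p, k3, h))
    return (p[0] + h / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]),
            p[1] + h / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]))

def streamline(seed, steps=100, h=0.05):
    pts = [seed]
    for _ in range(steps):
        pts.append(rk4_step(pts[-1], h))
    return pts

def trace_all(seeds, n_workers=4):
    """Parallelize-over-seeds: each task integrates one full streamline."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(streamline, seeds))
```

The paper's hard part, absent from this toy, is that a real field is too large for every worker to hold: parallelize-over-seeds must fetch blocks on demand as a streamline wanders, while parallelize-over-blocks instead hands the particle off between block owners, which is the trade-off the hybrid algorithms address.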

  20. The Coupling Waves of Multicore-Fiber

    Institute of Scientific and Technical Information of China (English)

    2003-01-01

    Multicore fiber as a gain medium for fiber lasers is introduced. The cores are coupled via evanescent waves. Analysis of the coupling waves agrees with numerical simulations and experimental results.

  1. Multi-core fiber undersea transmission systems

    DEFF Research Database (Denmark)

    Nooruzzaman, Md; Morioka, Toshio

    2017-01-01

    Various potential architectures of branching units for multi-core fiber undersea transmission systems are presented. It is also investigated how different architectures of branching unit influence the number of fibers and those of inline components.

  2. The Bulk Multicore Architecture for Improved Programmability

    Science.gov (United States)

    2009-12-01

    algorithm, forcing the same order of chunk commits as in the recording step. This design, which we call PicoLog, typically incurs a performance cost... PicoLog. Data-race detection at production-run speed: the Bulk Multicore can support an efficient data-race detector based on the "happens-before..." relation. [Figure: the Bulk Multicore (a), with a possible OrderOnly execution log (b) and a PicoLog execution log (c).] Contributed articles, December 2009, Vol. 52.

  3. Static and Dynamic Frequency Scaling on Multicore CPUs

    Energy Technology Data Exchange (ETDEWEB)

    Bao, Wenlei [The Ohio State University, Columbus, Ohio; Hong, Changwan [The Ohio State University, Columbus, Ohio; Chunduri, Sudheer [IBM Research India, S. Cass Avenue Lemont, IL; Krishnamoorthy, Sriram [Pacific Northwest National Laboratory, Richland, WA; Pouchet, Louis-Noël [Colorado State University, Fort Collins, CO; Rastello, Fabrice [University Grenoble Alpes, Grenoble France; Sadayappan, P. [The Ohio State University, Columbus, Ohio

    2016-12-28

    Dynamic voltage and frequency scaling (DVFS) adapts CPU power consumption by modifying a processor's operating frequency (and the associated voltage). Typical approaches employing DVFS involve default strategies such as running at the lowest or the highest frequency, or observing the CPU's runtime behavior and dynamically adapting the voltage/frequency configuration based on CPU usage. In this paper, we argue that many previous approaches suffer from inherent limitations, such as not accounting for the processor-specific impact of frequency changes on energy for different workload types. We first propose a lightweight runtime-based approach to automatically adapt the frequency based on the CPU workload that is agnostic of the processor characteristics. We then show that further improvements can be achieved for affine kernels in the application, using a compile-time characterization instead of run-time monitoring to select the frequency and the number of CPU cores to use. Our framework relies on a one-time energy characterization of CPU-specific DVFS profiles followed by a compile-time categorization of loop-based code segments in the application. These are combined to determine a priori the frequency and the number of cores to use to execute the application so as to optimize energy or energy-delay product, outperforming the runtime approach. Extensive evaluation on 60 benchmarks and five multi-core CPUs shows that our approach systematically outperforms the powersave Linux governor, while improving overall performance.
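The paper's characterization data is not public; the following only sketches the shape of a workload-aware frequency policy — clock down memory-bound intervals, where higher frequency buys little, and clock up compute-bound ones. The thresholds and frequency table are invented for illustration:

```python
# Illustrative workload-aware DVFS policy: map simple performance-counter
# signals (IPC, memory-stall fraction) to a target frequency.
FREQS_MHZ = [1200, 1800, 2400, 3000]  # hypothetical available P-states

def pick_frequency(instr_per_cycle, mem_stall_fraction):
    """Memory-bound intervals gain little from high frequency: clock down."""
    if mem_stall_fraction > 0.6:
        return FREQS_MHZ[0]          # heavily memory-bound: lowest frequency
    if mem_stall_fraction > 0.3:
        return FREQS_MHZ[1]          # moderately memory-bound
    if instr_per_cycle > 1.5:
        return FREQS_MHZ[3]          # compute-bound: highest frequency
    return FREQS_MHZ[2]
```

A runtime governor would evaluate such a rule per interval; the paper's contribution is replacing the runtime monitoring with a compile-time categorization of affine loop nests, so the frequency and core count can be fixed before the kernel even starts.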

  4. Design for Testability Features of Godson-3 Multicore Microprocessor

    Institute of Scientific and Technical Information of China (English)

    Zi-Chu Qi; Hui Liu; Xiang-Ku Li; Wei-Wu Hu

    2011-01-01

    This paper describes the design-for-testability (DFT) challenges and techniques of the Godson-3 microprocessor, a scalable multicore processor based on the scalable mesh of crossbar (SMOC) on-chip network that targets high-end applications. Advanced techniques are adopted to make the DFT design scalable and to achieve low-power, low-cost test with limited IO resources. For scalable and flexible test access, a highly elaborate test access mechanism (TAM) is implemented to support multiple test instructions and test modes. Taking advantage of the multiple identical cores embedded in the processor, scan partitioning and on-chip comparison are employed to reduce test power and test time. Test compression is also utilized to decrease test time. To further reduce test power, clock control logic is designed with the ability to turn off the clocks of partitions not under test. In addition, scan collars for the caches are designed to perform functional test with low-speed ATE for speed-binning purposes, with low complexity and good correlation results.

  5. Fast Method for Compression in MPEG Audio Layer III by Digital Signal Processor

    Institute of Scientific and Technical Information of China (English)

    窦维蓓; 阳学仕; 董在望

    1999-01-01

    The MPEG Audio Layer III compression algorithm, specified by the ISO 11172-3 standard, is an efficient, high-fidelity audio coding algorithm. Because the Layer III algorithm is highly complex and computationally demanding, this paper proposes acceleration techniques for the key computations of the Layer III compression algorithm on a Digital Signal Processor (DSP) for real-time applications.

  6. Safety-critical Java on a time-predictable processor

    DEFF Research Database (Denmark)

    Korsholm, Stephan E.; Schoeberl, Martin; Puffitsch, Wolfgang

    2015-01-01

    For real-time systems the whole execution stack needs to be time-predictable and analyzable for the worst-case execution time (WCET). This paper presents a time-predictable platform for safety-critical Java. The platform consists of (1) the Patmos processor, which is a time-predictable processor; (2) a C compiler for Patmos with support for WCET analysis; (3) the HVM, which is a Java-to-C compiler; (4) the HVM-SCJ implementation, which supports SCJ Levels 0, 1, and 2 (for both single and multicore platforms); and (5) a WCET analysis tool. We show that real-time Java programs translated to C and compiled to a Patmos binary can be analyzed by the AbsInt aiT WCET analysis tool. To the best of our knowledge the presented system is the second WCET-analyzable real-time Java system, and the first one on top of a RISC processor.

  7. Safety-Critical Java on a Time-predictable Processor

    DEFF Research Database (Denmark)

    Korsholm, Stephan Erbs; Schoeberl, Martin; Puffitsch, Wolfgang

    2015-01-01

    For real-time systems the whole execution stack needs to be time-predictable and analyzable for the worst-case execution time (WCET). This paper presents a time-predictable platform for safety-critical Java. The platform consists of (1) the Patmos processor, which is a time-predictable processor; (2) a C compiler for Patmos with support for WCET analysis; (3) the HVM, which is a Java-to-C compiler; (4) the HVM-SCJ implementation, which supports SCJ Levels 0, 1, and 2 (for both single and multicore platforms); and (5) a WCET analysis tool. We show that real-time Java programs translated to C and compiled to a Patmos binary can be analyzed by the AbsInt aiT WCET analysis tool. To the best of our knowledge the presented system is the second WCET-analyzable real-time Java system, and the first one on top of a RISC processor.

  8. Solving Matrix Equations on Multi-Core and Many-Core Architectures

    Directory of Open Access Journals (Sweden)

    Peter Benner

    2013-11-01

    Full Text Available We address the numerical solution of Lyapunov, algebraic and differential Riccati equations, via the matrix sign function, on platforms equipped with general-purpose multicore processors and, optionally, one or more graphics processing units (GPUs. In particular, we review the solvers for these equations, as well as the underlying methods, analyze their concurrency and scalability and provide details on their parallel implementation. Our experimental results show that this class of hardware provides sufficient computational power to tackle large-scale problems, which only a few years ago would have required a cluster of computers.
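
    The matrix sign function at the heart of these solvers admits a very compact statement: Newton's iteration X_{k+1} = (X_k + X_k^{-1})/2 converges quadratically to sign(A) for any A with no purely imaginary eigenvalues. A dependency-free Python sketch, with a Gauss-Jordan inverse standing in for the BLAS/LAPACK and GPU kernels the paper actually benchmarks (the test matrix is illustrative):

```python
# Newton iteration for the matrix sign function:
#     X_{k+1} = (X_k + X_k^{-1}) / 2,   X_0 = A
# Production solvers use a scaled variant on top of tuned BLAS/GPU kernels;
# this pure-Python version only illustrates the numerical scheme.

def mat_inv(a):
    """Gauss-Jordan inverse of a small dense matrix (list of lists)."""
    n = len(a)
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))  # partial pivoting
        aug[col], aug[piv] = aug[piv], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col]:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def sign_newton(a, iters=50):
    """Iterate X <- (X + X^{-1}) / 2 until (effectively) converged."""
    x = [row[:] for row in a]
    for _ in range(iters):
        xinv = mat_inv(x)
        x = [[(x[i][j] + xinv[i][j]) / 2 for j in range(len(x))]
             for i in range(len(x))]
    return x

# sign() of a matrix with eigenvalues 3 and -2 maps them to +1 and -1:
S = sign_newton([[3.0, 0.0], [1.0, -2.0]])
```

A useful sanity check is that sign(A) is involutory: S @ S should equal the identity to machine precision.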

  9. Evaluating the scalability of HEP software and multi-core hardware

    CERN Document Server

    Jarp, S; Leduc, J; Nowak, A

    2011-01-01

    As researchers have reached the practical limits of processor performance improvements by frequency scaling, it is clear that the future of computing lies in the effective utilization of parallel and multi-core architectures. Since this significant change in computing is well underway, it is vital for HEP programmers to understand the scalability of their software on modern hardware and the opportunities for potential improvements. This work aims to quantify the benefit of new mainstream architectures to the HEP community through practical benchmarking on recent hardware solutions, including the usage of parallelized HEP applications.

  10. Exploiting multicore compute resources in the CMS experiment

    Science.gov (United States)

    Ramírez, J. E.; Pérez-Calero Yzquierdo, A.; Hernández, J. M.; CMS Collaboration

    2016-10-01

    CMS has developed a strategy to efficiently exploit the multicore architecture of the compute resources accessible to the experiment. A coherent use of the multiple cores available in a compute node yields substantial gains in terms of resource utilization. The implemented approach makes use of the multithreading support of the event processing framework and the multicore scheduling capabilities of the resource provisioning system. Multicore slots are acquired and provisioned by means of multicore pilot agents which internally schedule and execute single and multicore payloads. Multicore scheduling and multithreaded processing are currently used in production for online event selection and prompt data reconstruction. More workflows are being adapted to run in multicore mode. This paper presents a review of the experience gained in the deployment and operation of the multicore scheduling and processing system, the current status and future plans.

  11. Compiling the functional data-parallel language SaC for Microgrids of Self-Adaptive Virtual Processors

    NARCIS (Netherlands)

    Grelck, C.; Herhut, S.; Jesshope, C.; Joslin, C.; Lankamp, M.; Scholz, S.-B.; Shafarenko, A.

    2009-01-01

    We present preliminary results from compiling the high-level, functional and data-parallel programming language SaC into a novel multi-core design: Microgrids of Self-Adaptive Virtual Processors (SVPs). The side-effect free nature of SaC in conjunction with its data-parallel foundation make it an id

  12. Fault Tolerance Middleware for a Multi-Core System

    Science.gov (United States)

    Some, Raphael R.; Springer, Paul L.; Zima, Hans P.; James, Mark; Wagner, David A.

    2012-01-01

    Fault Tolerance Middleware (FTM) provides a framework to run on a dedicated core of a multi-core system and handles detection of single-event upsets (SEUs), and the responses to those SEUs, occurring in an application running on multiple cores of the processor. This software was written expressly for a multi-core system and can support different kinds of fault strategies, such as introspection, algorithm-based fault tolerance (ABFT), and triple modular redundancy (TMR). It focuses on providing fault tolerance for the application code, and represents the first step in a plan to eventually include fault tolerance in message passing and the FTM itself. In the multi-core system, the FTM resides on a single, dedicated core, separate from the cores used by the application. This is done in order to isolate the FTM from application faults and to allow it to swap out any application core for a substitute. The structure of the FTM consists of an interface to a fault tolerant strategy module, a responder module, a fault manager module, an error factory, and an error mapper that determines the severity of the error. In the present reference implementation, the only fault tolerant strategy implemented is introspection. The introspection code waits for an application node to send an error notification to it. It then uses the error factory to create an error object, and at this time, a severity level is assigned to the error. The introspection code uses its built-in knowledge base to generate a recommended response to the error. Responses might include ignoring the error, logging it, rolling back the application to a previously saved checkpoint, swapping in a new node to replace a bad one, or restarting the application. The original error and recommended response are passed to the top-level fault manager module, which invokes the response. The responder module also notifies the introspection module of the generated response. This provides additional information to the
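
    The introspection flow described above (error notification, error factory, severity assignment, knowledge-base response) can be sketched in a few lines. All names, severity levels, and responses below are illustrative placeholders, not the actual FTM interfaces:

```python
# Toy sketch of the FTM introspection path: notification -> error object
# with severity -> recommended response. The tables are invented examples.
SEVERITY = {"seu_corrected": 1, "seu_uncorrected": 2, "node_dead": 3}

RESPONSES = {1: "log", 2: "rollback_to_checkpoint", 3: "swap_in_spare_node"}

def make_error(kind, core):
    """Error factory: build an error object and assign its severity."""
    return {"kind": kind, "core": core, "severity": SEVERITY[kind]}

def recommend(error):
    """Knowledge base: map severity to a recovery action for the responder."""
    return RESPONSES[error["severity"]]

err = make_error("seu_uncorrected", core=5)
action = recommend(err)
```

In the real middleware the recommended response is passed to a top-level fault manager that invokes it; here the mapping alone conveys the structure.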

  13. Multicore-based 3D-DWT video encoder

    Science.gov (United States)

    Galiano, Vicente; López-Granado, Otoniel; Malumbres, Manuel P.; Migallón, Hector

    2013-12-01

    Three-dimensional wavelet transform (3D-DWT) encoders are good candidates for applications like professional video editing, video surveillance, multi-spectral satellite imaging, etc. where a frame must be reconstructed as quickly as possible. In this paper, we present a new 3D-DWT video encoder based on a fast run-length coding engine. Furthermore, we present several multicore optimizations to speed-up the 3D-DWT computation. An exhaustive evaluation of the proposed encoder (3D-GOP-RL) has been performed, and we have compared the evaluation results with other video encoders in terms of rate/distortion (R/D), coding/decoding delay, and memory consumption. Results show that the proposed encoder obtains good R/D results for high-resolution video sequences with nearly in-place computation using only the memory needed to store a group of pictures. After applying the multicore optimization strategies over the 3D DWT, the proposed encoder is able to compress a full high-definition video sequence in real-time.
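
    The run-length coding engine itself is not spelled out in the abstract, but the generic idea for zero-dominated wavelet subbands is easy to sketch: emit (zero_run, value) pairs instead of raw coefficients. A minimal, hypothetical encoder/decoder pair:

```python
def rle_encode(coeffs):
    """Run-length code a stream of quantized wavelet coefficients:
    emit (zero_run_length, value) pairs; None marks trailing zeros."""
    out, run = [], 0
    for c in coeffs:
        if c == 0:
            run += 1
        else:
            out.append((run, c))
            run = 0
    if run:
        out.append((run, None))
    return out

def rle_decode(pairs):
    """Invert rle_encode: expand each zero run, then append the value."""
    out = []
    for run, val in pairs:
        out.extend([0] * run)
        if val is not None:
            out.append(val)
    return out

data = [0, 0, 5, 0, 0, 0, -3, 7, 0]
assert rle_decode(rle_encode(data)) == data
```

Because each subband (or group of pictures) can be coded independently, this stage parallelizes naturally across cores, which is what the multicore optimizations exploit.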

  14. Multicore Processing and ARTEMIS - An incentive to develop the European Multiprocessor research

    DEFF Research Database (Denmark)

    Seceleanu, Tiberius; Tenhunen, Hannu; Jerraya, Ahmed

    2006-01-01

    Even though multiprocessor architectures have been developed for a long time now, the approach was mostly focused on multi-chip realizations. Clustering computers or microprocessors on the same board was the solution to manage complex applications vs. performance requirements. It is only...... in the recent period that technological advances allow for a change of this paradigm towards on-chip distributed platforms, or multi-core, or multi-processor system-on-chip (MPSoC). A multiprocessor architecture may be defined as: on-chip clusters of heterogeneous functionality modules, cooperating...... to a traditional SoC view, concurrency at all levels plays a deterministic role, while problems such as power consumption, addressable separately in the nodes of a DS, must be considered in a unitary manner. Thus, distinct research and development issues must be defined for MPSoC, building on the indispensable experience...

  15. Highly scalable linear solvers on thousands of processors.

    Energy Technology Data Exchange (ETDEWEB)

    Domino, Stefan Paul (Sandia National Laboratories, Albuquerque, NM); Karlin, Ian (University of Colorado at Boulder, Boulder, CO); Siefert, Christopher (Sandia National Laboratories, Albuquerque, NM); Hu, Jonathan Joseph; Robinson, Allen Conrad (Sandia National Laboratories, Albuquerque, NM); Tuminaro, Raymond Stephen

    2009-09-01

    In this report we summarize research into new parallel algebraic multigrid (AMG) methods. We first provide an introduction to parallel AMG. We then discuss our research in parallel AMG algorithms for very large scale platforms. We detail significant improvements to a matrix-matrix multiplication kernel in the AMG setup phase. We present a smoothed aggregation AMG algorithm with fewer communication synchronization points, and discuss its links to domain decomposition methods. Finally, we discuss a multigrid smoothing technique that utilizes two message passing layers for use on multicore processors.

  16. High performance deformable image registration algorithms for manycore processors

    CERN Document Server

    Shackleford, James; Sharp, Gregory

    2013-01-01

    High Performance Deformable Image Registration Algorithms for Manycore Processors develops highly data-parallel image registration algorithms suitable for use on modern multi-core architectures, including graphics processing units (GPUs). Focusing on deformable registration, we show how to develop data-parallel versions of the registration algorithm suitable for execution on the GPU. Image registration is the process of aligning two or more images into a common coordinate frame and is a fundamental step to be able to compare or fuse data obtained from different sensor measurements. E

  17. Median and Morphological Specialized Processors for a Real-Time Image Data Processing

    Directory of Open Access Journals (Sweden)

    Kazimierz Wiatr

    2002-01-01

    Full Text Available This paper presents considerations on selecting a multiprocessor MISD architecture for fast implementation of vision image processing. Drawing on the author's earlier experience with real-time systems, the implementation of specialized hardware processors based on programmable FPGA devices in a pipeline architecture is proposed. In particular, the following processors are presented: a median filter and a morphological processor. The structure of a universal reconfigurable processor is proposed as well. Experimental results are presented as delays at the LCA implementation level for the median filter, morphological processor, convolution processor, look-up-table processor, logic processor and histogram processor. These times are compared with the delays of a general-purpose processor and a DSP processor.
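
    The median filter stage is simple to state in software; the FPGA processor computes the same 3x3 window median with a pipelined sorting network. A plain-Python reference sketch (interior pixels only, for illustration):

```python
def median3x3(img):
    """3x3 median filter over the interior of an image given as a list of
    rows. Border pixels are copied unchanged; the median is the 5th of the
    9 sorted window values."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = sorted(img[y + dy][x + dx]
                            for dy in (-1, 0, 1) for dx in (-1, 0, 1))
            out[y][x] = window[4]
    return out

# A single impulse ("salt" noise) is removed while the background survives:
noisy = [[0, 0, 0], [0, 255, 0], [0, 0, 0]]
clean = median3x3(noisy)
```

The hardware version replaces the `sorted` call with a fixed network of compare-exchange elements, which is why it sustains one pixel per clock.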

  18. Markov Decision Process Based Energy-Efficient On-Line Scheduling for Slice-Parallel Video Decoders on Multicore Systems

    CERN Document Server

    Mastronarde, Nicholas; Atienza, David; Frossard, Pascal; van der Schaar, Mihaela

    2011-01-01

    We consider the problem of energy-efficient on-line scheduling for slice-parallel video decoders on multicore systems. We assume that each of the processors are Dynamic Voltage Frequency Scaling (DVFS) enabled such that they can independently trade off performance for power, while taking the video decoding workload into account. In the past, scheduling and DVFS policies in multi-core systems have been formulated heuristically due to the inherent complexity of the on-line multicore scheduling problem. The key contribution of this report is that we rigorously formulate the problem as a Markov decision process (MDP), which simultaneously takes into account the on-line scheduling and per-core DVFS capabilities; the separate power consumption of the processor cores and caches; and the loss tolerant and dynamic nature of the video decoder's traffic. In particular, we model the video traffic using a Direct Acyclic Graph (DAG) to capture the precedence constraints among frames in a Group of Pictures (GOP) structure, ...

  19. Extending Automatic Parallelization to Optimize High-Level Abstractions for Multicore

    Energy Technology Data Exchange (ETDEWEB)

    Liao, C; Quinlan, D J; Willcock, J J; Panas, T

    2008-12-12

    Automatic introduction of OpenMP for sequential applications has attracted significant attention recently because of the proliferation of multicore processors and the simplicity of using OpenMP to express parallelism for shared-memory systems. However, most previous research has only focused on C and Fortran applications operating on primitive data types. C++ applications using high-level abstractions, such as STL containers and complex user-defined types, are largely ignored due to the lack of research compilers that are readily able to recognize high-level object-oriented abstractions and leverage their associated semantics. In this paper, we automatically parallelize C++ applications using ROSE, a multiple-language source-to-source compiler infrastructure which preserves the high-level abstractions and gives us access to their semantics. Several representative parallelization candidate kernels are used to explore semantic-aware parallelization strategies for high-level abstractions, combined with extended compiler analyses. Those kernels include an array-based computation loop, a loop with task-level parallelism, and a domain-specific tree traversal. Our work extends the applicability of automatic parallelization to modern applications using high-level abstractions and exposes more opportunities to take advantage of multicore processors.

  20. Optimization of the coherence function estimation for multi-core central processing unit

    Science.gov (United States)

    Cheremnov, A. G.; Faerman, V. A.; Avramchuk, V. S.

    2017-02-01

    The paper considers the use of parallel processing on a multi-core central processing unit for optimization of the coherence function evaluation arising in digital signal processing. The coherence function, along with other methods of spectral analysis, is commonly used for vibration diagnosis of rotating machinery and its particular nodes. An algorithm is given for evaluating the function for signals represented by digital samples. The algorithm is analyzed for its software implementation and computational problems. Optimization measures are described, including algorithmic, architecture and compiler optimization, and their results are assessed for multi-core processors from different manufacturers. Thus, the speed-up of parallel execution with respect to sequential execution was studied, and results are presented for Intel Core i7-4720HQ and AMD FX-9590 processors. The results show comparatively high efficiency of the optimization measures taken. In particular, acceleration indicators and average CPU utilization have been significantly improved, showing a high degree of parallelism in the constructed calculation functions. The developed software underwent state registration and will be used as a part of a software and hardware solution for rotating machinery fault diagnosis and pipeline leak location with the acoustic correlation method.
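
    The quantity being optimized is the magnitude-squared coherence C_xy(f) = |Pxy(f)|^2 / (Pxx(f) Pyy(f)), with the spectra averaged over signal segments. A small pure-Python sketch (naive O(n^2) DFT instead of an FFT, non-overlapping rectangular segments) shows the structure; each frequency bin is independent, which is exactly what makes the evaluation parallelizable across cores:

```python
import cmath
import random

def dft(x):
    """Naive DFT, used here only to keep the sketch dependency-free."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]

def coherence(x, y, seg_len):
    """Magnitude-squared coherence |Pxy|^2 / (Pxx * Pyy), averaging
    periodograms over non-overlapping segments (Welch-style, no window)."""
    nseg = len(x) // seg_len
    pxx = [0.0] * seg_len
    pyy = [0.0] * seg_len
    pxy = [0j] * seg_len
    for s in range(nseg):
        X = dft(x[s * seg_len:(s + 1) * seg_len])
        Y = dft(y[s * seg_len:(s + 1) * seg_len])
        for f in range(seg_len):
            pxx[f] += abs(X[f]) ** 2
            pyy[f] += abs(Y[f]) ** 2
            pxy[f] += X[f] * Y[f].conjugate()
    # Tiny epsilon guards against division by zero in empty bins.
    return [abs(pxy[f]) ** 2 / (pxx[f] * pyy[f] + 1e-30) for f in range(seg_len)]

random.seed(1)
sig = [random.uniform(-1, 1) for _ in range(256)]
coh = coherence(sig, sig, 32)   # a signal is fully coherent with itself
```

A production version would use windowed, overlapped segments and an FFT, and would split the segment or frequency loops across threads or processes.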

  1. A Synthesizable Multicore Platform for Microwave Imaging

    DEFF Research Database (Denmark)

    Schleuniger, Pascal; Karlsson, Sven

    2014-01-01

    of one cycle per network hop and attain a high clock frequency by pipelining the feedback loop to manage contention. We implement a multicore configuration with 48 cores and achieve a clock frequency as high as 300 MHz with a peak switching data rate of 9.6 Gbits/s per link on state-of-the-art FPGAs....

  2. Reconfigurable Multicore Architectures for Streaming Applications

    NARCIS (Netherlands)

    Smit, Gerardus Johannes Maria; Kokkeler, Andre B.J.; Rauwerda, G.K.; Jacobs, J.W.M.; Nicolescu, G.; Mosterman, P.J.

    2009-01-01

    This chapter addresses reconfigurable heterogenous and homogeneous multicore system-on-chip (SoC) platforms for streaming digital signal processing applications, also called DSP applications. In streaming DSP applications, computations can be specified as a data flow graph with streams of data items

  3. Distributed radiofrequency signal processing using multicore fibers

    Science.gov (United States)

    Garcia, S.; Gasulla, I.

    2016-11-01

    Next generation fiber-wireless communication paradigms will require new technologies to address the current limitations to massive capacity, connectivity and flexibility. Multicore optical fibers, which were conceived for high-capacity digital communications, can bring numerous advantages to fiber-wireless radio access architectures. Besides radio over fiber parallel distribution and multiple antenna connectivity, multicore fibers can implement, at the same time, a variety of broadband processing functionalities for microwave and millimeter-wave signals. This approach leads to the novel concept of "fiber-distributed signal processing". In particular, we capitalize on the spatial parallelism inherent to multicore fibers to implement a broadband tunable true time delay line, which is the basis of multiple processing applications such as signal filtering, arbitrary waveform generation and squint-free radio beamsteering. We present the design of trench-assisted heterogeneous multicore fibers composed of cores featuring individual spectral group delays and chromatic dispersion profiles. Besides fulfilling the requirements for true time delay line operation, the MCFs are optimized in terms of higher-order dispersion, crosstalk and bend sensitivity. Microwave photonics signal processing will benefit from the performance stability, 2D operation versatility and compactness brought by the reported fiber-integrated solution.

  4. Multicore in Production: Advantages and Limits of the Multiprocess Approach

    CERN Document Server

    Binet, S; The ATLAS collaboration; Lavrijsen, W; Leggett, Ch; Lesny, D; Jha, M K; Severini, H; Smith, D; Snyder, S; Tatarkhanov, M; Tsulaia, V; van Gemmeren, P; Washbrook, A

    2011-01-01

    The shared memory architecture of multicore CPUs provides HENP developers with the opportunity to reduce the memory footprint of their applications by sharing memory pages between the cores in a processor. ATLAS pioneered the multi-process approach to parallelizing HENP applications. Using Linux fork() and the Copy On Write mechanism, we implemented a simple event task farm which allows sharing up to 50% of memory pages among event worker processes with negligible CPU overhead. By leaving the task of managing shared memory pages to the operating system, we have been able to run large reconstruction and simulation applications, originally written to be run in a single thread of execution, in parallel with little to no change to the application code. In spite of this, the process of validating athena multi-process for production took ten months of concentrated effort and is expected to continue for several more months. In general terms, we had two classes of problems in the multi-process port: merging the output fil...
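
    The fork()-based event task farm can be sketched in a few lines of POSIX Python. This is a generic illustration of the pattern (squaring an integer stands in for event reconstruction; chunking and pipe transport are invented details), not the ATLAS code: pages holding the parent's code, geometry and calibration data are shared copy-on-write with every worker.

```python
import os
import pickle

def process_event(e):
    # Stand-in for per-event reconstruction work.
    return e * e

def fork_farm(events, n_workers=4):
    """Minimal fork()/copy-on-write event task farm: each child processes a
    slice of the events and sends its results back through a pipe."""
    chunks = [events[i::n_workers] for i in range(n_workers)]
    pipes = []
    for chunk in chunks:
        r, w = os.pipe()
        pid = os.fork()
        if pid == 0:                     # child: work, report, exit
            os.close(r)
            os.write(w, pickle.dumps([process_event(e) for e in chunk]))
            os._exit(0)
        os.close(w)                      # parent keeps only the read end
        pipes.append((pid, r))
    results = []
    for pid, r in pipes:
        buf = b""
        while True:
            part = os.read(r, 65536)
            if not part:                 # EOF once the child has exited
                break
            buf += part
        os.close(r)
        os.waitpid(pid, 0)
        results.extend(pickle.loads(buf))
    return results

merged = sorted(fork_farm(list(range(100))))
```

Real deployments add checkpointing, output merging (the hard part noted above), and larger transport buffers, but the memory-sharing benefit comes entirely from fork() itself.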

  5. Scalable High-Performance Parallel Design for Network Intrusion Detection Systems on Many-Core Processors

    OpenAIRE

    Jiang, Hayang; Xie, Gaogang; Salamatian, Kavé; Mathy, Laurent

    2013-01-01

    Network Intrusion Detection Systems (NIDSes) face significant challenges coming from the relentless network link speed growth and increasing complexity of threats. Both hardware accelerated and parallel software-based NIDS solutions, based on commodity multi-core and GPU processors, have been proposed to overcome these challenges. ...

  6. Embedded Processor Laboratory

    Data.gov (United States)

    Federal Laboratory Consortium — The Embedded Processor Laboratory provides the means to design, develop, fabricate, and test embedded computers for missile guidance electronics systems in support...

  7. Broadband monitoring simulation with massively parallel processors

    Science.gov (United States)

    Trubetskov, Mikhail; Amotchkina, Tatiana; Tikhonravov, Alexander

    2011-09-01

    Modern efficient optimization techniques, namely needle optimization and gradual evolution, enable one to design optical coatings of any type. Even more, these techniques allow obtaining multiple solutions with close spectral characteristics. It is important, therefore, to develop software tools that allow one to choose a practically optimal solution from a wide variety of possible theoretical designs. A practically optimal solution provides the highest production yield when the optical coating is manufactured. Computational manufacturing is a low-cost tool for choosing a practically optimal solution. The theory of probability predicts that reliable production yield estimations require many hundreds or even thousands of computational manufacturing experiments. As a result, reliable estimation of the production yield may require too much computational time. The most time-consuming operation is calculation of the discrepancy function used by a broadband monitoring algorithm. This function is formed by a sum of terms over a wavelength grid. These terms can be computed simultaneously in different threads of computation, which opens great opportunities for parallelization. Multi-core and multi-processor systems can provide speed-ups of several times. Additional potential for further acceleration of computations is connected with using Graphics Processing Units (GPU). A modern GPU consists of hundreds of massively parallel processors and is capable of performing floating-point operations efficiently.
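
    Because the discrepancy function is a sum of independent per-wavelength terms, the parallelization is an embarrassingly parallel map-reduce. A sketch with a toy term (the transmittance model and wavelength grid are invented for illustration; CPython threads only illustrate the decomposition, since true multicore speed-up needs processes or GPU kernels):

```python
import math
from concurrent.futures import ThreadPoolExecutor

def term(w):
    """One per-wavelength term of the discrepancy function: squared
    difference between a toy 'measured' and 'computed' transmittance at
    wavelength w (nm). Real terms come from thin-film calculations."""
    measured = 0.9
    computed = 0.9 + 0.05 * math.sin(w / 50.0)
    return (measured - computed) ** 2

def discrepancy(wavelengths, n_workers=4):
    """Split the wavelength grid across workers and reduce with a sum.
    On a GPU, each term would map to one thread instead."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(term, wavelengths))

grid = range(400, 800)
parallel_sum = discrepancy(grid)
```

The same value is obtained sequentially, so correctness of the decomposition is trivial to verify, and the yield estimation loop simply repeats this for each simulated manufacturing run.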

  8. FAST

    DEFF Research Database (Denmark)

    Zuidmeer-Jongejan, Laurian; Fernandez-Rivas, Montserrat; Poulsen, Lars K.

    2012-01-01

    ABSTRACT: The FAST project (Food Allergy Specific Immunotherapy) aims at the development of safe and effective treatment of food allergies, targeting prevalent, persistent and severe allergy to fish and peach. Classical allergen-specific immunotherapy (SIT), using subcutaneous injections with aqueous food extracts, may be effective but has proven to be accompanied by too many anaphylactic side-effects. FAST aims to develop a safe alternative by replacing food extracts with hypoallergenic recombinant major allergens as the active ingredients of SIT. Both severe fish and peach allergy are caused...... in-depth serological and cellular immune analyses will be performed, allowing identification of novel biomarkers for monitoring treatment efficacy. FAST aims at improving the quality of life of food allergic patients by providing a safe and effective treatment that will significantly lower their threshold...

  9. The UA1 trigger processor

    CERN Document Server

    Grayer, G H

    1981-01-01

    Experiment UA1 is a large multipurpose spectrometer at the CERN proton-antiproton collider. The principal trigger is formed on the basis of the energy deposition in calorimeters. A trigger decision taken in under 2.4 microseconds can avoid dead-time losses due to the bunched nature of the beam. To achieve this, fast 8-bit charge-to-digital converters have been built, followed by two identical digital processors tailored to the experiment. The outputs of groups of the 2440 photomultipliers in the calorimeters are summed to form a total of 288 input channels to the ADCs. A look-up table in RAM is used to convert the digitised photomultiplier signals to energy in one processor, and to transverse energy in the other. Each processor forms four sums from a chosen combination of input channels, and also counts the number of clusters with electromagnetic or hadronic energy above pre-determined levels. Up to twelve combinations of these conditions, together with external information, may be combined in coincidence or in...
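
    The look-up-table arithmetic is the key trick: instead of multiplying in the trigger path, each 8-bit ADC count directly indexes a precomputed RAM table holding energy (one processor) or transverse energy E sin(theta) (the other). A sketch with invented calibration values; the gain and channel angle are illustrative, not the experiment's constants:

```python
import math

GAIN = 0.25   # hypothetical GeV per ADC count

def build_tables(theta):
    """Precompute the two 256-entry RAM tables for one channel at polar
    angle theta: energy and transverse energy per ADC count."""
    energy_lut = [GAIN * adc for adc in range(256)]
    et_lut = [GAIN * adc * math.sin(theta) for adc in range(256)]
    return energy_lut, et_lut

# One channel at 90 degrees; the trigger then just sums table look-ups.
e_lut, et_lut = build_tables(theta=math.pi / 2)
hits = [10, 200, 37]                        # ADC counts from three bunches
total_e = sum(e_lut[adc] for adc in hits)
total_et = sum(et_lut[adc] for adc in hits)
```

Because a table read plus an adder pipeline runs at fixed latency, this structure is what lets the decision complete within the 2.4 microsecond budget.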

  10. Array processors in chemistry

    Energy Technology Data Exchange (ETDEWEB)

    Ostlund, N.S.

    1980-01-01

    The field of attached scientific processors ("array processors") is surveyed, and an attempt is made to indicate their present and possible future use in computational chemistry. The current commercial products from Floating Point Systems, Inc., Datawest Corporation, and CSP, Inc. are discussed.

  11. Hardware/Software Co-design for Heterogeneous Multi-core Platforms: The hArtes Toolchain

    CERN Document Server

    2012-01-01

    This book describes the results and outcome of the FP6 project, known as hArtes, which focuses on the development of an integrated tool chain targeting a heterogeneous multi-core platform comprising a general-purpose processor (ARM or PowerPC), a DSP (the Diopsis) and an FPGA. The tool chain takes existing source code and proposes transformations and mappings such that legacy code can easily be ported to a modern, multi-core platform. Benefits of the hArtes approach, described in this book, include: Uses a familiar programming paradigm: hArtes proposes a familiar programming paradigm which is compatible with widely used programming practice, irrespective of the target platform. Enables users to view multiple cores as a single processor: the hArtes approach abstracts away the heterogeneity as well as the multi-core aspect of the underlying hardware, so the developer can view the platform as consisting of a single, general-purpose processor. Facilitates easy porting of existing applications: hArtes provid...

  12. Implementing High Performance Lexical Analyzer using CELL Broadband Engine Processor

    Directory of Open Access Journals (Sweden)

    P.J.SATHISH KUMAR

    2011-09-01

    Full Text Available The lexical analyzer is the first phase of the compiler and commonly the most time consuming. The compilation of large programs is still far from optimized in today's compilers. With modern processors moving more towards improving parallelization and multithreading, older compilers can no longer deliver performance gains as technology advances. Any multicore architecture relies on improving parallelism rather than on improving single-core performance. A compiler that is completely parallel and optimized is yet to be developed and would require significant effort to create. On careful analysis we find that the performance of a compiler is majorly affected by the lexical analyzer's scanning and tokenizing phases. This effort is directed towards the creation of a completely parallelized lexical analyzer designed to run on the Cell/B.E. processor that utilizes its multicore functionalities to achieve high performance gains in a compiler. Each SPE reads a block of data from the input and tokenizes it independently. To prevent dependence among SPEs, a scheme for dynamically extending static block limits is incorporated. Each SPE is given a range which it initially scans and then finalizes its input buffer to a set of complete tokens from the range dynamically. This ensures that the SPEs parallelize independently and dynamically, with the PPE scheduling the load for each SPE. The initially static assignment of the code blocks is made dynamic as soon as one SPE commits. This aids SPE load distribution and balancing. The PPE maintains the output buffer until all SPEs of a single stage commit and move to the next stage before it is written out to the file, to maintain order of execution. The approach can be extended easily to other multicore architectures as well. Tokenization is performed by high-speed string searching over the keyword dictionary of the language, using the Aho-Corasick algorithm.
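
    The Aho-Corasick matcher named above finds every dictionary keyword in a single pass over the input, which is what makes per-SPE tokenization linear in the block size. A compact reference sketch (trie plus failure links), independent of the Cell-specific buffering:

```python
from collections import deque

def build_automaton(keywords):
    """Build an Aho-Corasick automaton: goto trie, failure links, and the
    set of keywords recognized at each state."""
    goto, fail, out = [{}], [0], [set()]
    for w in keywords:
        s = 0
        for ch in w:
            if ch not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(w)
    q = deque(goto[0].values())          # depth-1 states fail to the root
    while q:
        s = q.popleft()
        for ch, t in goto[s].items():
            q.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0) if goto[f].get(ch, 0) != t else 0
            out[t] |= out[fail[t]]       # inherit matches via the fail link
    return goto, fail, out

def search(text, automaton):
    """Single left-to-right pass; report (start_index, keyword) matches."""
    goto, fail, out = automaton
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for w in out[s]:
            hits.append((i - len(w) + 1, w))
    return hits

ac = build_automaton(["he", "she", "his", "hers"])
matches = sorted(search("ushers", ac))
```

The dynamic block-limit scheme then only needs each worker to extend its scan past a static boundary until the automaton returns to a token boundary.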

  13. The Multicore-Aware Data Transfer Middleware Project

    Energy Technology Data Exchange (ETDEWEB)

    Wu, Wenji [Fermilab; Zhang, L. [Fermilab; DeMar, P. [Fermilab; Li, Tan [Stony Brook U.; Ren, Y. [Stony Brook U.; Jin, S. [stony Brook U.; Yu, D. [Brookhaven

    2014-10-30

    Existing data movement tools are still bound by major inefficiencies when running on multicore systems. To address these inefficiencies and limitations, DOE’s Advanced Scientific Computing Research (ASCR) office has funded Fermilab and Brookhaven National Laboratory to collaboratively work on the Multicore-Aware Data Transfer Middleware (MDTM) project. MDTM aims to accelerate data movement toolkits on multicore systems. A prototype version of MDTM is currently undergoing evaluation and enhancement.

  14. MPC Related Computational Capabilities of ARMv7A Processors

    DEFF Research Database (Denmark)

    Frison, Gianluca; Jørgensen, John Bagterp

    2015-01-01

    In recent years, the mass market of mobile devices has pushed the demand for increasingly fast but cheap processors. ARM, the world leader in this sector, has developed the Cortex-A series of processors with focus on computationally intensive applications. If properly programmed, these processors...... are powerful enough to solve the complex optimization problems arising in MPC in real-time, while keeping the traditional low-cost and low-power consumption. This makes these processors ideal candidates for use in embedded MPC. In this paper, we investigate the floating-point capabilities of Cortex A7, A9...

  15. Multicore-periphery structure in networks

    CERN Document Server

    Yan, Bowen

    2016-01-01

    Many real-world networked systems exhibit a multicore-periphery structure, i.e., multiple cores, each of which contains densely connected elements, surrounded by sparsely connected elements that define the periphery. Identification of the multicore-periphery structure can provide a new handle on structures and functions of various complex networks, such as cognitive and biological networks, food webs, social networks, and communication and transportation networks. However, no quantitative method yet exists to identify the multicore-periphery structure embedded in networks. Prior studies on core-periphery structure focused on either dichotomous or continuous division of a network into a single core and a periphery, whereas community detection algorithms do not discern the periphery from dense cohesive communities. Herein, we introduce a method to identify the optimal partition of a network into multiple dense cores and a loosely-connected periphery, and test the method on a well-known social network and the ...

  16. Multi-core operations setup in Modern PCs under Linux

    OpenAIRE

    相山, 長和; Aiyama, Toshikazu

    2012-01-01

    Most modern personal computers have a multi-core central processing unit, and the graphics processing unit is also multi-core. We present here a detailed installation process of an environment which enables us to exploit the multiple cores of each architecture. Special attention is given to 64-bit operation and double precision under Linux. The main purpose of this paper is to show a framework for multi-threaded and multi-core operations in both the central processing unit and the graphical processing ...

  17. MultIMA - Multi-Core in Integrated Modular Avionics

    Science.gov (United States)

    Silva, Claudio; Tatibana, Cassia

    2014-08-01

    Multi-core technologies are the natural trend towards fulfilling recent space applications requirements. However, the adoption of multi-core implies increased complexity that must be addressed by application redesign or the implementation of explicit supporting mechanisms. GMV investigates multi-core and Integrated Modular Avionics as cooperative vehicles to achieve reliable support for future safety critical applications. In this paper, we describe the main challenges met in our investigations and how multi-core solutions were implemented in GMV's IMA simulator (SIMA) and operating system (AIR).

  18. Scientific computing with multicore and accelerators

    CERN Document Server

    Kurzak, Jakub; Dongarra, Jack

    2010-01-01

    Dense Linear Algebra: Implementing Matrix Multiplication on the Cell B.E., Wesley Alvaro, Jakub Kurzak, and Jack Dongarra; Implementing Matrix Factorizations on the Cell BE, Jakub Kurzak and Jack Dongarra; Dense Linear Algebra for Hybrid GPU-Based Systems, Stanimire Tomov and Jack Dongarra; BLAS for GPUs, Rajib Nath, Stanimire Tomov, and Jack Dongarra. Sparse Linear Algebra: Sparse Matrix-Vector Multiplication on Multicore and Accelerators, Samuel Williams, Nathan B

  19. Benchmarking a DSP processor

    OpenAIRE

    Lennartsson, Per; Nordlander, Lars

    2002-01-01

    This Master thesis describes the benchmarking of a DSP processor. Benchmarking means measuring the performance in some way. In this report, we have focused on the number of instruction cycles needed to execute certain algorithms. The algorithms we have used in the benchmark are all very common in signal processing today. The results we have reached in this thesis have been compared to benchmarks for other processors, performed by Berkeley Design Technology, Inc. The algorithms were programm...

  20. A generic and compositional framework for multicore response time analysis

    NARCIS (Netherlands)

    Altmeyer, S.; Davis, R.I.; Indrusiak, L.; Maiza, C.; Nelis, V.; Reineke, J.

    2015-01-01

    In this paper, we introduce a Multicore Response Time Analysis (MRTA) framework. This framework is extensible to different multicore architectures, with various types and arrangements of local memory, and different arbitration policies for the common interconnects. We instantiate the framework for s

  1. Real time processor for array speckle interferometry

    Science.gov (United States)

    Chin, Gordon; Florez, Jose; Borelli, Renan; Fong, Wai; Miko, Joseph; Trujillo, Carlos

    1989-01-01

    The authors are constructing a real-time processor to acquire image frames, perform array flat-fielding, execute a 64 x 64 element two-dimensional complex FFT (fast Fourier transform) and average the power spectrum, all within the 25 ms coherence time for speckles at near-IR (infrared) wavelength. The processor will be a compact unit controlled by a PC with real-time display and data storage capability. This will provide the ability to optimize observations and obtain results on the telescope rather than waiting several weeks before the data can be analyzed and viewed with offline methods. The image acquisition and processing, design criteria, and processor architecture are described.

  2. T-L Plane Abstraction-Based Energy-Efficient Real-Time Scheduling for Multi-Core Wireless Sensors.

    Science.gov (United States)

    Kim, Youngmin; Lee, Ki-Seong; Pham, Ngoc-Son; Lee, Sun-Ro; Lee, Chan-Gun

    2016-07-08

    Energy efficiency is considered as a critical requirement for wireless sensor networks. As more wireless sensor nodes are equipped with multi-cores, there are emerging needs for energy-efficient real-time scheduling algorithms. The T-L plane-based scheme is known to be an optimal global scheduling technique for periodic real-time tasks on multi-cores. Unfortunately, there has been a scarcity of studies on extending T-L plane-based scheduling algorithms to exploit energy-saving techniques. In this paper, we propose a new T-L plane-based algorithm enabling energy-efficient real-time scheduling on multi-core sensor nodes with dynamic power management (DPM). Our approach addresses the overhead of processor mode transitions and reduces fragmentations of the idle time, which are inherent in T-L plane-based algorithms. Our experimental results show the effectiveness of the proposed algorithm compared to other energy-aware scheduling methods on T-L plane abstraction.

  3. An Energy-Aware Runtime Management of Multi-Core Sensory Swarms

    Directory of Open Access Journals (Sweden)

    Sungchan Kim

    2017-08-01

    Full Text Available In sensory swarms, minimizing energy consumption under a performance constraint is one of the key objectives. One possible approach to this problem is to monitor the application workload, which is subject to change at runtime, and to adjust the system configuration adaptively to satisfy the performance goal. As today’s sensory swarms are usually implemented using multi-core processors with adjustable clock frequency, we propose to monitor the CPU workload periodically and adjust the task-to-core allocation or clock frequency in an energy-efficient way in response to the workload variations. In doing so, we present an online heuristic that determines the most energy-efficient adjustment that satisfies the performance requirement. The proposed method is based on a simple yet effective energy model that is built upon performance prediction using IPC (instructions per cycle) measured online and a power equation derived empirically. The use of IPC accounts for the memory intensity of a given workload, enabling the accurate prediction of execution time. Hence, the model allows us to rapidly and accurately estimate the effect of the two control knobs, clock frequency adjustment and core allocation. The experiments show that the proposed technique delivers considerable energy saving of up to 45% compared to the state-of-the-art multi-core energy management technique.
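    The prediction step described above can be sketched in a few lines: execution time is estimated as instructions / (IPC × frequency), and the controller picks the lowest clock frequency that still meets the deadline. All function names and the numbers below are illustrative, not the paper's actual model:

    ```python
    # Minimal sketch of IPC-based frequency selection (illustrative, not the
    # paper's exact heuristic): predict execution time from a measured IPC and
    # choose the lowest frequency setting that still meets the deadline.

    def predicted_time(instructions, ipc, freq_hz):
        """Execution time = instructions / (IPC * frequency)."""
        return instructions / (ipc * freq_hz)

    def lowest_sufficient_freq(instructions, ipc, deadline_s, freqs_hz):
        """Return the lowest frequency whose predicted time meets the deadline."""
        for f in sorted(freqs_hz):
            if predicted_time(instructions, ipc, f) <= deadline_s:
                return f
        return max(freqs_hz)  # deadline unreachable: fall back to fastest setting

    freqs = [600e6, 800e6, 1.0e9, 1.2e9]  # illustrative DVFS operating points
    f = lowest_sufficient_freq(instructions=1.2e9, ipc=1.5, deadline_s=1.0,
                               freqs_hz=freqs)
    print(f)  # lowest frequency meeting the 1 s deadline
    ```

    Because IPC already folds in memory stalls, the same instruction count at a lower IPC yields a longer predicted time and hence a higher chosen frequency.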

  4. CRBLASTER: a fast parallel-processing program for cosmic ray rejection

    Science.gov (United States)

    Mighell, Kenneth J.

    2008-08-01

    Many astronomical image-analysis programs are based on algorithms that can be described as embarrassingly parallel, where the analysis of one subimage generally does not affect the analysis of another subimage. Yet few parallel-processing astrophysical image-analysis programs exist that can easily take full advantage of today's fast multi-core servers costing a few thousand dollars. A major reason for the shortage of state-of-the-art parallel-processing astrophysical image-analysis codes is that the writing of parallel codes has been perceived to be difficult. I describe a new fast parallel-processing image-analysis program called crblaster which does cosmic ray rejection using van Dokkum's L.A.Cosmic algorithm. crblaster is written in C using the industry standard Message Passing Interface (MPI) library. Processing a single 800×800 HST WFPC2 image takes 1.87 seconds using 4 processes on an Apple Xserve with two dual-core 3.0-GHz Intel Xeons; the efficiency of the program running with the 4 processors is 82%. The code can be used as a software framework for easy development of parallel-processing image-analysis programs using embarrassingly parallel algorithms; the biggest required modification is the replacement of the core image processing function with an alternative image-analysis function based on a single-processor algorithm. I describe the design, implementation and performance of the program.
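    The embarrassingly parallel pattern described above can be sketched as follows (in Python with a thread pool for brevity; crblaster itself is C using MPI processes). The doubling "filter" is a stand-in for a real single-processor algorithm such as cosmic ray rejection:

    ```python
    # Sketch of the split / map / reassemble pattern: partition an image into
    # horizontal strips, filter each strip independently, then stitch the
    # results back together. Workers never touch each other's strips, which is
    # what makes the problem embarrassingly parallel.
    from concurrent.futures import ThreadPoolExecutor

    def filter_strip(strip):
        # Stand-in per-pixel operation; real code would run L.A.Cosmic here.
        return [[px * 2 for px in row] for row in strip]

    def process_image(image, nworkers=4):
        step = (len(image) + nworkers - 1) // nworkers
        strips = [image[i:i + step] for i in range(0, len(image), step)]
        with ThreadPoolExecutor(max_workers=nworkers) as ex:
            done = list(ex.map(filter_strip, strips))
        return [row for strip in done for row in strip]

    img = [[1, 2], [3, 4], [5, 6], [7, 8]]
    print(process_image(img, nworkers=2))
    ```

    Swapping in a different single-processor `filter_strip` is the only change needed to reuse the framework, mirroring the modification the abstract describes.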

  5. Real-time Vision using FPGAs, GPUs and Multi-core CPUs

    DEFF Research Database (Denmark)

    Kjær-Nielsen, Anders

    the introduction and evolution of a wide variety of powerful hardware architectures have made the developed theory more applicable in performance demanding and real-time applications. Three different architectures have dominated the field due to their parallel capabilities that are often desired when dealing...... processors in the vision community. The introduction of programming languages like CUDA from NVIDIA has made it easier to utilize the high parallel processing powers of the GPU for general purpose computing and thereby realistic to use based on the effort involved with development. The increased clock......-linear filtering processes on FPGAs that has been used for preprocessing images in the context of a bigger Early Cognitive Vision (ECV) system. With the introduction of GPUs for general purpose computing the preprocessing was re-implemented on this architecture and used together with a multi-core CPU to form...

  6. Multi-Core DSP Based Parallel Architecture for FMCW SAR Real-Time Imaging

    Directory of Open Access Journals (Sweden)

    C. F. Gu

    2015-12-01

    Full Text Available This paper presents an efficient parallel processing architecture using a multi-core Digital Signal Processor (DSP) to improve the capability of real-time imaging for Frequency Modulated Continuous Wave Synthetic Aperture Radar (FMCW SAR). With the application of the proposed processing architecture, the imaging algorithm is modularized, and each module is efficiently realized by the proposed processing architecture. In each module, the data processing of the different cores is executed in parallel, and the data transmission and data processing of each core are carried out synchronously, so that the processing time for SAR imaging is reduced significantly. In particular, the time of the corner turning operation, which is very time-consuming, becomes negligible in the computationally intensive case. The proposed parallel architecture is applied to a compact Ku-band FMCW SAR prototype to achieve real-time imagery with 34 cm x 51 cm (range x azimuth) resolution.

  7. Accelerating patch-based directional wavelets with multicore parallel computing in compressed sensing MRI.

    Science.gov (United States)

    Li, Qiyue; Qu, Xiaobo; Liu, Yunsong; Guo, Di; Lai, Zongying; Ye, Jing; Chen, Zhong

    2015-06-01

    Compressed sensing MRI (CS-MRI) is a promising technology to accelerate magnetic resonance imaging. Both improving the image quality and reducing the computation time are important for this technology. Recently, a patch-based directional wavelet (PBDW) has been applied in CS-MRI to improve edge reconstruction. However, this method is time consuming since it involves extensive computations, including geometric direction estimation and numerous iterations of wavelet transform. To accelerate the computations of PBDW, we propose a general parallelization of patch-based processing by taking advantage of multicore processors. Additionally, two pertinent optimizations, excluding smooth patches and pre-arranged insertion sort, that make use of sparsity in MR images are also proposed. Simulation results demonstrate that the acceleration factor with the parallel architecture of PBDW approaches the number of central processing unit cores, and that the pertinent optimizations are also effective in making further accelerations. The proposed approaches allow compressed sensing MRI reconstruction to be accomplished within several seconds.
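    The two ideas above, parallelizing over patches and skipping smooth patches, can be sketched roughly as follows. The patch size, variance threshold, and the trivial "transform" are illustrative stand-ins, not the paper's actual PBDW computation:

    ```python
    # Sketch: process patches in parallel and short-circuit "smooth" patches
    # whose variance falls below a threshold, so the costly direction
    # estimation + wavelet step runs only on structured patches.
    from concurrent.futures import ThreadPoolExecutor

    def variance(patch):
        flat = [v for row in patch for v in row]
        mean = sum(flat) / len(flat)
        return sum((v - mean) ** 2 for v in flat) / len(flat)

    def transform_patch(patch, smooth_threshold=1e-3):
        if variance(patch) < smooth_threshold:  # smooth patch: skip costly work
            return patch
        # Stand-in for direction estimation + directional wavelet transform.
        return [[v + 1 for v in row] for row in patch]

    def transform_all(patches, workers=4):
        # Patches are independent, so they map cleanly onto a worker pool.
        with ThreadPoolExecutor(max_workers=workers) as ex:
            return list(ex.map(transform_patch, patches))

    smooth = [[5, 5], [5, 5]]   # zero variance: returned untouched
    edgy = [[0, 9], [9, 0]]     # high variance: gets the full transform
    print(transform_all([smooth, edgy]))
    ```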

  8. Efficient Co-Simulation of Multicore Systems

    DEFF Research Database (Denmark)

    Brock-Nannestad, Laust; Karlsson, Sven

    2011-01-01

    the hardware state of a multicore design while it is running on an FPGA. With minimal changes to the design and using only the built-in JTAG programming and debug- ging facilities, we describe how to transfer the state from an FPGA to a simulator. We also show how the state can be transferred back from...... the simulator to FPGA. Given that the design runs in real-time on the FPGA, the end result is speed improvements of orders of magnitude over traditional pure software simulation....

  9. Widefield lensless endoscopy with a multicore fiber

    CERN Document Server

    Tsvirkun, Viktor; Bouwmans, Géraud; Katz, Ori; Andresen, Esben Ravn; Rigneault, Hervé

    2016-01-01

    We demonstrate pixelation-free real-time widefield endoscopic imaging through an aperiodic multicore fiber (MCF) without any distal opto-mechanical elements or proximal scanners. Exploiting the memory effect in MCFs the images in our system are directly obtained without any post-processing using a static wavefront correction obtained from a single calibration procedure. Our approach allows for video-rate 3D widefield imaging of incoherently illuminated objects with imaging speed not limited by the wavefront shaping device refresh rate.

  10. Hardware multiplier processor

    Science.gov (United States)

    Pierce, Paul E.

    1986-01-01

    A hardware processor is disclosed which in the described embodiment is a memory mapped multiplier processor that can operate in parallel with a 16 bit microcomputer. The multiplier processor decodes the address bus to receive specific instructions so that in one access it can write and automatically perform single or double precision multiplication involving a number written to it, with or without addition or subtraction with a previously stored number. It can also, on a single read command, automatically round and scale a previously stored number. The multiplier processor includes two concatenated 16 bit multiplier registers, two concatenated 16 bit multipliers, and four 16 bit product registers connected to an internal 16 bit data bus. A high level address decoder determines when the multiplier processor is being addressed, and first and second low level address decoders generate control signals. In addition, certain low order address lines are used to carry uncoded control signals. First and second control circuits coupled to the decoders generate further control signals and generate a plurality of clocking pulse trains in response to the decoded and address control signals.
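    A toy software model may help illustrate the idea of address-decoded operations; the address map and register names below are invented for illustration, not the ones disclosed in the patent:

    ```python
    # Toy model of a memory-mapped multiplier: the "address" written to selects
    # the operation, so a single bus write both delivers an operand and
    # triggers a multiply or multiply-accumulate against a stored number.
    class MappedMultiplier:
        WRITE_MUL = 0x00  # write operand: product = operand * stored
        WRITE_MAC = 0x04  # write operand: product += operand * stored

        def __init__(self, stored=0):
            self.stored = stored   # previously stored number
            self.product = 0       # product register

        def write(self, addr, value):
            # The decoded address acts as the (uncoded) control signal.
            if addr == self.WRITE_MUL:
                self.product = value * self.stored
            elif addr == self.WRITE_MAC:
                self.product += value * self.stored

        def read(self):
            return self.product

    m = MappedMultiplier(stored=3)
    m.write(MappedMultiplier.WRITE_MUL, 5)  # product = 5 * 3 = 15
    m.write(MappedMultiplier.WRITE_MAC, 2)  # product += 2 * 3 -> 21
    print(m.read())  # 21
    ```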

  11. Signal processor packaging design

    Science.gov (United States)

    McCarley, Paul L.; Phipps, Mickie A.

    1993-10-01

    The Signal Processor Packaging Design (SPPD) program was a technology development effort to demonstrate that a miniaturized, high throughput programmable processor could be fabricated to meet the stringent environment imposed by high speed kinetic energy guided interceptor and missile applications. This successful program culminated with the delivery of two very small processors, each about the size of a large pin grid array package. Rockwell International's Tactical Systems Division in Anaheim, California developed one of the processors, and the other was developed by Texas Instruments' (TI) Defense Systems and Electronics Group (DSEG) of Dallas, Texas. The SPPD program was sponsored by the Guided Interceptor Technology Branch of the Air Force Wright Laboratory's Armament Directorate (WL/MNSI) at Eglin AFB, Florida and funded by SDIO's Interceptor Technology Directorate (SDIO/TNC). These prototype processors were subjected to rigorous tests of their image processing capabilities, and both successfully demonstrated the ability to process 128 X 128 infrared images at a frame rate of over 100 Hz.

  12. The Distributed Network Processor: a novel off-chip and on-chip interconnection network architecture

    CERN Document Server

    Biagioni, Andrea; Lonardo, Alessandro; Paolucci, Pier Stanislao; Perra, Mersia; Rossetti, Davide; Sidore, Carlo; Simula, Francesco; Tosoratto, Laura; Vicini, Piero

    2012-01-01

    One of the most demanding challenges for the designers of parallel computing architectures is to deliver an efficient network infrastructure providing low latency, high bandwidth communications while preserving scalability. Besides off-chip communications between processors, recent multi-tile (i.e. multi-core) architectures face the challenge for an efficient on-chip interconnection network between processor's tiles. In this paper, we present a configurable and scalable architecture, based on our Distributed Network Processor (DNP) IP Library, targeting systems ranging from single MPSoCs to massive HPC platforms. The DNP provides inter-tile services for both on-chip and off-chip communications with a uniform RDMA style API, over a multi-dimensional direct network with a (possibly) hybrid topology.

  13. Data Parallel Bin-Based Indexing for Answering Queries on Multi-Core Architectures

    Energy Technology Data Exchange (ETDEWEB)

    Gosink, Luke; Wu, Kesheng; Bethel, E. Wes; Owens, John D.; Joy, Kenneth I.

    2009-06-02

    The multi-core trend in CPUs and general purpose graphics processing units (GPUs) offers new opportunities for the database community. The increase of cores at exponential rates is likely to affect virtually every server and client in the coming decade, and presents database management systems with a huge, compelling disruption that will radically change how processing is done. This paper presents a new parallel indexing data structure for answering queries that takes full advantage of the increasing thread-level parallelism emerging in multi-core architectures. In our approach, our Data Parallel Bin-based Index Strategy (DP-BIS) first bins the base data, and then partitions and stores the values in each bin as a separate, bin-based data cluster. In answering a query, the procedures for examining the bin numbers and the bin-based data clusters offer the maximum possible level of concurrency; each record is evaluated by a single thread and all threads are processed simultaneously in parallel. We implement and demonstrate the effectiveness of DP-BIS on two multi-core architectures: a multi-core CPU and a GPU. The concurrency afforded by DP-BIS allows us to fully utilize the thread-level parallelism provided by each architecture--for example, our GPU-based DP-BIS implementation simultaneously evaluates over 12,000 records with an equivalent number of concurrently executing threads. In comparing DP-BIS's performance across these architectures, we show that the GPU-based DP-BIS implementation requires significantly less computation time to answer a query than the CPU-based implementation. We also demonstrate in our analysis that DP-BIS provides better overall performance than the commonly utilized CPU and GPU-based projection index. Finally, due to data encoding, we show that DP-BIS accesses significantly smaller amounts of data than index strategies that operate solely on a column's base data; this smaller data footprint is critical for parallel processors
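    A much-simplified serial sketch of the bin-based strategy (bin boundaries and query values are illustrative): interior bins of a range query are answered from bin numbers alone, and only the boundary bins' data clusters need to be examined, which is what exposes the record-per-thread parallelism described above.

    ```python
    # Sketch of bin-based indexing: bin the base data once, then answer a range
    # query by counting interior bins wholesale and scanning only boundary bins.
    import bisect

    def build_index(values, edges):
        bins = [[] for _ in range(len(edges) + 1)]
        for v in values:
            bins[bisect.bisect_right(edges, v)].append(v)  # bin-based cluster
        return bins

    def range_count(bins, edges, lo, hi):
        lo_bin = bisect.bisect_right(edges, lo)
        hi_bin = bisect.bisect_right(edges, hi)
        # Interior bins: every record qualifies, no data access needed.
        count = sum(len(bins[b]) for b in range(lo_bin + 1, hi_bin))
        # Boundary bins: examine the stored values (candidate check); in the
        # parallel version each record here gets its own thread.
        boundary = (lo_bin, hi_bin) if lo_bin != hi_bin else (lo_bin,)
        for b in boundary:
            count += sum(1 for v in bins[b] if lo <= v <= hi)
        return count

    edges = [10, 20, 30]
    data = [5, 12, 15, 22, 28, 35]
    bins = build_index(data, edges)
    print(range_count(bins, edges, 12, 28))  # values in [12, 28]
    ```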

  14. The Berkeley Out-of-Order Machine (BOOM): An Industry-Competitive, Synthesizable, Parameterized RISC-V Processor

    Science.gov (United States)

    2015-06-13

    Illinois Verilog Model (IVM) is a 4-issue, out-of-order core designed to study transient faults [13]. The Santa Cruz Out-of-Order RISC Engine (SCOORE...ISCA-35, 2008. [8] B. H. Dwiel et al., "FPGA modeling of diverse superscalar processors," in Performance Analysis of Systems and Software (ISPASS), 2012...al., "Rationale for a 3D heterogeneous multi-core processor," in Computer Design (ICCD), 2013 IEEE 31st International Conference on, Oct 2013, pp. 154

  15. The Milstar Advanced Processor

    Science.gov (United States)

    Tjia, Khiem-Hian; Heely, Stephen D.; Morphet, John P.; Wirick, Kevin S.

    The Milstar Advanced Processor (MAP) is a 'drop-in' replacement for its predecessor which preserves existing interfaces with other Milstar satellite processors and minimizes the impact of such upgrading on already-developed application software. In addition to flight software development, and hardware development that involves the application of VHSIC technology to the electrical design, the MAP project is developing two sophisticated and similar test environments. High density RAM and ROM are employed by the MAP memory array. Attention is given to the fine-pitch VHSIC design techniques and lead designs used, as well as the role of TQM and concurrent engineering in the development of the MAP manufacturing process.

  16. Modules for Pipelined Mixed Radix FFT Processors

    Directory of Open Access Journals (Sweden)

    Anatolij Sergiyenko

    2016-01-01

    Full Text Available A set of soft IP cores for the Winograd r-point fast Fourier transform (FFT) is considered. The cores are designed by the method of spatial SDF mapping into hardware, which minimizes hardware volume at the cost of slowing the algorithm down by a factor of r. Their clock frequency is equal to the data sampling frequency. The cores are intended for high-speed pipelined FFT processors implemented in FPGAs.

  17. Cross talk analysis in multicore optical fibers by supermode theory.

    Science.gov (United States)

    Szostkiewicz, Lukasz; Napierala, Marek; Ziolowicz, Anna; Pytel, Anna; Tenderenda, Tadeusz; Nasilowski, Tomasz

    2016-08-15

    We discuss the theoretical aspects of core-to-core power transfer in multicore fibers relying on supermode theory. Based on a dual core fiber model, we investigate the consequences of this approach, such as the influence of initial excitation conditions on cross talk. Supermode interpretation of power coupling proves to be intuitive and thus may lead to new concepts of multicore fiber-based devices. As a conclusion, we propose a definition of a uniform cross talk parameter that describes multicore fiber design.
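    For the ideal symmetric dual-core case, the standard coupled-mode result (a textbook simplification, not the paper's full supermode treatment) gives the power transferred to the second core as P2(z) = sin²(κz); the coupling coefficient below is an arbitrary illustrative value:

    ```python
    # Coupled-mode sketch for an ideal symmetric dual-core fiber: with all
    # power launched into core 1, the fraction in core 2 after propagating a
    # distance z is sin^2(kappa * z), so crosstalk grows and returns
    # periodically with the beat length pi / kappa.
    import math

    def core_powers(kappa, z):
        """Power fractions (core 1, core 2) for unit power launched in core 1."""
        p2 = math.sin(kappa * z) ** 2
        return 1.0 - p2, p2

    kappa = 0.5                                 # coupling coefficient, 1/m (illustrative)
    p1, p2 = core_powers(kappa, math.pi / (2 * kappa))  # half a beat length
    print(round(p1, 6), round(p2, 6))           # complete transfer to core 2
    ```

    The dependence on launch conditions mentioned in the abstract shows up here as the choice of initial power split; exciting a single supermode instead would yield no beating at all.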

  18. Evaluation of a Multicore-Optimized Implementation for Tomographic Reconstruction

    Science.gov (United States)

    Agulleiro, Jose-Ignacio; Fernández, José Jesús

    2012-01-01

    Tomography allows elucidation of the three-dimensional structure of an object from a set of projection images. In life sciences, electron microscope tomography is providing invaluable information about the cell structure at a resolution of a few nanometres. Here, large images are required to combine wide fields of view with high resolution requirements. The computational complexity of the algorithms along with the large image size then turns tomographic reconstruction into a computationally demanding problem. Traditionally, high-performance computing techniques have been applied to cope with such demands on supercomputers, distributed systems and computer clusters. In the last few years, the trend has turned towards graphics processing units (GPUs). Here we present a detailed description and a thorough evaluation of an alternative approach that relies on exploitation of the power available in modern multicore computers. The combination of single-core code optimization, vector processing, multithreading and efficient disk I/O operations succeeds in providing fast tomographic reconstructions on standard computers. The approach turns out to be competitive with the fastest GPU-based solutions thus far. PMID:23139768

  19. Evaluation of a multicore-optimized implementation for tomographic reconstruction.

    Directory of Open Access Journals (Sweden)

    Jose-Ignacio Agulleiro

    Full Text Available Tomography allows elucidation of the three-dimensional structure of an object from a set of projection images. In life sciences, electron microscope tomography is providing invaluable information about the cell structure at a resolution of a few nanometres. Here, large images are required to combine wide fields of view with high resolution requirements. The computational complexity of the algorithms along with the large image size then turns tomographic reconstruction into a computationally demanding problem. Traditionally, high-performance computing techniques have been applied to cope with such demands on supercomputers, distributed systems and computer clusters. In the last few years, the trend has turned towards graphics processing units (GPUs). Here we present a detailed description and a thorough evaluation of an alternative approach that relies on exploitation of the power available in modern multicore computers. The combination of single-core code optimization, vector processing, multithreading and efficient disk I/O operations succeeds in providing fast tomographic reconstructions on standard computers. The approach turns out to be competitive with the fastest GPU-based solutions thus far.

  20. Interactive Digital Signal Processor

    Science.gov (United States)

    Mish, W. H.

    1985-01-01

    The Interactive Digital Signal Processor, IDSP, consists of a set of time series analysis "operators" based on various algorithms commonly used for digital signal analysis. Processing of a digital signal time series to extract information is usually achieved by applying a number of fairly standard operations. IDSP is also an excellent teaching tool for demonstrating the application of time series operators to artificially generated signals.
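    The operator idea can be illustrated by chaining two generic time-series operators on an artificially generated signal; the operators below are generic examples, not IDSP's actual operator set:

    ```python
    # Two standard time-series operators (mean removal and a moving average)
    # applied in sequence to a synthetic signal, in the spirit of composing
    # fairly standard operations to extract information.
    import math

    def detrend(x):
        """Remove the mean of the series."""
        mean = sum(x) / len(x)
        return [v - mean for v in x]

    def smooth(x, width=3):
        """Centered moving average with shrinking windows at the edges."""
        half = width // 2
        return [sum(x[max(0, i - half):i + half + 1]) /
                len(x[max(0, i - half):i + half + 1]) for i in range(len(x))]

    # Artificial test signal: a sine with a constant offset.
    signal = [math.sin(2 * math.pi * i / 16) + 1.0 for i in range(64)]
    d = detrend(signal)
    out = smooth(d)
    print(len(out), round(sum(d) / len(d), 6))  # same length; mean removed
    ```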

  1. Beyond processor sharing

    NARCIS (Netherlands)

    Aalto, S.; Ayesta, U.; Borst, S.C.; Misra, V.; Núñez Queija, R.

    2007-01-01

    While the (Egalitarian) Processor-Sharing (PS) discipline offers crucial insights in the performance of fair resource allocation mechanisms, it is inherently limited in analyzing and designing differentiated scheduling algorithms such as Weighted Fair Queueing and Weighted Round-Robin. The Discrimin

  2. ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors

    DEFF Research Database (Denmark)

    Liu, Weifeng; Vinter, Brian

    2014-01-01

    -heap, an efficient heap data structure that introduces an implicit bridge structure and properly apportions workloads to the two types of cores. We implement a batch k-selection algorithm and conduct experiments on simulated AMP environments composed of real CPUs and GPUs. In our experiments on two representative...... platforms, the ad-heap obtains up to 1.5x and 3.6x speedup over the optimal AMP scheduling method that executes the fastest d-heaps on the standalone CPUs and GPUs in parallel....
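    For background, a plain d-ary max-heap (the baseline d-heap that the ad-heap is compared against) can be sketched as follows; wider nodes trade a shallower tree for more comparisons per level, which is what maps well onto wide SIMD/GPU lanes. The choice d = 4 here is arbitrary:

    ```python
    # Minimal d-ary max-heap: parent of i is (i-1)//d, children of i are
    # d*i+1 .. d*i+d. Larger d means fewer levels but wider sift-down steps.
    class DHeap:
        def __init__(self, d=4):
            self.d, self.a = d, []

        def push(self, v):
            self.a.append(v)
            i = len(self.a) - 1
            while i > 0 and self.a[(i - 1) // self.d] < self.a[i]:
                p = (i - 1) // self.d
                self.a[i], self.a[p] = self.a[p], self.a[i]
                i = p

        def pop(self):
            a, d = self.a, self.d
            top, last = a[0], a.pop()
            if a:
                a[0] = last
                i = 0
                while True:
                    kids = range(d * i + 1, min(d * i + d + 1, len(a)))
                    big = max(kids, key=a.__getitem__, default=None)
                    if big is None or a[big] <= a[i]:
                        break
                    a[i], a[big] = a[big], a[i]
                    i = big
            return top

    h = DHeap(d=4)
    for v in [3, 1, 4, 1, 5, 9, 2, 6]:
        h.push(v)
    print([h.pop() for _ in range(4)])  # four largest values, descending
    ```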

  3. Application of Advanced Multi-Core Processor Technologies to Oceanographic Research

    Science.gov (United States)

    2014-09-30

    relatively few units manufactured, it is now common to see field-programmable gate arrays (FPGAs) added to designs, sometimes replacing general-purpose...Objective-C, Ruby, C#.NET, JavaScript, and C and are capable of running on Apple iOS, Macintosh OSX, Android, Microsoft Windows, Linux and any modern

  4. Domain Expert-Directed Program Optimizations for Accelerated Performance on Heterogeneous Multi-core Processors

    Science.gov (United States)

    2013-12-01

    modulo) operation as used in Ed25519 (A ← A + B·C mod 2^255 − 19) in the radix-2^51 redundant representation, and will complete the verification of the same...Taiwan University this year. He is a young up-and-coming expert on formal verification and languages, and his impressive résumé includes joint work
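    The field operation named in this fragment can be checked with plain Python integers (a naive sketch; verified fast implementations keep values as five 51-bit limbs throughout rather than converting back and forth, and the operand values below are arbitrary):

    ```python
    # Naive reference for A <- A + B*C mod 2^255 - 19, plus a round-trip
    # through the radix-2^51 limb representation (five 51-bit limbs cover
    # exactly 255 bits, so any reduced value decomposes losslessly).
    P = 2**255 - 19  # the Curve25519/Ed25519 field prime

    def mul_add_mod(a, b, c):
        return (a + b * c) % P

    def limbs_to_int(limbs):
        """Recombine radix-2^51 limbs into a field element."""
        return sum(l << (51 * i) for i, l in enumerate(limbs)) % P

    a, b, c = 7, 2**200 + 3, 2**100 + 5   # arbitrary operands
    r = mul_add_mod(a, b, c)
    # Decompose r into five 51-bit limbs and recombine:
    limbs = [(r >> (51 * i)) & ((1 << 51) - 1) for i in range(5)]
    print(limbs_to_int(limbs) == r)  # True: the limb round-trip is exact
    ```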

  5. Investigation of Large Scale Cortical Models on Clustered Multi-Core Processors

    Science.gov (United States)

    2013-02-01

    [Table: acceleration platforms (x86, Cell, GPGPU) examined for the HTM [22], Dean [25], Izhikevich [26], Hodgkin-Huxley [27], and Morris-Lecar [28] models.] ...examined. These are the Hodgkin-Huxley [27], Izhikevich [26], Wilson [29], and Morris-Lecar [28] models. The Hodgkin-Huxley model is considered to be...and inactivation of Na currents). Table 2 compares the computation properties of the four models. The Hodgkin-Huxley model utilizes exponential

  6. ATCA/AXIe compatible board for fast control and data acquisition in nuclear fusion experiments

    Energy Technology Data Exchange (ETDEWEB)

    Batista, A.J.N., E-mail: toquim@ipfn.ist.utl.pt [Associacao EURATOM/IST, Instituto de Plasmas e Fusao Nuclear - Laboratorio Associado, Instituto Superior Tecnico - Universidade Tecnica de Lisboa, Lisboa (Portugal); Leong, C.; Bexiga, V. [INESC-ID, Lisboa (Portugal); Rodrigues, A.P.; Combo, A.; Carvalho, B.B.; Fortunato, J.; Correia, M. [Associacao EURATOM/IST, Instituto de Plasmas e Fusao Nuclear - Laboratorio Associado, Instituto Superior Tecnico - Universidade Tecnica de Lisboa, Lisboa (Portugal); Teixeira, J.P.; Teixeira, I.C. [INESC-ID, Lisboa (Portugal); Sousa, J.; Goncalves, B.; Varandas, C.A.F. [Associacao EURATOM/IST, Instituto de Plasmas e Fusao Nuclear - Laboratorio Associado, Instituto Superior Tecnico - Universidade Tecnica de Lisboa, Lisboa (Portugal)

    2012-12-15

    Highlights: • High performance board for fast control and data acquisition. • Large IO channel number per board with galvanic isolation. • Optimized for high reliability and availability. • Targeted for nuclear fusion experiments with long duration discharges. • To be used on the ITER Fast Plant System Controller prototype. - Abstract: An in-house development of an Advanced Telecommunications Computing Architecture (ATCA) board for fast control and data acquisition, with Input/Output (IO) processing capability, is presented. The architecture, compatible with the ATCA (PICMG 3.4) and ATCA eXtensions for Instrumentation (AXIe) specifications, comprises a passive Rear Transition Module (RTM) for IO connectivity to ease hot-swap maintenance and simultaneously to increase cabling life cycle. The board complies with ITER Fast Plant System Controller (FPSC) guidelines for rear IO connectivity and redundancy, in order to provide high levels of reliability and availability to the control and data acquisition systems of nuclear fusion devices with long duration plasma discharges. Simultaneously digitized data from all Analog to Digital Converters (ADC) of the board can be filtered/decimated in a Field Programmable Gate Array (FPGA), decreasing data throughput, increasing resolution, and sent through Peripheral Component Interconnect (PCI) Express to multi-core processors in the ATCA shelf hub slots. Concurrently the multi-core processors can update the board Digital to Analog Converters (DAC) in real-time. Full-duplex point-to-point communication links between all FPGAs, of peer boards inside the shelf, allow the implementation of distributed algorithms and Multi-Input Multi-Output (MIMO) systems. Support for several timing and synchronization solutions is also provided. Some key features are onboard ADC or DAC modules with galvanic isolation.

  7. Dynamic Voltage-Frequency and Workload Joint Scaling Power Management for Energy Harvesting Multi-Core WSN Node SoC

    Science.gov (United States)

    Li, Xiangyu; Xie, Nijie; Tian, Xinyue

    2017-01-01

    This paper proposes a scheduling and power management solution for an energy harvesting heterogeneous multi-core WSN node SoC such that the system continues to operate perennially and uses the harvested energy efficiently. The solution consists of a heterogeneous multi-core system oriented task scheduling algorithm and a low-complexity dynamic workload scaling and configuration optimization algorithm suitable for lightweight platforms. Moreover, considering that the power consumption of most WSN applications exhibits data-dependent behavior, we introduce a branch handling mechanism into the solution as well. The experimental results show that the proposed algorithm can operate in real-time on a lightweight embedded processor (MSP430), that it enables a system to do more valuable work, and that it makes use of more than 99.9% of the power budget. PMID:28208730

  8. Dynamic Voltage-Frequency and Workload Joint Scaling Power Management for Energy Harvesting Multi-Core WSN Node SoC

    Directory of Open Access Journals (Sweden)

    Xiangyu Li

    2017-02-01

    Full Text Available This paper proposes a scheduling and power management solution for an energy harvesting heterogeneous multi-core WSN node SoC such that the system continues to operate perennially and uses the harvested energy efficiently. The solution consists of a heterogeneous multi-core system oriented task scheduling algorithm and a low-complexity dynamic workload scaling and configuration optimization algorithm suitable for lightweight platforms. Moreover, considering that the power consumption of most WSN applications exhibits data-dependent behavior, we introduce a branch handling mechanism into the solution as well. The experimental results show that the proposed algorithm can operate in real-time on a lightweight embedded processor (MSP430), that it enables a system to do more valuable work, and that it makes use of more than 99.9% of the power budget.

  9. Dynamic Voltage-Frequency and Workload Joint Scaling Power Management for Energy Harvesting Multi-Core WSN Node SoC.

    Science.gov (United States)

    Li, Xiangyu; Xie, Nijie; Tian, Xinyue

    2017-02-08

    This paper proposes a scheduling and power management solution for an energy harvesting heterogeneous multi-core WSN node SoC such that the system operates perennially and uses the harvested energy efficiently. The solution consists of a task scheduling algorithm oriented to heterogeneous multi-core systems and a low-complexity dynamic workload scaling and configuration optimization algorithm suitable for lightweight platforms. Moreover, considering that the power consumption of most WSN applications is data dependent, we also introduce a branch handling mechanism into the solution. The experimental results show that the proposed algorithm can operate in real time on a lightweight embedded processor (MSP430), and that it lets the system perform more useful work while using more than 99.9% of the power budget.
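The budget-driven workload scaling described above can be illustrated with a minimal sketch under an assumed task/energy model; the task names, numbers, and the greedy value-density policy here are illustrative assumptions, not the paper's algorithm:

```python
# Hypothetical sketch of workload scaling under a harvested-energy budget.
# The greedy value-density heuristic is cheap enough for an MSP430-class MCU;
# it is an illustrative stand-in, not the authors' scheduler.

def scale_workload(tasks, energy_budget):
    """Select tasks maximizing delivered value within the energy budget.

    tasks: list of (name, value, energy_cost) tuples.
    Returns (selected_names, energy_used).
    """
    # Rank tasks by value per unit of energy, highest first.
    ranked = sorted(tasks, key=lambda t: t[1] / t[2], reverse=True)
    selected, used = [], 0.0
    for name, value, cost in ranked:
        if used + cost <= energy_budget:
            selected.append(name)
            used += cost
    return selected, used

tasks = [("sense", 5.0, 1.0), ("compress", 3.0, 2.0), ("transmit", 8.0, 4.0)]
sel, used = scale_workload(tasks, energy_budget=5.5)
```

With these illustrative numbers, "sense" and "transmit" fit the budget while "compress" is deferred to a later harvesting cycle.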

  10. The Central Trigger Processor (CTP)

    CERN Multimedia

    Franchini, Matteo

    2016-01-01

    The Central Trigger Processor (CTP) receives trigger information from the calorimeter and muon trigger processors, as well as from other trigger sources. It makes the Level-1 Accept decision (L1A) based on a trigger menu.

  11. A Domain Specific DSP Processor

    OpenAIRE

    Tell, Eric

    2001-01-01

    This thesis describes the design of a domain specific DSP processor. The thesis is divided into two parts. The first part gives some theoretical background, describes the different steps of the design process (both for DSP processors in general and for this project) and motivates the design decisions made for this processor. The second part is a nearly complete design specification. The intended use of the processor is as a platform for hardware acceleration units. Support for this has howe...

  12. Processor register error correction management

    Science.gov (United States)

    Bose, Pradip; Cher, Chen-Yong; Gupta, Meeta S.

    2016-12-27

    Processor register protection management is disclosed. In embodiments, a method of processor register protection management can include determining a sensitive logical register for executable code generated by a compiler, generating an error-correction table identifying the sensitive logical register, and storing the error-correction table in a memory accessible by a processor. The processor can be configured to generate a duplicate register of the sensitive logical register identified by the error-correction table.
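The duplicate-register scheme in this abstract can be modeled with a small software sketch (class and method names here are hypothetical, not from the patent): writes to registers listed in the error-correction table are mirrored to a shadow copy, and reads of sensitive registers compare the two copies to detect corruption.

```python
# Illustrative model of compiler-identified sensitive-register duplication.
# Names and the detection policy are assumptions for illustration only.

class RegisterFile:
    def __init__(self, sensitive):
        self.sensitive = set(sensitive)   # plays the role of the error-correction table
        self.regs, self.shadow = {}, {}

    def write(self, reg, value):
        self.regs[reg] = value
        if reg in self.sensitive:
            self.shadow[reg] = value      # duplicate sensitive registers on write

    def read(self, reg):
        value = self.regs[reg]
        if reg in self.sensitive and self.shadow[reg] != value:
            # A real design could correct from the duplicate; here we just flag it.
            raise RuntimeError("soft error detected in " + reg)
        return value

rf = RegisterFile(sensitive=["r1"])
rf.write("r1", 42)
rf.regs["r1"] = 41          # simulate a bit flip in the primary copy
```

A subsequent `rf.read("r1")` raises, whereas non-sensitive registers carry no duplication overhead.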

  13. Dual-core Itanium Processor

    CERN Multimedia

    2006-01-01

    Intel’s first dual-core Itanium processor, code-named "Montecito", is a major release of Intel's Itanium 2 processor family, which implements the Intel Itanium architecture on a dual-core processor with two cores per die (integrated circuit). Itanium 2 is much more powerful than its predecessor, with lower power consumption and thermal dissipation.

  14. Automated Parallel Computing Tools for Multicore Machines and Clusters Project

    Data.gov (United States)

    National Aeronautics and Space Administration — We propose to improve productivity of high performance computing for applications on multicore computers and clusters. These machines built from one or more chips...

  15. A Compilation Method and Realization for Heterogeneous Multi-core Systems%一种异构多核系统的编译方法及实现

    Institute of Scientific and Technical Information of China (English)

    刘丹丹; 杨灿美; 倪素萍; 杜学亮

    2015-01-01

    Heterogeneous multi-core processors for accelerating computation in dedicated domains have made rapid progress in recent years. A heterogeneous multi-core processor integrates a number of processor cores with different architectures. Because of this heterogeneity, its programming differs greatly from that of traditional homogeneous multi-core processors: the programmer must write and compile program code separately for each processor-core architecture, which increases the difficulty of software development. Based on an analysis of heterogeneous multi-core processor architecture and the program execution model, we propose a compilation method for heterogeneous multi-core systems and present its implementation. It removes the difficulty of writing and compiling code separately, supports unified programming across heterogeneous cores, and shields the heterogeneity of the underlying hardware, providing convenience for upper-layer application developers.

  16. Single shot polarimetry imaging of multicore fiber

    CERN Document Server

    Sivankutty, Siddharth; Bouwmans, Géraud; Brown, Thomas G; Alonso, Miguel A; Rigneault, Hervé

    2016-01-01

    We report an experimental test of single-shot polarimetry applied to the problem of real-time monitoring of the output polarization states in each core within a multicore fiber bundle. The technique uses a stress-engineered optical element together with an analyzer and provides a point spread function whose shape unambiguously reveals the polarization state of a point source. We implement this technique to monitor, simultaneously and in real time, the output polarization states of up to 180 single-mode fiber cores in both conventional and polarization-maintaining bundles. We also demonstrate that the technique can be used to fully characterize the polarization properties of each individual fiber core, including eigen-polarization states, phase delay and diattenuation.

  17. Nonlinear Light Dynamics in Multi-Core Structures

    Science.gov (United States)

    2017-02-27

    …in space and time that can be generated in continuous-discrete optical media such as multi-core optical fiber or waveguide arrays; localisation dynamics in a continuous… gives another practical possibility to localize and control light both in space and time. The combination of these two features leads to a rich variety

  18. New Generation Processor Architecture Research

    Institute of Scientific and Technical Information of China (English)

    Chen Hongsong(陈红松); Hu Mingzeng; Ji Zhenzhou

    2003-01-01

    With the rapid development of microelectronics and hardware, ever faster microprocessors and new architectures must continue to be developed to meet tomorrow's computing needs. New processor microarchitectures are needed to push performance further and to use higher transistor counts effectively. At the same time, depending on the intended use, processors are optimized in different respects, such as high performance, low power consumption, small chip area and high security. SOC (System on Chip) and SCMP (Single Chip Multi-Processor) constitute the main processor system architectures.

  19. Stereoscopic Optical Signal Processor

    Science.gov (United States)

    Graig, Glenn D.

    1988-01-01

    The optical signal processor produces a two-dimensional cross correlation of images from a stereoscopic video camera in real time. The cross correlation is used to identify an object, determine distance, or measure movement. Left and right cameras modulate beams from a light source for correlation in a video detector. A switch in position 1 produces information about the range of the object viewed by the cameras. Position 2 gives information about movement. Position 3 helps to identify the object.

  20. Fast and Accurate Semiautomatic Segmentation of Individual Teeth from Dental CT Images.

    Science.gov (United States)

    Kang, Ho Chul; Choi, Chankyu; Shin, Juneseuk; Lee, Jeongjin; Shin, Yeong-Gil

    2015-01-01

    In this paper, we propose a fast and accurate semiautomatic method to effectively distinguish individual teeth from the tooth sockets in dental CT images. Threshold parameter values and tooth shapes are propagated to the neighboring slice, based on the teeth separated in reference images. After the propagation of threshold values and tooth shapes, the histogram of the current slice is analyzed. The individual teeth are automatically separated and segmented using seeded region growing. The newly generated separation information is then iteratively propagated to the neighboring slice. Our method was validated on ten sets of dental CT scans, and the results were compared with manually segmented results and conventional methods. The average error in absolute volume measurement was 2.29 ± 0.56%, which was more accurate than conventional methods. With multicore processors, the method ran 2.4 times faster than on a single-core processor. The proposed method identified individual teeth accurately, demonstrating that it can give dentists substantial assistance during dental surgery.
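The seeded region growing step this method relies on can be sketched as follows. This is a simplified 2-D, 4-connectivity version with a single fixed intensity threshold; the paper itself propagates thresholds and tooth shapes between CT slices, which this sketch does not model:

```python
# Minimal seeded region growing on a 2-D intensity grid (illustrative only).
from collections import deque

def region_grow(image, seed, threshold):
    """Grow a region from `seed`, absorbing 4-connected neighbors whose
    intensity is within `threshold` of the seed intensity."""
    h, w = len(image), len(image[0])
    base = image[seed[0]][seed[1]]
    region, queue = {seed}, deque([seed])
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if (0 <= ny < h and 0 <= nx < w and (ny, nx) not in region
                    and abs(image[ny][nx] - base) <= threshold):
                region.add((ny, nx))
                queue.append((ny, nx))
    return region

# Toy "slice": bright tooth-like blob in the top-left corner.
img = [[9, 9, 1],
       [9, 8, 1],
       [1, 1, 1]]
tooth = region_grow(img, seed=(0, 0), threshold=2)
```

Here the grown region captures the four bright pixels and excludes the dark background, which is the behavior the per-slice segmentation step depends on.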

  1. Fast and Accurate Semiautomatic Segmentation of Individual Teeth from Dental CT Images

    Directory of Open Access Journals (Sweden)

    Ho Chul Kang

    2015-01-01

    Full Text Available In this paper, we propose a fast and accurate semiautomatic method to effectively distinguish individual teeth from the tooth sockets in dental CT images. Threshold parameter values and tooth shapes are propagated to the neighboring slice, based on the teeth separated in reference images. After the propagation of threshold values and tooth shapes, the histogram of the current slice is analyzed. The individual teeth are automatically separated and segmented using seeded region growing. The newly generated separation information is then iteratively propagated to the neighboring slice. Our method was validated on ten sets of dental CT scans, and the results were compared with manually segmented results and conventional methods. The average error in absolute volume measurement was 2.29±0.56%, which was more accurate than conventional methods. With multicore processors, the method ran 2.4 times faster than on a single-core processor. The proposed method identified individual teeth accurately, demonstrating that it can give dentists substantial assistance during dental surgery.

  2. Mahali: Space Weather Monitoring Using Multicore Mobile Devices

    Science.gov (United States)

    Pankratius, V.; Lind, F. D.; Coster, A. J.; Erickson, P. J.; Semeter, J. L.

    2013-12-01

    Analysis of Total Electron Content (TEC) measurements derived from Global Positioning System (GPS) signals has led to revolutionary new data products for space weather monitoring and ionospheric research. However, the current sensor network is sparse, especially over the oceans and in regions like Africa and Siberia, and the full potential of dense, global, real-time TEC monitoring remains to be realized. The Mahali project will prototype a revolutionary architecture that uses mobile devices, such as phones and tablets, to form a global space weather monitoring network. Mahali exploits the existing GPS infrastructure - more specifically, delays in multi-frequency GPS signals observed at the ground - to acquire a vast set of global TEC projections, with the goal of imaging multi-scale variability in the global ionosphere at unprecedented spatial and temporal resolution. With connectivity available worldwide, mobile devices are excellent candidates to establish crowd sourced global relays that feed multi-frequency GPS sensor data into a cloud processing environment. Once the data is within the cloud, it is relatively straightforward to reconstruct the structure of the space environment, and its dynamic changes. This vision is made possible owing to advances in multicore technology that have transformed mobile devices into parallel computers with several processors on a chip. For example, local data can be pre-processed, validated with other sensors nearby, and aggregated when transmission is temporarily unavailable. Intelligent devices can also autonomously decide the most practical way of transmitting data with in any given context, e.g., over cell networks or Wifi, depending on availability, bandwidth, cost, energy usage, and other constraints. In the long run, Mahali facilitates data collection from remote locations such as deserts or on oceans. For example, mobile devices on ships could collect time-tagged measurements that are transmitted at a later point in

  3. Architecture-level performance/power tradeoff in network processor design

    Institute of Scientific and Technical Information of China (English)

    CHEN Hong-song; JI Zhen-zhou; HU Ming-zeng

    2007-01-01

    Network processors are used in core network nodes to flexibly process packet streams. As performance increases, the power consumption of network processors rises rapidly, and power and cooling become a bottleneck. Architecture-level power-conscious design must go beyond low-level circuit design; architectural power and performance tradeoffs should be considered together. Simulation is an efficient method for designing a modern network processor before fabricating the chip. To achieve a tradeoff between performance and power, a processor simulator is used to design the network processor architecture. Using the NetBench and CommBench benchmarks and the SimpleScalar processor simulator, the performance and power of a network processor are quantitatively evaluated. A new tradeoff evaluation metric is proposed to analyze network processor architecture. Based on the high-performance Intel IXP 2800 network processor configuration, optimized values for instruction fetch width and speed, instruction issue width, and instruction window size are analyzed and selected. Simulation results show that the tradeoff design method makes the use of the network processor more effective. The optimal key parameters of a network processor are important in architecture-level design and are meaningful for next-generation network processor design.

  4. LASIP-III, a generalized processor for standard interface files. [For creating binary files from BCD input data and printing binary file data in BCD format (devised for fast reactor physics codes)

    Energy Technology Data Exchange (ETDEWEB)

    Bosler, G.E.; O'Dell, R.D.; Resnik, W.M.

    1976-03-01

    The LASIP-III code was developed for processing Version III standard interface data files which have been specified by the Committee on Computer Code Coordination. This processor performs two distinct tasks, namely, transforming free-field format, BCD data into well-defined binary files and providing for printing and punching data in the binary files. While LASIP-III is exported as a complete free-standing code package, techniques are described for easily separating the processor into two modules, viz., one for creating the binary files and one for printing the files. The two modules can be separated into free-standing codes or they can be incorporated into other codes. Also, the LASIP-III code can be easily expanded for processing additional files, and procedures are described for such an expansion. 2 figures, 8 tables.

  5. Multi-core processing and scheduling performance in CMS

    Science.gov (United States)

    Hernández, J. M.; Evans, D.; Foulkes, S.

    2012-12-01

    Commodity hardware is going many-core. We might soon be unable to satisfy the per-core job memory needs in the current single-core processing model in High Energy Physics. In addition, an ever increasing number of independent and incoherent jobs running on the same physical hardware without sharing resources might significantly affect processing performance. It will be essential to utilize the multi-core architecture effectively. CMS has incorporated support for multi-core processing in the event processing framework and the workload management system. Multi-core processing jobs share common data in memory, such as the code libraries, detector geometry and conditions data, resulting in much lower memory usage than standard single-core independent jobs. Exploiting this new processing model requires a new model of computing resource allocation, departing from the standard single-core allocation for a job. The experiment job management system needs control over a larger quantum of resources, since multi-core aware jobs require the scheduling of multiple cores simultaneously. CMS is exploring the approach of using whole nodes as the unit in the workload management system, where all cores of a node are allocated to a multi-core job. Whole-node scheduling allows for optimization of data/workflow management (e.g. I/O caching, local merging), but efficient utilization of all scheduled cores is challenging. Dedicated whole-node queues have been set up at all Tier-1 centers for exploring multi-core processing workflows in CMS. We present an evaluation of the performance of scheduling and executing multi-core workflows in whole-node queues compared to the standard single-core processing workflows.

  6. Efficient provisioning for multi-core applications with LSF

    Science.gov (United States)

    Dal Pra, Stefano

    2015-12-01

    Tier-1 sites providing computing power for HEP experiments are usually tightly designed for high-throughput performance. This is pursued by reducing the variety of supported use cases and tuning for performance the most important ones, which so far have been single-core jobs. Moreover, the usual workload is saturation: each available core in the farm is in use and queued jobs wait for their turn to run. Enabling multi-core jobs thus requires dedicating a number of hosts on which to run them, and waiting for them to free the needed number of cores. This drain time introduces a loss of computing power driven by the number of unusable empty cores. As demand for multi-core capable resources has grown, a Task Force has been constituted in WLCG with the goal of defining a simple and efficient multi-core resource provisioning model. This paper details the work done at the INFN Tier-1 to enable multi-core support in the LSF batch system, with the intent of minimizing the average number of unused cores. The adopted strategy is to dedicate to multi-core jobs a dynamic set of nodes, whose size is mainly driven by the number of pending multi-core requests and the fair-share priority of the submitting user. The node status transition, from single- to multi-core and vice versa, is driven by a finite state machine implemented in a custom multi-core director script running in the cluster. After describing and motivating both the implementation and the details specific to the LSF batch system, performance results are reported. Factors with positive and negative impact on overall efficiency are discussed, and solutions to minimize the negative ones are proposed.
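The finite state machine driving such node transitions can be illustrated with a toy model. The state names and transition rules below are assumptions for illustration, not the INFN multi-core director implementation:

```python
# Toy FSM for switching a batch node between single-core and multi-core duty.
# A node first drains its running single-core jobs, then serves multi-core
# work, and returns to the single-core pool when multi-core demand vanishes.

class NodeFSM:
    def __init__(self):
        self.state = "single"

    def step(self, pending_mcore_jobs, running_score_jobs):
        if self.state == "single" and pending_mcore_jobs > 0:
            self.state = "draining"       # stop accepting new single-core jobs
        elif self.state == "draining" and running_score_jobs == 0:
            self.state = "multicore"      # whole node free: run multi-core work
        elif self.state == "multicore" and pending_mcore_jobs == 0:
            self.state = "single"         # demand gone: return node to the pool
        return self.state

node = NodeFSM()
```

Driving the FSM with a sequence of (pending multi-core, running single-core) observations walks the node through drain, multi-core service, and back, which is the cycle whose drain-time cost the paper tries to minimize.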

  7. 基于Octeon处理器三层转发的功能实现%Function Implementation of Layer 3 forwarding strategy based on the Octeon processor

    Institute of Scientific and Technical Information of China (English)

    孙倩

    2013-01-01

    To meet the demands that ever-increasing wireless network bandwidth places on data forwarding performance in a WLAN AC (access controller) system, the Octeon multi-core network processor is studied, and a processing architecture that separates network traffic between cores along data-plane and control-plane lines is proposed. The control plane generates the forwarding table and synchronizes it to the data plane. The data plane implements a fast IP address lookup method based on an optimized LC-Trie (level-compressed trie), effectively improving Layer 3 forwarding performance.
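The data-plane lookup rests on longest-prefix matching over a trie; a plain binary-trie sketch conveys the idea (an LC-Trie adds level and path compression on top of this structure; the prefixes and next-hop names below are illustrative):

```python
# Longest-prefix match with an uncompressed binary trie (illustrative sketch).
# Prefixes are given as bit strings; each node is a dict with optional
# children "0"/"1" and an optional "hop" entry marking a stored prefix.

class PrefixTrie:
    def __init__(self):
        self.root = {}

    def insert(self, prefix_bits, next_hop):
        node = self.root
        for b in prefix_bits:
            node = node.setdefault(b, {})
        node["hop"] = next_hop

    def lookup(self, addr_bits):
        node, best = self.root, None
        for b in addr_bits:
            if "hop" in node:
                best = node["hop"]        # remember longest match so far
            if b not in node:
                break
            node = node[b]
        else:
            best = node.get("hop", best)  # address fully consumed
        return best

fib = PrefixTrie()
fib.insert("10", "if0")       # shorter covering prefix
fib.insert("1010", "if1")     # longer, more specific prefix
```

A lookup walks at most one bit per level and falls back to the last covering prefix seen, which is exactly the behavior level compression accelerates.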

  8. Garbage Collection for Multicore NUMA Machines

    CERN Document Server

    Auhagen, Sven; Fluet, Matthew; Reppy, John

    2011-01-01

    Modern high-end machines feature multiple processor packages, each of which contains multiple independent cores and integrated memory controllers connected directly to dedicated physical RAM. These packages are connected via a shared bus, creating a system with a heterogeneous memory hierarchy. Since this shared bus has less bandwidth than the sum of the links to memory, aggregate memory bandwidth is higher when parallel threads all access memory local to their processor package than when they access memory attached to a remote package. This bandwidth limitation has traditionally limited the scalability of modern functional language implementations, which seldom scale well past 8 cores, even on small benchmarks. This work presents a garbage collector integrated with our strict, parallel functional language implementation, Manticore, and shows that it scales effectively on both a 48-core AMD Opteron machine and a 32-core Intel Xeon machine.

  9. Breadboard Signal Processor for Arraying DSN Antennas

    Science.gov (United States)

    Jongeling, Andre; Sigman, Elliott; Chandra, Kumar; Trinh, Joseph; Soriano, Melissa; Navarro, Robert; Rogstad, Stephen; Goodhart, Charles; Proctor, Robert; Jourdan, Michael

    2008-01-01

    A recently developed breadboard version of an advanced signal processor for arraying many antennas in NASA's Deep Space Network (DSN) can accept inputs in a 500-MHz-wide frequency band from six antennas. The next breadboard version is expected to accept inputs from 16 antennas, and a following developed version is expected to be designed according to an architecture that will be scalable to accept inputs from as many as 400 antennas. These and similar signal processors could also be used for combining multiple wide-band signals in non-DSN applications, including very-long-baseline interferometry and telecommunications. This signal processor performs the functions of a wide-band FX correlator and a beam-forming signal combiner. [The term "FX" signifies that the digital samples of two given signals are fast Fourier transformed (F), then the fast Fourier transforms of the two signals are multiplied (X) prior to accumulation.] In this processor, the signals from the various antennas are broken up into channels in the frequency domain (see figure). In each frequency channel, the data from each antenna are correlated against the data from each other antenna; this is done for all antenna baselines (that is, for all antenna pairs). The results of the correlations are used to obtain calibration data to align the antenna signals in both phase and delay. Data from the various antenna frequency channels are also combined and calibration corrections are applied. The frequency-domain data thus combined are then synthesized back to the time domain for passing on to a telemetry receiver.

  10. Distributed processor allocation for launching applications in a massively connected processors complex

    Science.gov (United States)

    Pedretti, Kevin

    2008-11-18

    A compute processor allocator architecture for allocating compute processors to run applications in a multiple processor computing apparatus is distributed among a subset of processors within the computing apparatus. Each processor of the subset includes a compute processor allocator. The compute processor allocators can share a common database of information pertinent to compute processor allocation. A communication path permits retrieval of information from the database independently of the compute processor allocators.

  11. muBLASTP: database-indexed protein sequence search on multicore CPUs.

    Science.gov (United States)

    Zhang, Jing; Misra, Sanchit; Wang, Hao; Feng, Wu-Chun

    2016-11-04

    The Basic Local Alignment Search Tool (BLAST) is a fundamental program in the life sciences that searches databases for sequences that are most similar to a query sequence. Currently, the BLAST algorithm uses a query-indexed approach. Although many approaches suggest that sequence search with a database index can achieve much higher throughput (e.g., BLAT, SSAHA, and CAFE), they cannot deliver the same level of sensitivity as query-indexed BLAST, i.e., NCBI BLAST, or they can only support nucleotide sequence search, e.g., MegaBLAST. Because of the different challenges and characteristics of query indexing and database indexing, existing techniques for query-indexed search cannot be applied directly to database-indexed search. muBLASTP, a novel database-indexed BLAST for protein sequence search, returns hits identical to those of NCBI BLAST. On Intel Haswell multicore CPUs, for a single query, single-threaded muBLASTP achieves up to a 4.41-fold speedup for the alignment stages, and up to a 1.75-fold end-to-end speedup over single-threaded NCBI BLAST. For a batch of queries, multithreaded muBLASTP achieves up to a 5.7-fold speedup for the alignment stages, and up to a 4.56-fold end-to-end speedup over multithreaded NCBI BLAST. With a newly designed index structure for the protein database and associated optimizations in the BLASTP algorithm, we re-factored BLASTP for modern multicore processors, achieving much higher throughput with an acceptable memory footprint for the database index.
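The core idea of database indexing can be sketched with a toy k-mer index: the database is scanned once to map every k-length word to its positions, after which any query finds its seed hits by table lookup instead of rescanning the database. The word size and sequences below are illustrative, and muBLASTP's actual index and alignment stages are far more elaborate:

```python
# Toy database index for seed finding (illustrative sketch, not muBLASTP).

def build_index(db_seqs, k=3):
    """Map each k-mer in the database to a list of (seq_id, position)."""
    index = {}
    for sid, seq in enumerate(db_seqs):
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], []).append((sid, i))
    return index

def seed_hits(index, query, k=3):
    """Return (seq_id, query_pos, db_pos) for every exact k-mer match."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for sid, dpos in index.get(query[qpos:qpos + k], []):
            hits.append((sid, qpos, dpos))
    return hits

db = ["MKVLAW", "GGMKVL"]     # illustrative "protein" sequences
idx = build_index(db)
hits = seed_hits(idx, "AMKVG")
```

The index is built once and amortized over all queries, which is where the batch-query throughput gains of database indexing come from.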

  12. A Systolic Array RLS Processor

    OpenAIRE

    Asai, T.; Matsumoto, T.

    2000-01-01

    This paper presents the outline of the systolic array recursive least-squares (RLS) processor prototyped primarily with the aim of broadband mobile communication applications. To execute the RLS algorithm effectively, this processor uses an orthogonal triangularization technique known in matrix algebra as QR decomposition for parallel pipelined processing. The processor board comprises 19 application-specific integrated circuit chips, each with approximately one million gates. Thirty-two bit ...

  13. AMD's 64-bit Opteron processor

    CERN Document Server

    CERN. Geneva

    2003-01-01

    This talk concentrates on issues that relate to obtaining peak performance from the Opteron processor. Compiler options, memory layout, MPI issues in multi-processor configurations and the use of a NUMA kernel will be covered. A discussion of recent benchmarking projects and results will also be included. Biographies: David Rich directs AMD's efforts in high performance computing and also in the use of Opteron processors...

  14. Emerging Trends in Embedded Processors

    Directory of Open Access Journals (Sweden)

    Gurvinder Singh

    2014-05-01

    Full Text Available An embedded processor is simply a microprocessor that has been "embedded" into a device. Embedded systems are an important part of human life. For illustration, one cannot visualize life without a mobile phone for personal communication. Embedded systems are used in many places, such as healthcare, automotive, daily life, and different offices and industries. Embedded processors open a new research area in the field of hardware design.

  15. High core count single-mode multicore fiber for dense space division multiplexing

    DEFF Research Database (Denmark)

    Aikawa, K.; Sasaki, Y.; Amma, Y.;

    2016-01-01

    Multicore fibers and few-mode fibers have the potential to realize dense space-division multiplexing systems. Several dense space-division multiplexing transmission experiments over multicore fibers and few-mode fibers have been demonstrated so far. Multicore fibers, including recent resul...

  16. Effective particle magnetic moment of multi-core particles

    Energy Technology Data Exchange (ETDEWEB)

    Ahrentorp, Fredrik; Astalan, Andrea; Blomgren, Jakob; Jonasson, Christian [Acreo Swedish ICT AB, Arvid Hedvalls backe 4, SE-411 33 Göteborg (Sweden); Wetterskog, Erik; Svedlindh, Peter [Department of Engineering Sciences, Uppsala University, Box 534, SE-751 21 Uppsala (Sweden); Lak, Aidin; Ludwig, Frank [Institute of Electrical Measurement and Fundamental Electrical Engineering, TU Braunschweig, D‐38106 Braunschweig Germany (Germany); IJzendoorn, Leo J. van [Department of Applied Physics, Eindhoven University of Technology, 5600 MB Eindhoven (Netherlands); Westphal, Fritz; Grüttner, Cordula [Micromod Partikeltechnologie GmbH, D ‐18119 Rostock (Germany); Gehrke, Nicole [nanoPET Pharma GmbH, D ‐10115 Berlin Germany (Germany); Gustafsson, Stefan; Olsson, Eva [Department of Applied Physics, Chalmers University of Technology, SE-412 96 Göteborg (Sweden); Johansson, Christer, E-mail: christer.johansson@acreo.se [Acreo Swedish ICT AB, Arvid Hedvalls backe 4, SE-411 33 Göteborg (Sweden)

    2015-04-15

    In this study we investigate the magnetic behavior of magnetic multi-core particles and the differences in the magnetic properties of multi-core and single-core nanoparticles and correlate the results with the nanostructure of the different particles as determined from transmission electron microscopy (TEM). We also investigate how the effective particle magnetic moment is coupled to the individual moments of the single-domain nanocrystals by using different measurement techniques: DC magnetometry, AC susceptometry, dynamic light scattering and TEM. We have studied two magnetic multi-core particle systems – BNF Starch from Micromod with a median particle diameter of 100 nm and FeraSpin R from nanoPET with a median particle diameter of 70 nm – and one single-core particle system – SHP25 from Ocean NanoTech with a median particle core diameter of 25 nm.

  17. Performance implications from sizing a VM on multi-core systems: A data analytic application's view

    Energy Technology Data Exchange (ETDEWEB)

    Lim, Seung-Hwan [ORNL; Horey, James L [ORNL; Begoli, Edmon [ORNL; Yao, Yanjun [University of Tennessee, Knoxville (UTK); Cao, Qing [University of Tennessee, Knoxville (UTK)

    2013-01-01

    In this paper, we present a quantitative performance analysis of data analytics applications running on multi-core virtual machines. Such environments form the core of cloud computing. In addition, data analytics applications, such as Cassandra and Hadoop, are becoming increasingly popular on cloud computing platforms. This convergence necessitates a better understanding of the performance and cost implications of such hybrid systems. For example, the very first step in hosting applications in virtualized environments requires the user to configure the number of virtual processors and the size of memory. To understand the performance implications of this step, we benchmarked three Yahoo Cloud Serving Benchmark (YCSB) workloads in a virtualized multi-core environment. Our measurements indicate that the performance of Cassandra on YCSB workloads does not depend heavily on the processing capacity of a system, while the size of the data set relative to allocated memory is critical to performance. We also identified a strong relationship between the running time of workloads and various hardware events (last-level cache loads, misses, and CPU migrations). From this analysis, we provide several suggestions to improve the performance of data analytics applications running in cloud computing environments.

  18. ParaStream: A parallel streaming Delaunay triangulation algorithm for LiDAR points on multicore architectures

    Science.gov (United States)

    Wu, Huayi; Guan, Xuefeng; Gong, Jianya

    2011-09-01

    This paper presents a robust parallel Delaunay triangulation algorithm called ParaStream for processing billions of points from nonoverlapped block LiDAR files. The algorithm targets ubiquitous multicore architectures. ParaStream integrates streaming computation with a traditional divide-and-conquer scheme, in which additional erase steps are implemented to reduce the runtime memory footprint. Furthermore, a kd-tree-based dynamic schedule strategy is also proposed to distribute triangulation and merging work onto the processor cores for improved load balance. ParaStream exploits most of the computing power of multicore platforms through parallel computing, demonstrating qualities of high data throughput as well as a low memory footprint. Experiments on a 2-Way-Quad-Core Intel Xeon platform show that ParaStream can triangulate approximately one billion LiDAR points (16.4 GB) in about 16 min with only 600 MB physical memory. The total speedup (including I/O time) is about 6.62 with 8 concurrent threads.
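The kd-tree-based schedule strategy, splitting the point set at the median along alternating axes so that each core receives a spatially compact and equally sized block, can be sketched as follows (a simplified illustration of the partitioning idea only, not ParaStream's triangulation or merging):

```python
# Recursive median split along alternating axes: one block per worker,
# each block spatially compact and of near-equal size (illustrative sketch).

def kd_partition(points, n_blocks, axis=0):
    """Split `points` (list of (x, y) tuples) into `n_blocks` blocks."""
    if n_blocks <= 1 or len(points) <= 1:
        return [points]
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    nxt = 1 - axis                       # alternate the split axis
    return (kd_partition(pts[:mid], n_blocks // 2, nxt) +
            kd_partition(pts[mid:], n_blocks - n_blocks // 2, nxt))

pts = [(x, y) for x in range(4) for y in range(4)]   # 16 sample points
blocks = kd_partition(pts, 4)                        # e.g. one block per core
```

Each block could then be triangulated by a separate thread, with merging along the shared split boundaries, which is the load-balancing role the kd-tree plays in the paper.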

  19. GoCxx: a tool to easily leverage C++ legacy code for multicore-friendly Go libraries and frameworks

    Science.gov (United States)

    Binet, Sébastien

    2012-12-01

    Current HENP libraries and frameworks were written before multicore systems became widely deployed and used. From this environment, a ‘single-thread’ processing model naturally emerged, but the implicit assumptions it encouraged are greatly impairing our ability to scale in a multicore/manycore world. Writing scalable code in C++ for multicore architectures, while doable, is no panacea. Sure, C++11 will improve on the current situation (by standardizing on std::thread, introducing lambda functions and defining a memory model), but it will do so at the price of further complicating an already quite sophisticated language. This level of sophistication has probably already strongly motivated analysis groups to migrate to CPython, hoping for its current limitations with respect to multicore scalability to be either lifted (Global Interpreter Lock removal) or overcome by the advent of a new Python VM better tailored for this kind of environment (PyPy, Jython, …). Could HENP migrate to a language with none of the deficiencies of C++ (build time, deployment, low-level tools for concurrency) and with the fast turn-around time, simplicity and ease of coding of Python? This paper will try to make the case for Go - a young open-source language with built-in facilities to easily express and expose concurrency - being such a language. We introduce GoCxx, a tool leveraging gcc-xml's output to automate the tedious work of creating Go wrappers for foreign languages, a critical task for any language wishing to leverage legacy and field-tested code. We conclude with the first results of applying GoCxx to real C++ code.

  20. Automatic Functionality Assignment to AUTOSAR Multicore Distributed Architectures

    DEFF Research Database (Denmark)

    Maticu, Florin; Pop, Paul; Axbrink, Christian

    2016-01-01

    of better performance, cost, size, fault-tolerance and power consumption. In this paper we present an approach for the automatic software functionality assignment to multicore distributed architectures. We consider that the systems use the AUTomotive Open System ARchitecture (AUTOSAR). The functionality...... is modeled as a set of software components composed of subtasks, called runnables, in AUTOSAR terminology. We have proposed a Simulated Annealing metaheuristic optimization that decides: the (i) mapping of software components to multicore ECUs, (ii) the assignment of runnables to the ECU cores, (iii...

  1. Heterogeneous Multicore Parallel Programming for Graphics Processing Units

    Directory of Open Access Journals (Sweden)

    Francois Bodin

    2009-01-01

    Full Text Available Hybrid parallel multicore architectures based on graphics processing units (GPUs) can provide tremendous computing power. Current NVIDIA and AMD Graphics Product Group hardware display a peak performance of hundreds of gigaflops. However, exploiting GPUs from existing applications is a difficult task that requires non-portable rewriting of the code. In this paper, we present HMPP, a Heterogeneous Multicore Parallel Programming workbench with compilers, developed by CAPS entreprise, that allows the integration of heterogeneous hardware accelerators in an unintrusive manner while preserving legacy code.

  2. Fast user-level inter-thread communication, synchronisation

    OpenAIRE

    Kevin J. Vella; Lin, Li; 1st Workshop in Information and Communication Technology (WICT 2008)

    2008-01-01

    This project concerns the design and implementation of user-level inter-thread synchronisation and communication algorithms. A number of these algorithms have been implemented on the SMASH user-level thread scheduler for symmetric multiprocessors and multicore processors. All inter-thread communication primitives considered have two implementations: the lock-based implementation and the lock-free implementation. The performance of concurrent programs using these user-level primitives are meas...
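The lock-based style of primitive this project compares can be sketched in Python; this is a hypothetical illustration of a bounded inter-thread channel, not the SMASH scheduler's code, and the lock-free variants (which rely on atomic compare-and-swap instructions) cannot be expressed in pure Python.

```python
import threading
from collections import deque

class Channel:
    """A minimal lock-based inter-thread communication channel:
    a bounded buffer guarded by one lock and two condition variables."""
    def __init__(self, capacity=16):
        self._buf = deque()
        self._cap = capacity
        self._lock = threading.Lock()
        self._not_empty = threading.Condition(self._lock)
        self._not_full = threading.Condition(self._lock)

    def send(self, item):
        with self._not_full:
            while len(self._buf) >= self._cap:   # block while buffer is full
                self._not_full.wait()
            self._buf.append(item)
            self._not_empty.notify()             # wake one waiting receiver

    def recv(self):
        with self._not_empty:
            while not self._buf:                 # block while buffer is empty
                self._not_empty.wait()
            item = self._buf.popleft()
            self._not_full.notify()              # wake one waiting sender
            return item
```

A lock-free version would replace the lock/condition pair with atomic operations on the buffer indices; measuring the two against each other is exactly the kind of comparison the abstract describes.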

  3. Design of Fast Charging for Lithium Batteries Based on an STM32 Processor

    Institute of Scientific and Technical Information of China (English)

    张洪涛; 彭潇丽

    2012-01-01

    This paper proposes a design in which an STM32 processor-based intelligent management system and a PFC (power factor correction) charging circuit are used to charge lithium-ion batteries. The MATLAB dynamic simulation tool was used to simulate the PFC control technique, and the simulation achieved the expected results. C-language programming was used to implement the intelligent management functions of the STM32 processor. The results show that the design is feasible.

  4. Embedded Processor Oriented Compiler Infrastructure

    Directory of Open Access Journals (Sweden)

    DJUKIC, M.

    2014-08-01

    Full Text Available In recent years, research on special compiler techniques and algorithms for embedded processors has broadened the knowledge of how to achieve better compiler performance for irregular processor architectures. However, industrial-strength compilers, besides the ability to generate efficient code, must also be robust, understandable, maintainable, and extensible. This raises the need for a compiler infrastructure that provides means for convenient implementation of embedded-processor-oriented compiler techniques. The Cirrus Logic Coyote 32 DSP is an example that shows how traditional compiler infrastructure is not able to cope with the problem. That is why a new compiler infrastructure was developed for this processor, based on research in the field of embedded system software tools and experience in the development of industrial-strength compilers. The new infrastructure is described in this paper. Compiler-generated code quality is compared with code generated by the previous compiler for the same processor architecture.

  5. A Reconfigurable Arithmetic Processor

    Science.gov (United States)

    1989-04-01

    FRIEDRICH HEGEL, Philosophy of History (1832). 1.1 The I/O Bandwidth Problem: The problem in building fast arithmetic chips is not building fast arithmetic... [remainder of the scanned abstract is illegible OCR residue]

  6. Communication Efficient Multi-processor FFT

    Science.gov (United States)

    Lennart Johnsson, S.; Jacquemin, Michel; Krawitz, Robert L.

    1992-10-01

    Computing the fast Fourier transform on a distributed memory architecture by a direct pipelined radix-2, a bi-section, or a multisection algorithm, all yield the same communications requirement, if communication for all FFT stages can be performed concurrently, the input data is in normal order, and the data allocation is consecutive. With a cyclic data allocation, or bit-reversed input data and a consecutive allocation, multi-sectioning offers a reduced communications requirement by approximately a factor of two. For a consecutive data allocation and normal input order, a decimation-in-time FFT requires that P/N + d - 2 twiddle factors be stored for P elements distributed evenly over N processors, with the axis that is subject to transformation distributed over 2^d processors. No communication of twiddle factors is required. The same storage requirements hold for a decimation-in-frequency FFT, bit-reversed input order, and consecutive data allocation. The opposite combination of FFT type and data ordering requires a factor of log2 N more storage for N processors. The peak performance for a Connection Machine system CM-200 implementation is 12.9 Gflops/s in 32-bit precision, and 10.7 Gflops/s in 64-bit precision for unordered transforms local to each processor. The corresponding execution rates for ordered transforms are 11.1 Gflops/s and 8.5 Gflops/s, respectively. For distributed one- and two-dimensional transforms the peak performance for unordered transforms exceeds 5 Gflops/s in 32-bit precision and 3 Gflops/s in 64-bit precision. Three-dimensional transforms execute at a slightly lower rate. Distributed ordered transforms execute at a rate of about 1/2 to 2/3 of the unordered transforms.

  7. Parallel processing architecture for H.264 deblocking filter on multi-core platforms

    Science.gov (United States)

    Prasad, Durga P.; Sonachalam, Sekar; Kunchamwar, Mangesh K.; Gunupudi, Nageswara Rao

    2012-03-01

    Massively parallel computing (multi-core) chips offer outstanding new solutions that satisfy the increasing demand for high-resolution, high-quality video compression technologies such as H.264. Such solutions provide not only exceptional quality but also efficiency, low power, and low latency, previously unattainable in software-based designs. While custom hardware and Application Specific Integrated Circuit (ASIC) technologies may achieve low-latency, low-power, real-time performance in some consumer devices, many applications require a flexible and scalable software-defined solution. The deblocking filter in an H.264 encoder/decoder poses difficult implementation challenges because of heavy data dependencies and the conditional nature of the computations. Deblocking filter implementations tend to be fixed and difficult to reconfigure for different needs. The ability to scale up for higher quality requirements, such as 10-bit pixel depth or a 4:2:2 chroma format, often reduces the throughput of a parallel architecture designed for a lower feature set. A scalable architecture for deblocking filtering, created with a massively parallel processor based solution, means that the same encoder or decoder can be deployed in a variety of applications, at different video resolutions, for different power requirements, and at higher bit depths and richer chroma subsampling patterns such as YUV 4:2:2 or 4:4:4. Low-power, software-defined encoders/decoders may be implemented using a massively parallel processor array, like that found in HyperX technology, with 100 or more cores and distributed memory. The large number of processor elements allows the silicon device to operate more efficiently than conventional DSP or CPU technology. This software programming model for massively parallel processors offers a flexible implementation and a power efficiency close to that of ASIC solutions.
This work describes a scalable parallel architecture for an H.264 compliant deblocking

  8. A programmable systolic trigger processor for FERA-bus data

    Science.gov (United States)

    Appelquist, G.; Hovander, B.; Sellden, B.; Bohm, C.

    1992-09-01

    A generic CAMAC-based trigger processor module for fast processing of large amounts of Analog-to-Digital Converter (ADC) data was designed. This module was realized using complex programmable gate arrays. The gate arrays were connected to memories and multipliers in such a way that different gate array configurations can cover a wide range of module applications. Using this module, it is possible to construct complex trigger processors. The module uses both the fast ECL FERA bus and the CAMAC bus for inputs and outputs. The latter is used for setup and control but may also be used for data output. Large numbers of ADCs can be served by a hierarchical arrangement of trigger processor modules which process ADC data with pipeline arithmetic and produce the final result at the apex of the pyramid. The trigger decision is transmitted to the data acquisition system via a logic signal, while numeric results may be extracted by the CAMAC controller. The trigger processor was developed for the proposed neutral particle search and was designed to serve as a second-level trigger processor. It was required to correct all ADC raw data for efficiency and pedestal, calculate the total calorimeter energy, obtain the optimal time-of-flight data, and calculate the particle mass. A suitable mass cut would then deliver the trigger decision.

  9. Co-processor implementation for fast face detection in a system-on-chip

    Institute of Scientific and Technical Information of China (English)

    焦继业; 穆荣; 郝跃

    2011-01-01

    An improved co-processor architecture suitable for parallel hardware implementation is proposed to perform multi-stage feature classification based on the AdaBoost algorithm. The co-processor consists of a fast integral-image access module, a module for computing the Haar-like feature values of the classifier stages, a DMA data-transfer module, and a co-processor interface module. The modules use pipelining and FIFO buffers to process data in parallel, accelerating the iterative face-detection process in a face-detection SoC. Embedding the co-processor in a real face-detection SoC adds only a small amount of area while significantly improving the speed of face detection. The face-detection SoC was verified on a CYCLONE-II EP2C70 FPGA. Experimental results show that, at a system operating frequency of 70 MHz, faces can be detected in color QVGA camera video at 10 frames per second.
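The constant-time Haar feature evaluation that the integral-image module accelerates can be sketched as follows; this is a generic illustration of the summed-area-table technique in software, not the SoC's hardware design.

```python
def integral_image(img):
    """Summed-area table: ii[y][x] = sum of img[0..y-1][0..x-1],
    padded with an extra zero row/column so lookups need no bounds checks."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum over the w x h rectangle at (x, y) using four table lookups --
    the constant-time primitive behind Haar-like feature evaluation."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]

def haar_two_rect(ii, x, y, w, h):
    """A two-rectangle Haar-like feature: left half minus right half."""
    return rect_sum(ii, x, y, w // 2, h) - rect_sum(ii, x + w // 2, y, w // 2, h)
```

Because every feature reduces to a handful of table reads, the classifier stages pipeline naturally, which is what the FIFO-buffered modules in the SoC exploit.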

  10. Automatic generation of application specific FPGA multicore accelerators

    DEFF Research Database (Denmark)

    Hindborg, Andreas Erik; Schleuniger, Pascal; Jensen, Nicklas Bo

    2014-01-01

    High performance computing systems make increasing use of hardware accelerators to improve performance and power properties. For large high-performance FPGAs to be successfully integrated in such computing systems, methods to raise the abstraction level of FPGA programming are required...... to identify optimal performance-energy trade-off points for a multicore-based FPGA accelerator.

  11. Improved Multi-Core Nested Depth-First Search

    NARCIS (Netherlands)

    Evangelista, Sami; Laarman, Alfons; Petrucci, Laure; Pol, van de Jaco; Ramesh, S.

    2012-01-01

    This paper presents CNDFS, a tight integration of two earlier multi-core nested depth-first search (NDFS) algorithms for LTL model checking. CNDFS combines the different strengths and avoids some weaknesses of its predecessors. We compare CNDFS to an earlier ad-hoc combination of those two algorithm

  12. Multi-core and/or symbolic model checking

    NARCIS (Netherlands)

    Dijk, van Tom; Laarman, Alfons; Pol, van de Jaco; Luettgen, G.; Merz, S.

    2012-01-01

    We review our progress in high-performance model checking. Our multi-core model checker is based on a scalable hash-table design and parallel random-walk traversal. Our symbolic model checker is based on Multiway Decision Diagrams and the saturation strategy. The LTSmin tool is based on the PINS arc

  13. Quantitative Application Data Flow Characterization for Heterogeneous Multicore Architectures

    NARCIS (Netherlands)

    Ostadzadeh, S.A.

    2012-01-01

    Recent trends show a steady increase in the utilization of heterogeneous multicore architectures in order to address the ever-growing need for computing performance. These emerging architectures pose specific challenges with regard to their programmability. In addition, they require efficient applic

  14. Runtime Support for Heterogeneous Multi-core Systems

    NARCIS (Netherlands)

    Sabeghi, M.

    2011-01-01

    Multi-core processing platforms are one of the major steps forward in offering high-performance computing platforms. The idea is to increase the performance by employing more processing elements to perform a job. However, this creates a challenge for both hardware developers who build such systems a

  15. Energy-efficient scheduling in multi-core servers

    NARCIS (Netherlands)

    Asghari, N.M.; Mandjes, M.; Walid, A.

    2014-01-01

    In this paper we develop techniques for analyzing and optimizing energy management in multi-core servers with speed scaling capabilities. Our framework incorporates the processor’s dynamic power, but it also accounts for other intricate and relevant power features such as the static (leakage) power

  16. Improved Multi-Core Nested Depth-First Search

    NARCIS (Netherlands)

    Evangelista, Sami; Laarman, Alfons; Petrucci, Laure; van de Pol, Jan Cornelis; Ramesh, S.

    2012-01-01

    This paper presents CNDFS, a tight integration of two earlier multi-core nested depth-first search (NDFS) algorithms for LTL model checking. CNDFS combines the different strengths and avoids some weaknesses of its predecessors. We compare CNDFS to an earlier ad-hoc combination of those two

  17. "Photonic lantern" spectral filters in multi-core fiber.

    Science.gov (United States)

    Birks, T A; Mangan, B J; Díez, A; Cruz, J L; Murphy, D F

    2012-06-18

    Fiber Bragg gratings are written across all 120 single-mode cores of a multi-core optical fiber. The fiber is interfaced to multimode ports by tapering it within a depressed-index glass jacket. The result is a compact multimode "photonic lantern" filter with astrophotonic applications. The tapered structure is also an effective mode scrambler.

  18. Multi-Core BDD Operations for Symbolic Reachability

    NARCIS (Netherlands)

    van Dijk, Tom; Laarman, Alfons; van de Pol, Jan Cornelis; Heljanko, K.; Knottenbelt, W.J.

    2012-01-01

    This paper presents scalable parallel BDD operations for modern multi-core hardware. We aim at increasing the performance of reachability analysis in the context of model checking. Existing approaches focus on performing multiple independent BDD operations rather than parallelizing the BDD

  19. Cache-based memory copy hardware accelerator for multicore systems

    NARCIS (Netherlands)

    Duarte, F.; Wong, S.

    2010-01-01

    In this paper, we present a new architecture of the cache-based memory copy hardware accelerator in a multicore system supporting message passing. The accelerator is able to accelerate memory data movements, in particular memory copies. We perform an analytical analysis based on open-queuing theory

  20. The associative memory system for the FTK processor at ATLAS

    CERN Document Server

    Magalotti, D; The ATLAS collaboration; Donati, S; Luciano, P; Piendibene, M; Giannetti, P; Lanza, A; Verzellesi, G; Sakellariou, Andreas; Billereau, W; Combe, J M

    2014-01-01

    In high energy physics experiments, the most interesting processes are very rare and hidden in an extremely large level of background. As the experiment complexity, accelerator backgrounds, and instantaneous luminosity increase, more effective and accurate data selection techniques are needed. The Fast TracKer processor (FTK) is a real time tracking processor designed for the ATLAS trigger upgrade. The FTK core is the Associative Memory system. It provides massive computing power to minimize the processing time of complex tracking algorithms executed online. This paper reports on the results and performance of a new prototype of Associative Memory system.

  1. Efficient Support for Matrix Computations on Heterogeneous Multi-core and Multi-GPU Architectures

    Energy Technology Data Exchange (ETDEWEB)

    Dong, Fengguang [Univ. of Tennessee, Knoxville, TN (United States); Tomov, Stanimire [Univ. of Tennessee, Knoxville, TN (United States); Dongarra, Jack [Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)

    2011-06-01

    We present a new methodology for utilizing all CPU cores and all GPUs on a heterogeneous multicore and multi-GPU system to support matrix computations efficiently. Our approach is able to achieve the objectives of a high degree of parallelism, minimized synchronization, minimized communication, and load balancing. Our main idea is to treat the heterogeneous system as a distributed-memory machine, and to use a heterogeneous 1-D block cyclic distribution to allocate data to the host system and GPUs to minimize communication. We have designed heterogeneous algorithms with two different tile sizes (one for CPU cores and the other for GPUs) to cope with processor heterogeneity. We propose an auto-tuning method to determine the best tile sizes to attain both high performance and load balancing. We have also implemented a new runtime system and applied it to the Cholesky and QR factorizations. Our experiments on a compute node with two Intel Westmere hexa-core CPUs and three Nvidia Fermi GPUs demonstrate good weak scalability, strong scalability, load balance, and efficiency of our approach.
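The 1-D block-cyclic device mapping described above can be sketched in a few lines; the device list, tile widths, and round-robin rule here are illustrative assumptions, not the authors' runtime system.

```python
def block_cyclic_owners(n_tiles, devices):
    """Assign tile columns to devices in round-robin (1-D block-cyclic)
    order, so each device owns every len(devices)-th tile column."""
    return {j: devices[j % len(devices)] for j in range(n_tiles)}

# Hypothetical node layout: one CPU pool plus three GPUs, echoing the
# two-tile-size idea (narrow tiles for CPU cores, wide tiles for GPUs).
DEVICES = ["cpu", "gpu0", "gpu1", "gpu2"]
TILE_WIDTH = {"cpu": 128, "gpu0": 1024, "gpu1": 1024, "gpu2": 1024}

owners = block_cyclic_owners(8, DEVICES)
# Tile columns 0..7 map to cpu, gpu0, gpu1, gpu2, cpu, gpu0, gpu1, gpu2.
```

The cyclic assignment keeps every device busy in each panel of the factorization, which is why it balances load without global scheduling.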

  2. Interferometric synthetic aperture microscopy implementation on a floating point multi-core digital signal processor

    Science.gov (United States)

    Ahmad, Adeel; Ali, Murtaza; South, Fredrick; Monroy, Guillermo L.; Adie, Steven G.; Shemonski, Nathan; Carney, P. Scott; Boppart, Stephen A.

    2013-03-01

    The transition of optical coherence tomography (OCT) technology from the lab environment towards the more challenging clinical and point-of-care settings is continuing at a rapid pace. On one hand this translation opens new opportunities and avenues for growth, while on the other hand it also presents a new set of challenges and constraints under which OCT systems have to operate. OCT systems in the clinical environment are not only required to be user friendly and easy to operate, but should also be portable and have a smaller form factor coupled with low cost and reduced power consumption. Digital signal processors (DSPs) are in a unique position to satisfy the computational requirements for OCT at a much lower cost and power consumption compared to existing platforms such as CPUs and graphics processing units (GPUs). In this work, we describe the implementation of optical coherence tomography (OCT) and interferometric synthetic aperture microscopy (ISAM) processing on a floating point multi-core DSP (C6678, Texas Instruments). ISAM is a computationally intensive data processing technique that is based on the re-sampling of the Fourier space of the data to yield spatially invariant transverse resolution in OCT. Preliminary results indicate that 2D ISAM processing at 70,000 A-lines/sec and OCT at 180,000 A-lines/sec can be achieved with the current implementation using available DSP hardware.

  3. Matrix Algebra for GPU and Multicore Architectures (MAGMA) for Large Petascale Systems

    Energy Technology Data Exchange (ETDEWEB)

    Dongarra, Jack J. [University Distinguished Professor; Tomov, Stanimire [Research Scientist

    2014-03-24

    The goal of the MAGMA project is to create a new generation of linear algebra libraries that achieve the fastest possible time to an accurate solution on hybrid Multicore+GPU-based systems, using all the processing power that future high-end systems can make available within given energy constraints. Our efforts at the University of Tennessee achieved the goals set in all of the five areas identified in the proposal: 1. Communication optimal algorithms; 2. Autotuning for GPU and hybrid processors; 3. Scheduling and memory management techniques for heterogeneity and scale; 4. Fault tolerance and robustness for large scale systems; 5. Building energy efficiency into software foundations. The University of Tennessee’s main contributions, as proposed, were the research and software development of new algorithms for hybrid multi/many-core CPUs and GPUs, as related to two-sided factorizations and complete eigenproblem solvers, hybrid BLAS, and energy efficiency for dense, as well as sparse, operations. Furthermore, as proposed, we investigated and experimented with various techniques targeting the five main areas outlined.

  4. CRBLASTER: A Fast Parallel-Processing Program for Cosmic Ray Rejection in Space-Based Observations

    Science.gov (United States)

    Mighell, K.

    Many astronomical image analysis tasks are based on algorithms that can be described as embarrassingly parallel - where the analysis of one subimage generally does not affect the analysis of another subimage. Yet few parallel-processing astrophysical image-analysis programs exist that can easily take full advantage of today's fast multi-core servers costing a few thousand dollars. One reason for the shortage of state-of-the-art parallel-processing astrophysical image-analysis codes is that writing parallel codes has been perceived to be difficult. I describe a new fast parallel-processing image-analysis program called CRBLASTER which does cosmic ray rejection using van Dokkum's L.A.Cosmic algorithm. CRBLASTER is written in C using the industry-standard Message Passing Interface library. Processing a single 800 x 800 Hubble Space Telescope Wide-Field Planetary Camera 2 (WFPC2) image takes 1.9 seconds using 4 processors on an Apple Xserve with two dual-core 3.0-GHz Intel Xeons; the efficiency of the program running with the 4 cores is 82%. The code has been designed to be used as a software framework for the easy development of parallel-processing image-analysis programs using embarrassingly parallel algorithms: all that needs to be done is to replace the core image-processing task (in this case the C function that performs the L.A.Cosmic algorithm) with an alternative image-analysis task based on a single-processor algorithm. I describe the design and implementation of the program and then discuss how it could be used to quickly perform time-critical analysis applications such as those involved with space surveillance, or complex calibration tasks as part of the pipeline processing of images from large focal plane arrays.
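The embarrassingly parallel structure described here can be sketched with a thread pool standing in for MPI ranks; `remove_spikes` is a toy stand-in for the L.A.Cosmic step, not CRBLASTER's actual code.

```python
from concurrent.futures import ThreadPoolExecutor

def remove_spikes(subimage, threshold=100):
    """Toy per-subimage task (the slot L.A.Cosmic fills in CRBLASTER):
    clip any pixel above `threshold` to 0."""
    return [[0 if px > threshold else px for px in row] for row in subimage]

def process_image(image, n_workers=4):
    """Split an image into horizontal strips, clean each strip
    independently (the embarrassingly parallel step), and
    reassemble the strips in their original order."""
    rows = len(image)
    step = (rows + n_workers - 1) // n_workers
    strips = [image[i:i + step] for i in range(0, rows, step)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        cleaned = list(pool.map(remove_spikes, strips))
    return [row for strip in cleaned for row in strip]

image = [[10, 500, 20], [30, 40, 999], [1, 2, 3], [4, 5, 6]]
cleaned = process_image(image)  # spikes 500 and 999 are clipped to 0
```

Swapping `remove_spikes` for any other single-processor routine is exactly the framework reuse the abstract describes; a production version would distribute strips over MPI ranks rather than threads.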

  5. A FAST ITERATIVE METHOD FOR SOLVING THE EIKONAL EQUATION ON TRIANGULATED SURFACES.

    Science.gov (United States)

    Fu, Zhisong; Jeong, Won-Ki; Pan, Yongsheng; Kirby, Robert M; Whitaker, Ross T

    2011-01-01

    This paper presents an efficient, fine-grained parallel algorithm for solving the Eikonal equation on triangular meshes. The Eikonal equation, and the broader class of Hamilton-Jacobi equations to which it belongs, have a wide range of applications from geometric optics and seismology to biological modeling and analysis of geometry and images. The ability to solve such equations accurately and efficiently provides new capabilities for exploring and visualizing parameter spaces and for solving inverse problems that rely on such equations in the forward model. Efficient solvers on state-of-the-art, parallel architectures require new algorithms that are not, in many cases, optimal, but are better suited to synchronous updates of the solution. In previous work [W. K. Jeong and R. T. Whitaker, SIAM J. Sci. Comput., 30 (2008), pp. 2512-2534], the authors proposed the fast iterative method (FIM) to efficiently solve the Eikonal equation on regular grids. In this paper we extend the fast iterative method to solve Eikonal equations efficiently on triangulated domains on the CPU and on parallel architectures, including graphics processors. We propose a new local update scheme that provides solutions of first-order accuracy for both architectures. We also propose a novel triangle-based update scheme and its corresponding data structure for efficient irregular data mapping to parallel single-instruction multiple-data (SIMD) processors. We provide detailed descriptions of the implementations on a single CPU, a multicore CPU with shared memory, and SIMD architectures with comparative results against state-of-the-art Eikonal solvers.
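A minimal version of the fast iterative method on a regular grid (the setting of the authors' earlier work) might look like the following; the active-list management is a simplified sketch, not the paper's triangulated or SIMD implementation.

```python
import math

def _update(u, i, j, h):
    """Godunov upwind solution of |grad u| = 1 at grid node (i, j)."""
    n = len(u)
    INF = float("inf")
    a = min(u[i - 1][j] if i > 0 else INF, u[i + 1][j] if i < n - 1 else INF)
    b = min(u[i][j - 1] if j > 0 else INF, u[i][j + 1] if j < n - 1 else INF)
    if a == INF or b == INF or abs(a - b) >= h:
        return min(a, b) + h                      # one-sided update
    return (a + b + math.sqrt(2 * h * h - (a - b) ** 2)) / 2

def fim_eikonal(n, sources, h=1.0, tol=1e-12):
    """Fast iterative method on an n x n grid with unit speed: keep an
    active list of nodes, recompute each with the upwind scheme, retire
    nodes whose value stops changing, and re-activate neighbors of any
    node whose value improved."""
    INF = float("inf")
    u = [[INF] * n for _ in range(n)]
    src = set(sources)
    for (i, j) in src:
        u[i][j] = 0.0
    nbrs = ((-1, 0), (1, 0), (0, -1), (0, 1))
    active = {(i + di, j + dj) for (i, j) in src for di, dj in nbrs
              if 0 <= i + di < n and 0 <= j + dj < n} - src
    while active:
        nxt = set()
        for (i, j) in active:
            new = _update(u, i, j, h)
            if u[i][j] - new > tol:               # improved: stay active
                u[i][j] = new
                nxt.add((i, j))
                for di, dj in nbrs:               # and wake the neighbors
                    p, q = i + di, j + dj
                    if 0 <= p < n and 0 <= q < n and (p, q) not in src:
                        nxt.add((p, q))
        active = nxt
    return u
```

Because every node in the active list can be updated independently within a sweep, this loop maps naturally onto the synchronous SIMD updates the paper targets.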

  6. Multicore Architecture-aware Scientific Applications

    Energy Technology Data Exchange (ETDEWEB)

    Srinivasa, Avinash [Iowa State Univ., Ames, IA (United States)

    2011-11-28

    Modern high performance systems are becoming increasingly complex and powerful due to advancements in processor and memory architecture. In order to keep up with this increasing complexity, applications have to be augmented with certain capabilities to fully exploit such systems. These may be at the application level, such as static or dynamic adaptations, or at the system level, like having strategies in place to override some of the default operating system policies, the main objective being to improve the computational performance of the application. The current work proposes two such capabilities with respect to multi-threaded scientific applications, in particular a large-scale physics application computing ab-initio nuclear structure. The first involves using a middleware tool to invoke dynamic adaptations in the application, so as to be able to adjust to changing computational resource availability at run-time. The second involves a strategy for effective placement of data in main memory, to optimize memory access latencies and bandwidth. These capabilities, when included, were found to have a significant impact on application performance, resulting in average speedups of as much as two to four times.

  7. Multicore Parallel Implementation of 2D-FFTBased on TMS320C6678 DSP

    Institute of Scientific and Technical Information of China (English)

    Wende Wu; Zhiyong Xu

    2015-01-01

    We put forward a multicore parallel scheme for the 2D FFT and implement it on a TMS320C6678 DSP, after researching the characteristics of different multicore DSP programming models and the two-dimensional FFT (2D-FFT). We bring the parallel computing capability of the multicore DSP into full play and improve the efficiency of the 2D-FFT. This has great reference value for implementing image processing algorithms based on the 2D-FFT.
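The standard row-column decomposition that makes a 2D FFT easy to parallelize can be sketched as follows; this is a generic illustration using a thread pool, not the TMS320C6678 implementation.

```python
import cmath
from concurrent.futures import ThreadPoolExecutor

def fft(x):
    """Radix-2 Cooley-Tukey FFT (length must be a power of two)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def fft2_parallel(a, workers=4):
    """Row-column 2D FFT: independent 1-D FFTs over all rows in parallel,
    then over all columns of the intermediate result."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        rows = list(pool.map(fft, a))
        cols = list(pool.map(fft, zip(*rows)))  # zip(*rows) yields columns
    return [list(r) for r in zip(*cols)]        # transpose back
```

On a multicore DSP the same structure applies, with cores taking row (then column) batches; the transpose between the two passes is the main inter-core data movement to optimize.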

  8. Revisiting Multiple Pattern Matching Algorithms for Multi-Core Architecture

    Institute of Scientific and Technical Information of China (English)

    Guang-Ming Tan; Ping Liu; Dong-Bo Bu; Yan-Bing Liu

    2011-01-01

    Due to the huge number of patterns to be searched, multiple pattern searching remains a challenge to several newly-arising applications like network intrusion detection. In this paper, we present an attempt to design efficient multiple pattern searching algorithms on multi-core architectures. We observe an important feature which indicates that the multiple pattern matching time mainly depends on the number and minimal length of patterns. The multi-core algorithm proposed in this paper leverages this feature to decompose the pattern set so that the parallel execution time is minimized. We formulate the problem as an optimal decomposition and scheduling of a pattern set, then propose a heuristic algorithm, which takes advantage of dynamic programming and greedy algorithmic techniques, to solve the optimization problem. Experimental results suggest that our decomposition approach can increase the searching speed by more than 200% on a 4-core AMD Barcelona system.
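The decompose-then-search idea can be sketched like this; the cost model (cost proportional to 1/pattern-length, since short patterns filter less) and the greedy balancing rule are illustrative assumptions standing in for the paper's dynamic-programming heuristic.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def decompose(patterns, n_cores):
    """Greedily assign each pattern to the currently cheapest core,
    estimating a pattern's search cost as 1/len(pattern)."""
    bins = [(0.0, i, []) for i in range(n_cores)]
    heapq.heapify(bins)
    for p in sorted(patterns, key=len):          # costliest (shortest) first
        cost, i, group = heapq.heappop(bins)
        group.append(p)
        heapq.heappush(bins, (cost + 1.0 / len(p), i, group))
    return [g for _, _, g in bins]

def search(text, patterns):
    """Naive scan: all (position, pattern) occurrences in `text`."""
    hits = []
    for p in patterns:
        start = text.find(p)
        while start != -1:
            hits.append((start, p))
            start = text.find(p, start + 1)
    return hits

def parallel_match(text, patterns, n_cores=4):
    """Search each pattern group on its own core; the slowest group
    determines the parallel execution time being minimized."""
    groups = decompose(patterns, n_cores)
    with ThreadPoolExecutor(max_workers=n_cores) as pool:
        results = pool.map(lambda g: search(text, g), groups)
    return sorted(h for r in results for h in r)
```

A production matcher would run an Aho-Corasick automaton per group instead of the naive scan, but the decomposition and balancing logic is the part the paper optimizes.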

  9. Improved Parallel Apriori Algorithm for Multi-cores

    Directory of Open Access Journals (Sweden)

    Swati Rustogi

    2017-04-01

    Full Text Available The Apriori algorithm is one of the most popular data mining techniques, used for mining hidden relationships in large data sets. With parallelism, a large data set can be mined in less time. Apart from costly distributed systems, a computer supporting a multi-core environment can be used to apply parallelism. In this paper an improved Apriori algorithm for multi-core environments is proposed. The main contributions of this paper are: (i) an efficient Apriori algorithm that applies data parallelism in a multi-core environment by reducing the time taken to count the frequency of candidate item sets; (ii) an evaluation of the proposed algorithm's speedup on multiple cores; (iii) a comparison with another such parallel algorithm, showing an improvement of more than 15% in preliminary experiments.
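The data-parallel candidate-counting step can be sketched as follows; the partitioning scheme and the use of a thread pool are illustrative assumptions, not the paper's implementation.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_partition(transactions, candidates):
    """Count, within one data partition, how many transactions contain
    each candidate itemset -- the hot loop of every Apriori pass."""
    c = Counter()
    for t in transactions:
        tset = set(t)
        for cand in candidates:
            if set(cand) <= tset:
                c[cand] += 1
    return c

def parallel_counts(transactions, candidates, n_cores=4):
    """Split the transaction list across cores, count candidate
    frequencies per partition, then merge the partial counts."""
    step = (len(transactions) + n_cores - 1) // n_cores
    parts = [transactions[i:i + step]
             for i in range(0, len(transactions), step)]
    with ThreadPoolExecutor(max_workers=n_cores) as pool:
        partials = pool.map(lambda p: count_partition(p, candidates), parts)
    total = Counter()
    for c in partials:
        total.update(c)                # merge is a cheap reduction step
    return total
```

Because counting is associative, the merge costs only one pass over the candidate set, so the counting phase scales with the number of cores.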

  10. Network Coding Parallelization Based on Matrix Operations for Multicore Architectures

    DEFF Research Database (Denmark)

    Wunderlich, Simon; Cabrera, Juan; Fitzek, Frank

    2015-01-01

    . Despite the fact that single core implementations show already comparable coding speeds with standard coding approaches, this paper pushes network coding to the next level by exploiting multicore architectures. The disruptive idea presented in the paper is to break with current software implementations...... and coding approaches and to adopt highly optimized dense matrix operations from the high performance computation field for network coding in order to increase the coding speed. The paper presents the novel coding approach for multicore architectures and shows coding speed gains on a commercial platform...... such as the Raspberry Pi2 with four cores in the order of up to one full magnitude. The speed increase gain is even higher than the number of cores of the Raspberry Pi2 since the newly introduced approach exploits the cache architecture way better than by-the-book matrix operations. Copyright © 2015 by the Institute...
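
    The matrix view of network coding that this work exploits can be sketched over GF(2) (a pure-Python illustration of the algebra only; the paper's point is to use highly optimized dense matrix routines for these operations): coded packets are the product of a coefficient matrix with the packet matrix, and a receiver decodes by Gaussian elimination on the augmented system.

```python
def matmul_gf2(A, B):
    # Coded packets = C * P over GF(2).
    return [[sum(a * b for a, b in zip(row, col)) % 2
             for col in zip(*B)] for row in A]

def solve_gf2(C, coded):
    # Gauss-Jordan elimination over GF(2) on the augmented matrix [C | coded].
    n = len(C)
    aug = [C[i] + coded[i] for i in range(n)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col])
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col and aug[r][col]:
                aug[r] = [(x + y) % 2 for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]   # the recovered packet matrix
```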

  11. Accelerating Atmospheric Modeling Through Emerging Multi-core Technologies

    OpenAIRE

    Linford, John Christian

    2010-01-01

    The new generations of multi-core chipset architectures achieve unprecedented levels of computational power while respecting physical and economical constraints. The cost of this power is bewildering program complexity. Atmospheric modeling is a grand-challenge problem that could make good use of these architectures if they were more accessible to the average programmer. To that end, software tools and programming methodologies that greatly simplify the acceleration of atmospheric modeling...

  12. A Real-Time Linux for Multicore Platforms

    Science.gov (United States)

    2013-12-20

    Multicore Platforms, 25th Euromicro Conference on Real-Time Systems, 09-JUL-13. Jeremy Erickson, James Anderson, Reducing Tardiness Under Global...; Baruah, "Improved tardiness bounds for Global EDF," Proceedings of the EuroMicro Conference on Real-Time Systems, Brussels, Belgium, IEEE Computer...missed. In a soft real-time application, some deadline tardiness is permissible. In the definition of "soft real-time" that we have focused on

  13. Supermodes in Coupled Multi-Core Waveguide Structures

    Science.gov (United States)

    2016-04-01

    gentle bends. Because modes travel at different group velocities, if three guided modes are assumed to be excited with amplitudes A, B, C and phase φa... mode multi-core waveguide arrays can be strongly affected by angle-dependent couplings, leading to different modal field profiles. Analytical... and 37 sites. We begin our analysis by assuming that, in all cases, each waveguide element is cylindrical (of radius a) and single-moded

  14. Multicore Performance of Block Algebraic Iterative Reconstruction Methods

    DEFF Research Database (Denmark)

    Sørensen, Hans Henrik B.; Hansen, Per Christian

    2014-01-01

    Algebraic iterative methods are routinely used for solving the ill-posed sparse linear systems arising in tomographic image reconstruction. Here we consider the algebraic reconstruction technique (ART) and the simultaneous iterative reconstruction techniques (SIRT), both of which rely...... a fixed relaxation parameter in each method, namely, the one that leads to the fastest semiconvergence. Computational results show that for multicore computers, the sequential approach is preferable....
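
    A minimal sketch of the ART (Kaczmarz) iteration with a fixed relaxation parameter, the setting studied here (pure Python with dense rows, for illustration only; production codes use sparse storage and, for SIRT, block-parallel updates):

```python
def art_sweep(A, b, x, relaxation=1.0):
    # One Kaczmarz sweep: project x onto the hyperplane of each row in turn.
    for row, bi in zip(A, b):
        norm_sq = sum(a * a for a in row)
        if norm_sq == 0:
            continue
        residual = bi - sum(a * xi for a, xi in zip(row, x))
        step = relaxation * residual / norm_sq
        x = [xi + step * a for xi, a in zip(x, row)]
    return x

def art(A, b, iterations=50, relaxation=1.0):
    x = [0.0] * len(A[0])
    for _ in range(iterations):
        x = art_sweep(A, b, x, relaxation)
    return x
```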

  15. Multi-core processing and scheduling performance in CMS

    CERN Document Server

    CERN. Geneva

    2012-01-01

    Commodity hardware is going many-core. We might soon not be able to satisfy the per-core job memory needs in the current single-core processing model in High Energy Physics. In addition, an ever-increasing number of independent and incoherent jobs running on the same physical hardware without sharing resources might significantly affect processing performance. It will be essential to utilize the multi-core architecture effectively. CMS has incorporated support for multi-core processing in the event processing framework and the workload management system. Multi-core processing jobs share common data in memory, such as the code libraries, detector geometry and conditions data, resulting in much lower memory usage than standard single-core independent jobs. Exploiting this new processing model requires a new model of computing resource allocation, departing from the standard single-core allocation for a job. The experiment job management system needs to have control over a larger quantum of resource since multi-...

  16. PERI - Auto-tuning Memory Intensive Kernels for Multicore

    Energy Technology Data Exchange (ETDEWEB)

    Bailey, David H; Williams, Samuel; Datta, Kaushik; Carter, Jonathan; Oliker, Leonid; Shalf, John; Yelick, Katherine; Bailey, David H

    2008-06-24

    We present an auto-tuning approach to optimize application performance on emerging multicore architectures. The methodology extends the idea of search-based performance optimizations, popular in linear algebra and FFT libraries, to application-specific computational kernels. Our work applies this strategy to Sparse Matrix Vector Multiplication (SpMV), the explicit heat equation PDE on a regular grid (Stencil), and a lattice Boltzmann application (LBMHD). We explore one of the broadest sets of multicore architectures in the HPC literature, including the Intel Xeon Clovertown, AMD Opteron Barcelona, Sun Victoria Falls, and the Sony-Toshiba-IBM (STI) Cell. Rather than hand-tuning each kernel for each system, we develop a code generator for each kernel that allows us to identify a highly optimized version for each platform, while amortizing the human programming effort. Results show that our auto-tuned kernel applications often achieve a better than 4X improvement compared with the original code. Additionally, we analyze a Roofline performance model for each platform to reveal hardware bottlenecks and software challenges for future multicore systems and applications.
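
    The search-based tuning loop itself is simple to sketch (the candidate kernels below are stand-ins that compute the same dot product with different chunk sizes; real auto-tuners generate variants over blockings, unrollings, and prefetch strategies per platform): each variant is timed and the fastest configuration is kept.

```python
import time

def make_kernel(chunk):
    # Stand-in for a generated kernel variant: same result, different loop blocking.
    def kernel(x, y):
        total = 0.0
        for i in range(0, len(x), chunk):
            for j in range(i, min(i + chunk, len(x))):
                total += x[j] * y[j]
        return total
    return kernel

def autotune(x, y, chunk_sizes):
    # Time every variant on this machine and keep the fastest configuration.
    best, best_time = None, float("inf")
    for chunk in chunk_sizes:
        kernel = make_kernel(chunk)
        start = time.perf_counter()
        kernel(x, y)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = chunk, elapsed
    return best
```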

  17. Polytopol computing for multi-core and distributed systems

    Science.gov (United States)

    Spaanenburg, Henk; Spaanenburg, Lambert; Ranefors, Johan

    2009-05-01

    Multi-core computing provides new challenges to software engineering. The paper addresses such issues in the general setting of polytopol computing, that takes multi-core problems in such widely differing areas as ambient intelligence sensor networks and cloud computing into account. It argues that the essence lies in a suitable allocation of free moving tasks. Where hardware is ubiquitous and pervasive, the network is virtualized into a connection of software snippets judiciously injected to such hardware that a system function looks as one again. The concept of polytopol computing provides a further formalization in terms of the partitioning of labor between collector and sensor nodes. Collectors provide functions such as a knowledge integrator, awareness collector, situation displayer/reporter, communicator of clues and an inquiry-interface provider. Sensors provide functions such as anomaly detection (only communicating singularities, not continuous observation), they are generally powered or self-powered, amorphous (not on a grid) with generation-and-attrition, field re-programmable, and sensor plug-and-play-able. Together the collector and the sensor are part of the skeleton injector mechanism, added to every node, and give the network the ability to organize itself into some of many topologies. Finally we will discuss a number of applications and indicate how a multi-core architecture supports the security aspects of the skeleton injector.

  18. Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms

    Energy Technology Data Exchange (ETDEWEB)

    Williams, Samuel; Carter, Jonathan; Oliker, Leonid; Shalf, John; Yelick, Katherine

    2008-02-01

    We present an auto-tuning approach to optimize application performance on emerging multicore architectures. The methodology extends the idea of search-based performance optimizations, popular in linear algebra and FFT libraries, to application-specific computational kernels. Our work applies this strategy to a lattice Boltzmann application (LBMHD) that historically has made poor use of scalar microprocessors due to its complex data structures and memory access patterns. We explore one of the broadest sets of multicore architectures in the HPC literature, including the Intel Clovertown, AMD Opteron X2, Sun Niagara2, STI Cell, as well as the single-core Intel Itanium2. Rather than hand-tuning LBMHD for each system, we develop a code generator that allows us to identify a highly optimized version for each platform, while amortizing the human programming effort. Results show that our auto-tuned LBMHD application achieves up to a 14x improvement compared with the original code. Additionally, we present detailed analysis of each optimization, which reveals surprising hardware bottlenecks and software challenges for future multicore systems and applications.

  19. Composite Pseudo Associative Cache with Victim Cache for Mobile Processors

    Directory of Open Access Journals (Sweden)

    Lakshmi D. Bobbala

    2010-01-01

    Full Text Available Problem statement: Multi-core trends are becoming dominant, creating sophisticated and complicated cache structures. One of the easiest ways to design cache memory for increasing performance is to double the cache size. The big cache size is directly related to the area and power consumption. Especially in mobile processors, a simple increase of the cache size may significantly affect chip area and power. Without increasing the size of the cache, we propose a novel method to improve the overall performance. Approach: We proposed a composite cache mechanism for L1 and L2 caches to maximize cache performance within a given cache size. This technique could be used without increasing cache size or set associativity, by emphasizing primary way utilization and pseudo-associativity. We also added a victim cache to the composite pseudo-associative cache for further improvement. Results: Based on our experiments with the sampled SPEC CPU2006 workload, the proposed cache mechanism showed a remarkable reduction in cache misses without affecting the size. Conclusion/Recommendation: The variation of performance improvement depends on benchmark, cache size and set associativity, but the proposed scheme shows more sensitivity to a cache size increase than to a set associativity increase.
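
    The victim-cache idea can be illustrated with a toy simulator (an illustrative sketch, not the proposed composite pseudo-associative mechanism itself): a direct-mapped array is backed by a small fully associative victim buffer that is probed on a miss, so recently evicted lines can be recovered without enlarging the cache.

```python
from collections import OrderedDict

class VictimCache:
    """Direct-mapped cache with a small fully associative victim buffer."""

    def __init__(self, n_sets, victim_entries):
        self.n_sets = n_sets
        self.victim_entries = victim_entries
        self.lines = {}               # set index -> cached block address
        self.victim = OrderedDict()   # recently evicted block addresses (LRU)
        self.hits = self.misses = 0

    def _evict_to_victim(self, index):
        if index in self.lines:
            self.victim[self.lines.pop(index)] = True
            if len(self.victim) > self.victim_entries:
                self.victim.popitem(last=False)   # drop least recently evicted

    def access(self, address):
        index = address % self.n_sets
        if self.lines.get(index) == address:
            self.hits += 1
            return "hit"
        if address in self.victim:
            del self.victim[address]              # swap back into the main array
            self._evict_to_victim(index)
            self.lines[index] = address
            self.hits += 1
            return "victim-hit"
        self.misses += 1
        self._evict_to_victim(index)
        self.lines[index] = address
        return "miss"
```

    In the usage below, two addresses that conflict in the same set would normally ping-pong as misses; the victim buffer turns the second reference to the evicted line into a hit.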

  20. Composite Pseudo Associative Cache with Victim Cache for Mobile Processors

    Directory of Open Access Journals (Sweden)

    Lakshmi D. Bobbala

    2011-01-01

    Full Text Available Problem statement: Multi-core trends are becoming dominant, creating sophisticated and complicated cache structures. One of the easiest ways to design cache memory for increasing performance is to double the cache size. The big cache size is directly related to the area and power consumption. Especially in mobile processors, a simple increase of the cache size may significantly affect chip area and power. Without increasing the size of the cache, we propose a novel method to improve the overall performance. Approach: We proposed a composite cache mechanism for L1 and L2 caches to maximize cache performance within a given cache size. This technique could be used without increasing cache size or set associativity, by emphasizing primary way utilization and pseudo-associativity. We also added a victim cache to the composite pseudo-associative cache for further improvement. Results: Based on our experiments with the sampled SPEC CPU2006 workload, the proposed cache mechanism showed a remarkable reduction in cache misses without affecting the size. Conclusion/Recommendation: The variation of performance improvement depends on benchmark, cache size and set associativity, but the proposed scheme shows more sensitivity to a cache size increase than to a set associativity increase.

  1. Libera Electron Beam Position Processor

    CERN Document Server

    Ursic, Rok

    2005-01-01

    Libera is a product family delivering unprecedented possibilities for either building powerful single-station solutions or architecting complex feedback systems in the field of accelerator instrumentation and controls. This paper presents the functionality and field performance of its first member, the electron beam position processor. It offers superior performance, with multiple measurement channels simultaneously delivering position measurements in digital format with MHz, kHz, and Hz bandwidths. This all-in-one product, facilitating pulsed and CW measurements, is much more than simply a high-performance beam position measuring device delivering micrometer-level reproducibility with sub-micrometer resolution. Rich connectivity options and innate processing power make it a powerful feedback building block. By interconnecting multiple Libera electron beam position processors one can build a low-latency, high-throughput orbit feedback system without adding additional hardware. Libera electron beam position processor ...

  2. Java Processor Optimized for RTSJ

    Directory of Open Access Journals (Sweden)

    Tu Shiliang

    2007-01-01

    Full Text Available Due to the preeminent work on the real-time specification for Java (RTSJ), Java is increasingly expected to become the leading programming language for real-time systems. To provide a Java platform suitable for real-time applications, a Java processor that can execute Java bytecode directly is proposed in this paper. It provides efficient hardware support for some mechanisms specified in the RTSJ and offers a simpler programming model by ameliorating the scoped memory of the RTSJ. The worst-case execution time (WCET) of the bytecodes implemented in this processor is predictable by employing the optimization method proposed in our previous work, in which all processing that interferes with predictability is handled before bytecode execution. A further advantage of this method is that it makes the implementation of the processor simpler and suited to a low-cost FPGA chip.

  3. Making CSB + -Trees Processor Conscious

    DEFF Research Database (Denmark)

    Samuel, Michael; Pedersen, Anders Uhl; Bonnet, Philippe

    2005-01-01

    Cache-conscious indexes, such as CSB+-tree, are sensitive to the underlying processor architecture. In this paper, we focus on how to adapt the CSB+-tree so that it performs well on a range of different processor architectures. Previous work has focused on the impact of node size on the performance...... of the CSB+-tree. We argue that it is necessary to consider a larger group of parameters in order to adapt CSB+-tree to processor architectures as different as Pentium and Itanium. We identify this group of parameters and study how it impacts the performance of CSB+-tree on Itanium 2. Finally, we propose...... a systematic method for adapting CSB+-tree to new platforms. This work is a first step towards integrating CSB+-tree in MySQL’s heap storage manager....

  4. Reconfigurable Communication Processor:A New Approach for Network Processor

    Institute of Scientific and Technical Information of China (English)

    孙华; 陈青山; 张文渊

    2003-01-01

    As the traditional RISC + ASIC/ASSP approach to network processor design cannot meet today's requirements, this paper describes an alternative approach, the Reconfigurable Processing Architecture, which boosts performance to ASIC level while preserving the programmability of a traditional RISC-based system. This paper covers both the hardware architecture and the software development environment architecture.

  5. Fast Automatic Segmentation of White Matter Streamlines Based on a Multi-Subject Bundle Atlas.

    Science.gov (United States)

    Labra, Nicole; Guevara, Pamela; Duclap, Delphine; Houenou, Josselin; Poupon, Cyril; Mangin, Jean-François; Figueroa, Miguel

    2017-01-01

    This paper presents an algorithm for fast segmentation of white matter bundles from massive dMRI tractography datasets using a multisubject atlas. We use a distance metric to compare streamlines in a subject dataset to labeled centroids in the atlas, and label them using a per-bundle configurable threshold. In order to reduce segmentation time, the algorithm first preprocesses the data using a simplified distance metric to rapidly discard candidate streamlines in multiple stages, while guaranteeing that no false negatives are produced. The smaller set of remaining streamlines is then segmented using the original metric, thus eliminating any false positives from the preprocessing stage. As a result, a single-thread implementation of the algorithm can segment a dataset of almost 9 million streamlines in less than 6 minutes. Moreover, parallel versions of our algorithm for multicore processors and graphics processing units further reduce the segmentation time to less than 22 seconds and to 5 seconds, respectively. This performance enables the use of the algorithm in truly interactive applications for visualization, analysis, and segmentation of large white matter tractography datasets.
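
    The no-false-negative prefiltering can be sketched with simplified metrics (the distance functions and threshold here are illustrative, not the paper's per-bundle configuration): because the endpoint distance can never exceed the maximum pointwise distance, discarding streamlines on the cheap endpoint check loses no true matches, and the exact metric is only evaluated on the survivors.

```python
import math

def exact_metric(s, centroid):
    # Expensive metric: maximum pointwise distance to the bundle centroid.
    return max(math.dist(p, q) for p, q in zip(s, centroid))

def endpoint_bound(s, centroid):
    # Cheap lower bound: only the two endpoints are compared, so it can
    # never exceed the exact metric above.
    return max(math.dist(s[0], centroid[0]), math.dist(s[-1], centroid[-1]))

def segment(streamlines, centroid, threshold):
    # Stage 1: safely discard candidates on the cheap bound (no false negatives).
    survivors = [s for s in streamlines
                 if endpoint_bound(s, centroid) <= threshold]
    # Stage 2: exact metric removes any false positives from stage 1.
    return [s for s in survivors if exact_metric(s, centroid) <= threshold]
```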

  6. Parallelized Kalman-Filter-Based Reconstruction of Particle Tracks on Many-Core Processors and GPUs

    Science.gov (United States)

    Cerati, Giuseppe; Elmer, Peter; Krutelyov, Slava; Lantz, Steven; Lefebvre, Matthieu; Masciovecchio, Mario; McDermott, Kevin; Riley, Daniel; Tadel, Matevž; Wittich, Peter; Würthwein, Frank; Yagil, Avi

    2017-08-01

    For over a decade now, physical and energy constraints have limited clock speed improvements in commodity microprocessors. Instead, chipmakers have been pushed into producing lower-power, multi-core processors such as Graphical Processing Units (GPU), ARM CPUs, and Intel MICs. Broad-based efforts from manufacturers and developers have been devoted to making these processors user-friendly enough to perform general computations. However, extracting performance from a larger number of cores, as well as specialized vector or SIMD units, requires special care in algorithm design and code optimization. One of the most computationally challenging problems in high-energy particle experiments is finding and fitting the charged-particle tracks during event reconstruction. This is expected to become by far the dominant problem at the High-Luminosity Large Hadron Collider (HL-LHC), for example. Today the most common track finding methods are those based on the Kalman filter. Experience with Kalman techniques on real tracking detector systems has shown that they are robust and provide high physics performance. This is why they are currently in use at the LHC, both in the trigger and offline. Previously we reported on the significant parallel speedups that resulted from our investigations to adapt Kalman filters to track fitting and track building on Intel Xeon and Xeon Phi. Here, we discuss our progress toward understanding these processors and the new developments to port the Kalman filter to NVIDIA GPUs.
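
    For readers unfamiliar with the method, a minimal one-dimensional constant-velocity Kalman filter shows the predict/update cycle that track fitting repeats per detector hit (a didactic sketch with assumed noise parameters, far simpler than the experiments' multi-dimensional track states):

```python
def kalman_track(measurements, dt=1.0, q=1e-3, r=0.1):
    # State is (position, velocity); only position is measured.
    # q and r are assumed process and measurement noise variances.
    x = [measurements[0], 0.0]
    P = [[1.0, 0.0], [0.0, 1.0]]               # state covariance
    for z in measurements[1:]:
        # Predict with F = [[1, dt], [0, 1]]; process noise q on the diagonal.
        x = [x[0] + dt * x[1], x[1]]
        P = [[P[0][0] + dt*(P[1][0] + P[0][1]) + dt*dt*P[1][1] + q,
              P[0][1] + dt*P[1][1]],
             [P[1][0] + dt*P[1][1], P[1][1] + q]]
        # Update with H = [1, 0]: gain K = P H^T / (H P H^T + r).
        s = P[0][0] + r
        K = [P[0][0] / s, P[1][0] / s]
        residual = z - x[0]
        x = [x[0] + K[0]*residual, x[1] + K[1]*residual]
        P = [[(1 - K[0])*P[0][0], (1 - K[0])*P[0][1]],
             [P[1][0] - K[1]*P[0][0], P[1][1] - K[1]*P[0][1]]]
    return x
```

    The parallelization challenge described in the paper comes from running many such filters concurrently over candidate tracks while keeping SIMD lanes and cores busy.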

  7. Parallelized Kalman-Filter-Based Reconstruction of Particle Tracks on Many-Core Processors and GPUs

    Directory of Open Access Journals (Sweden)

    Cerati Giuseppe

    2017-01-01

    Full Text Available For over a decade now, physical and energy constraints have limited clock speed improvements in commodity microprocessors. Instead, chipmakers have been pushed into producing lower-power, multi-core processors such as Graphical Processing Units (GPU), ARM CPUs, and Intel MICs. Broad-based efforts from manufacturers and developers have been devoted to making these processors user-friendly enough to perform general computations. However, extracting performance from a larger number of cores, as well as specialized vector or SIMD units, requires special care in algorithm design and code optimization. One of the most computationally challenging problems in high-energy particle experiments is finding and fitting the charged-particle tracks during event reconstruction. This is expected to become by far the dominant problem at the High-Luminosity Large Hadron Collider (HL-LHC), for example. Today the most common track finding methods are those based on the Kalman filter. Experience with Kalman techniques on real tracking detector systems has shown that they are robust and provide high physics performance. This is why they are currently in use at the LHC, both in the trigger and offline. Previously we reported on the significant parallel speedups that resulted from our investigations to adapt Kalman filters to track fitting and track building on Intel Xeon and Xeon Phi. Here, we discuss our progress toward understanding these processors and the new developments to port the Kalman filter to NVIDIA GPUs.

  8. ASSP Advanced Sensor Signal Processor.

    Science.gov (United States)

    1984-06-01

    transfer data and commands. When a processor receives the required data (image) and/or command, that data will be operated on autonomously. The... RAM is provided by two separately controlled DMA address generator chips (Am2940). Each of these DMA chips creates an 8-bit address. One DMA chip gene...

  9. LIBS data analysis using a predictor-corrector based digital signal processor algorithm

    Science.gov (United States)

    Sanders, Alex; Griffin, Steven T.; Robinson, Aaron

    2012-06-01

    There are many accepted sensor technologies for generating spectra for material classification. Once the spectra are generated, communication bandwidth limitations favor local material classification with its attendant reduction in data transfer rates and power consumption. Sensor technologies such as Cavity Ring-Down Spectroscopy (CRDS) and Laser-Induced Breakdown Spectroscopy (LIBS) require effective material classifiers. A result of recent efforts has been emphasis on Partial Least Squares - Discriminant Analysis (PLS-DA) and Principal Component Analysis (PCA). Implementation of these via general-purpose computers is difficult in small portable sensor configurations. This paper addresses the creation of a low-mass, low-power, robust hardware spectra classifier for a limited set of predetermined materials in an atmospheric matrix. Crucial to this is the incorporation of PCA or PLS-DA classifiers into a predictor-corrector style implementation. The system configuration guarantees rapid convergence. Software running on multi-core Digital Signal Processors (DSPs) simulates a streamlined plasma physics model estimator, reducing Analog-to-Digital (ADC) power requirements. This paper presents the results of a predictor-corrector model implemented on a low-power multi-core DSP to perform substance classification. This configuration emphasizes the hardware system and software design via a predictor-corrector model that simultaneously decreases the sample rate while performing the classification.

  10. Cassava processors' awareness of occupational and environmental ...

    African Journals Online (AJOL)

    Cassava processors' awareness of occupational and environmental hazards ... Majority of the respondents also complained of lack of water (78.4%), lack of ... so as to reduce the problems faced by cassava processors during processing.

  11. Parallel Sparse Linear System and Eigenvalue Problem Solvers: From Multicore to Petascale Computing

    Science.gov (United States)

    2015-06-01

    problems that achieve high performance on a single multicore node and on clusters of many multicore nodes. Further, we demonstrate the superior robustness and parallel scalability of our solvers compared to other publicly available parallel solvers for these two fundamental... LU- and algebraic multigrid-preconditioned Krylov subspace methods. This has been demonstrated in previous annual reports of this

  12. An object-oriented bulk synchronous parallel library for multicore programming

    NARCIS (Netherlands)

    Yzelman, A.N.; Bisseling, R.H.

    2012-01-01

    We show that the bulk synchronous parallel (BSP) model, originally designed for distributed-memory systems, is also applicable for shared-memory multicore systems and, furthermore, that BSP libraries are useful in scientific computing on these systems. A proof-of-concept MulticoreBSP library has
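
    The BSP structure on shared memory can be sketched with standard-library threads (an illustration of the model only, not the MulticoreBSP API): each thread runs a sequence of supersteps, and a barrier at the end of each superstep makes all prior writes visible before the next one begins.

```python
import threading

def bsp_run(n_threads, superstep_fns):
    barrier = threading.Barrier(n_threads)
    shared = [0] * n_threads          # one slot per "processor"

    def worker(pid):
        for fn in superstep_fns:
            fn(pid, shared)           # local computation + writes
            barrier.wait()            # end of superstep: communication completes

    threads = [threading.Thread(target=worker, args=(pid,))
               for pid in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared
```

    For example, a first superstep in which every thread publishes a value and a second in which thread 0 sums them is race-free precisely because of the intervening barrier.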

  13. Novel Crosstalk Measurement Method for Multi-Core Fiber Fan-In/Fan-Out Devices

    DEFF Research Database (Denmark)

    Ye, Feihong; Ono, Hirotaka; Abe, Yoshiteru;

    2016-01-01

    We propose a new crosstalk measurement method for multi-core fiber fan-in/fan-out devices utilizing the Fresnel reflection. Compared with the traditional method using core-to-core coupling between a multi-core fiber and a single-mode fiber, the proposed method has the advantages of high reliabili...

  14. Design Principles for Synthesizable Processor Cores

    DEFF Research Database (Denmark)

    Schleuniger, Pascal; McKee, Sally A.; Karlsson, Sven

    2012-01-01

    As FPGAs get more competitive, synthesizable processor cores become an attractive choice for embedded computing. Currently popular commercial processor cores do not fully exploit current FPGA architectures. In this paper, we propose general design principles to increase instruction throughput...... on FPGA-based processor cores: first, superpipelining enables higher-frequency system clocks, and second, predicated instructions circumvent costly pipeline stalls due to branches. To evaluate their effects, we develop Tinuso, a processor architecture optimized for FPGA implementation. We demonstrate...

  15. 40 CFR 791.45 - Processors.

    Science.gov (United States)

    2010-07-01

    ... 40 Protection of Environment 31 2010-07-01 2010-07-01 true Processors. 791.45 Section 791.45 Protection of Environment ENVIRONMENTAL PROTECTION AGENCY (CONTINUED) TOXIC SUBSTANCES CONTROL ACT (CONTINUED) DATA REIMBURSEMENT Basis for Proposed Order § 791.45 Processors. (a) Generally, processors will be...

  16. Stencil Computation Optimization and Auto-tuning on State-of-the-Art Multicore Architectures

    Energy Technology Data Exchange (ETDEWEB)

    Datta, Kaushik; Murphy, Mark; Volkov, Vasily; Williams, Samuel; Carter, Jonathan; Oliker, Leonid; Patterson, David; Shalf, John; Yelick, Katherine

    2008-08-22

    Understanding the most efficient design and utilization of emerging multicore systems is one of the most challenging questions faced by the mainstream and scientific computing industries in several decades. Our work explores multicore stencil (nearest-neighbor) computations -- a class of algorithms at the heart of many structured grid codes, including PDE solvers. We develop a number of effective optimization strategies, and build an auto-tuning environment that searches over our optimizations and their parameters to minimize runtime, while maximizing performance portability. To evaluate the effectiveness of these strategies we explore the broadest set of multicore architectures in the current HPC literature, including the Intel Clovertown, AMD Barcelona, Sun Victoria Falls, IBM QS22 PowerXCell 8i, and NVIDIA GTX280. Overall, our auto-tuning optimization methodology results in the fastest multicore stencil performance to date. Finally, we present several key insights into the architectural trade-offs of emerging multicore designs and their implications on scientific algorithm development.

  17. Addressing the challenges of standalone multi-core simulations in molecular dynamics

    Science.gov (United States)

    Ocaya, R. O.; Terblans, J. J.

    2017-07-01

    Computational modelling in material science involves mathematical abstractions of force fields between particles with the aim to postulate, develop and understand materials by simulation. The aggregated pairwise interactions of the material's particles lead to a deduction of its macroscopic behaviours. For practically meaningful macroscopic scales, a large amount of data is generated, leading to vast execution times. Simulation times of hours, days or weeks for moderately sized problems are not uncommon. The reduction of simulation times, improved result accuracy and the associated software and hardware engineering challenges are the main motivations for much of the ongoing research in the computational sciences. This contribution is concerned mainly with simulations that can be done on a "standalone" computer based on Message Passing Interfaces (MPI), parallel code running on hardware platforms with wide specifications, such as single/multi-processor, multi-core machines with minimal reconfiguration for upward scaling of computational power. The widely available, documented and standardized MPI library provides this functionality through the MPI_Comm_size(), MPI_Comm_rank() and MPI_Reduce() functions. A survey of the literature shows that relatively little is written with respect to the efficient extraction of the inherent computational power in a cluster. In this work, we discuss the main avenues available to tap into this extra power without compromising computational accuracy. We also present methods to overcome the high inertia encountered in single-node-based computational molecular dynamics. We begin by surveying the current state of the art and discuss what it takes to achieve parallelism, efficiency and enhanced computational accuracy through program threads and message passing interfaces. Several code illustrations are given. The pros and cons of writing raw code as opposed to using heuristic, third-party code are also discussed.
The growing trend
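
    The size/rank/reduce pattern named above can be emulated with the standard library so that the sketch runs without an MPI installation (function names here are illustrative; with real MPI the same shape uses MPI_Comm_size, MPI_Comm_rank and MPI_Reduce): each "rank" sums a strided slice of the data, and the partial results are combined at a root.

```python
from concurrent.futures import ThreadPoolExecutor

def local_sum(data, rank, size):
    # Mirrors MPI_Comm_rank/MPI_Comm_size: each rank takes a strided slice.
    return sum(data[rank::size])

def reduce_sum(data, size=4):
    with ThreadPoolExecutor(max_workers=size) as pool:
        futures = [pool.submit(local_sum, data, r, size) for r in range(size)]
        # Plays the role of MPI_Reduce with MPI_SUM at rank 0.
        return sum(f.result() for f in futures)
```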

  18. Digital signal processor and processing method for GPS receivers

    Science.gov (United States)

    Thomas, Jr., Jess B. (Inventor)

    1989-01-01

    A digital signal processor and processing method therefor for use in receivers of the NAVSTAR/GLOBAL POSITIONING SYSTEM (GPS) employs a digital carrier down-converter, digital code correlator and digital tracking processor. The digital carrier down-converter and code correlator consists of an all-digital, minimum bit implementation that utilizes digital chip and phase advancers, providing exceptional control and accuracy in feedback phase and in feedback delay. Roundoff and commensurability errors can be reduced to extremely small values (e.g., less than 100 nanochips and 100 nanocycles roundoff errors and 0.1 millichip and 1 millicycle commensurability errors). The digital tracking processor bases the fast feedback for phase and for group delay in the C/A, P.sub.1, and P.sub.2 channels on the L.sub.1 C/A carrier phase thereby maintaining lock at lower signal-to-noise ratios, reducing errors in feedback delays, reducing the frequency of cycle slips and in some cases obviating the need for quadrature processing in the P channels. Simple and reliable methods are employed for data bit synchronization, data bit removal and cycle counting. Improved precision in averaged output delay values is provided by carrier-aided data-compression techniques. The signal processor employs purely digital operations in the sense that exactly the same carrier phase and group delay measurements are obtained, to the last decimal place, every time the same sampled data (i.e., exactly the same bits) are processed.
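
    The carrier down-conversion step can be sketched as follows (a simplified illustration of the principle only, not the patented minimum-bit implementation): the sampled signal is mixed with a numerically controlled oscillator and accumulated, and the phase of the accumulated sum estimates the carrier phase.

```python
import cmath
import math

def down_convert(samples, carrier_freq, sample_rate):
    # Mix each sample with the NCO and accumulate (low-pass by summation).
    acc = 0j
    for n, s in enumerate(samples):
        nco = cmath.exp(-2j * math.pi * carrier_freq * n / sample_rate)
        acc += s * nco
    return acc

def carrier_phase(samples, carrier_freq, sample_rate):
    # Phase of the accumulator estimates the carrier phase offset.
    return cmath.phase(down_convert(samples, carrier_freq, sample_rate))
```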

  19. Clinical value of core length in contemporary multicore prostate biopsy.

    Science.gov (United States)

    Lee, Sangchul; Jeong, Seong Jin; Hwang, Sung Il; Hong, Sung Kyu; Lee, Hak Jong; Byun, Seok Soo; Choe, Gheeyoung; Lee, Sang Eun

    2015-01-01

    There is little data about the clinical value of core length for prostate biopsy (PBx). We investigated the clinical values of various clinicopathological biopsy-related parameters, including core length, in the contemporary multi-core PBx. Medical records of 5,243 consecutive patients who received PBx at our institution were reviewed. Among them, 3,479 patients with prostate-specific antigen (PSA) ≤ 10 ng/ml level who received transrectal ultrasound (TRUS)-guided multi (≥ 12)-core PBx at our institution were analyzed for prostate cancer (PCa). Gleason score upgrading (GSU) was analyzed in 339 patients who were diagnosed with low-risk PCa and received radical prostatectomy. Multivariate logistic regression analyses for PCa detection and prediction of GSU were performed. The mean age and PSA of the entire cohort were 63.5 years and 5.4 ng/ml, respectively. The overall cancer detection rate was 28.5%. There was no statistical difference in core length between patients diagnosed with PCa and those without PCa (16.1 ± 1.8 vs 16.1 ± 1.9 mm, P = 0.945). The core length was also not significantly different (16.4 ± 1.7 vs 16.4 ± 1.6 mm, P = 0.889) between the GSU group and the non-GSU group. Multivariate logistic regression analyses demonstrated that the core length of PBx did not affect PCa detection in TRUS-guided multi-core PBx (P = 0.923) and was not prognostic for GSU in patients with low-risk PCa (P = 0.356). In patients undergoing contemporary multi-core PBx, core length may not have a significant impact on PCa detection or on GSU following radical prostatectomy in the low-risk PCa group.

  20. Clinical value of core length in contemporary multicore prostate biopsy.

    Directory of Open Access Journals (Sweden)

    Sangchul Lee

    Full Text Available There is little data about the clinical value of core length for prostate biopsy (PBx). We investigated the clinical values of various clinicopathological biopsy-related parameters, including core length, in contemporary multi-core PBx. Medical records of 5,243 consecutive patients who received PBx at our institution were reviewed. Among them, 3,479 patients with a prostate-specific antigen (PSA) level ≤ 10 ng/ml who received transrectal ultrasound (TRUS)-guided multi (≥ 12)-core PBx at our institution were analyzed for prostate cancer (PCa). Gleason score upgrading (GSU) was analyzed in 339 patients who were diagnosed with low-risk PCa and received radical prostatectomy. Multivariate logistic regression analyses for PCa detection and prediction of GSU were performed. The mean age and PSA of the entire cohort were 63.5 years and 5.4 ng/ml, respectively. The overall cancer detection rate was 28.5%. There was no statistical difference in core length between patients diagnosed with PCa and those without PCa (16.1 ± 1.8 vs 16.1 ± 1.9 mm, P = 0.945). The core length was also not significantly different (16.4 ± 1.7 vs 16.4 ± 1.6 mm, P = 0.889) between the GSU and non-GSU groups. Multivariate logistic regression analyses demonstrated that the core length of PBx did not affect PCa detection in TRUS-guided multi-core PBx (P = 0.923) and was not prognostic for GSU in patients with low-risk PCa (P = 0.356). In patients undergoing contemporary multi-core PBx, core length may not have a significant impact on PCa detection or on GSU following radical prostatectomy in the low-risk PCa group.

  1. Light dynamics in nonlinear trimers and twisted multicore fibers

    CERN Document Server

    Castro-Castro, Claudia; Srinivasan, Gowri; Aceves, Alejandro B; Kevrekidis, Panayotis G

    2016-01-01

    Novel photonic structures such as multi-core fibers and graphene based arrays present unique opportunities to manipulate and control the propagation of light. Here we discuss nonlinear dynamics for structures with a few (2 to 6) elements for which linear and nonlinear properties can be tuned. Specifically we show how nonlinearity, coupling, and parity-time PT symmetric gain/loss relate to existence, stability and in general, dynamical properties of nonlinear optical modes. The main emphasis of our presentation will be on systems with few degrees of freedom, most notably couplers, trimers and generalizations thereof to systems with 6 nodes.

  2. Coupling losses in perfluorinated multi-core polymer optical fibers.

    Science.gov (United States)

    Durana, Gaizka; Aldabaldetreku, Gotzon; Zubia, Joseba; Arrue, Jon; Tanaka, Chikafumi

    2008-05-26

    The aim of the present paper is to provide a comprehensive analysis of coupling losses in perfluorinated (PF) multi-core polymer optical fibers (MC-POFs), which consist of groups of 127 graded-index cores. In our analysis we take into account geometrical, longitudinal, transverse, and angular misalignments. We perform several experimental measurements and computer simulations in order to calculate the coupling losses for a PF MC-POF prototype. Based on these results, we propose several hints of practical interest to the manufacturer which would allow an appropriate connector design in order to handle conveniently the coupling losses incurred when connectorizing two PF MC-POFs.

  3. Communications systems and methods for subsea processors

    Science.gov (United States)

    Gutierrez, Jose; Pereira, Luis

    2016-04-26

    A subsea processor may be located near the seabed of a drilling site and used to coordinate operations of underwater drilling components. The subsea processor may be enclosed in a single interchangeable unit that fits a receptor on an underwater drilling component, such as a blow-out preventer (BOP). The subsea processor may issue commands to control the BOP and receive measurements from sensors located throughout the BOP. A shared communications bus may interconnect the subsea processor and underwater components and the subsea processor and a surface or onshore network. The shared communications bus may be operated according to a time division multiple access (TDMA) scheme.
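
    The TDMA scheme mentioned above can be sketched as a simple slot assignment: each bus member may transmit only during its own slot of the repeating frame, so transmissions never collide. The node names and frame length below are hypothetical; this is a minimal Python sketch, not the patented protocol.

```python
def build_schedule(nodes):
    """Assign each node one slot per TDMA frame, in listing order."""
    return {node: slot for slot, node in enumerate(nodes)}

def may_transmit(node, time, schedule):
    """A node may transmit only when the frame's current slot is its own."""
    return schedule[node] == time % len(schedule)

# Hypothetical members of the shared communications bus.
nodes = ["subsea-processor", "bop-sensors", "surface-link"]
schedule = build_schedule(nodes)

# At every instant exactly one node owns the slot, so transmissions cannot collide.
for t in range(6):
    active = [n for n in nodes if may_transmit(n, t, schedule)]
    assert active == [nodes[t % 3]]
```
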

  4. Invasive tightly coupled processor arrays

    CERN Document Server

    LARI, VAHID

    2016-01-01

    This book introduces new massively parallel computer (MPSoC) architectures called invasive tightly coupled processor arrays (TCPAs). It proposes strategies, architecture designs, and programming interfaces for invasive TCPAs that allow invading and subsequently executing loop programs with strict requirements or guarantees of non-functional execution qualities such as performance, power consumption, and reliability. For the first time, such a configurable processor array architecture consisting of locally interconnected VLIW processing elements can be claimed by programs, either in full or in part, using the principle of invasive computing. Invasive TCPAs provide unprecedented energy efficiency for the parallel execution of nested loop programs by avoiding the global memory accesses required on GPUs, and may even support loops with complex dependencies, such as loop-carried dependencies, that are not amenable to parallel execution on GPUs. For this purpose, the book proposes different invasion strategies for claiming a desire...

  5. Application of SystemC modeling in multi-core processor design

    Institute of Scientific and Technical Information of China (English)

    许汉荆; 陈杰; 刘建; 敖天勇; 奚杰

    2009-01-01

    "Costar IV" ("同芯Ⅳ") is a heterogeneous multi-core processor developed by the CoMMSOC Lab of the Institute of Microelectronics, Chinese Academy of Sciences (IMECAS). This paper describes the successful application of the Electronic System Level (ESL) design methodology to the SOC design of this multi-core processor: the key MIPS processor unit was modeled with SystemC, and hardware/software co-design and verification were carried out with tools such as Visual Studio and ModelSim. Practice has shown that SystemC modeling effectively increases the parallelism of development, shortens the development cycle, provides detailed reference data for verification and performance optimization, and simplifies debugging.

  6. An Experimental Digital Image Processor

    Science.gov (United States)

    Cok, Ronald S.

    1986-12-01

    A prototype digital image processor for enhancing photographic images has been built in the Research Laboratories at Kodak. This image processor implements a particular version of each of the following algorithms: photographic grain and noise removal, edge sharpening, multidimensional image-segmentation, image-tone reproduction adjustment, and image-color saturation adjustment. All processing, except for segmentation and analysis, is performed by massively parallel and pipelined special-purpose hardware. This hardware runs at 10 MHz and can be adjusted to handle any size digital image. The segmentation circuits run at 30 MHz. The segmentation data are used by three single-board computers for calculating the tonescale adjustment curves. The system, as a whole, has the capability of completely processing 10 million three-color pixels per second. The grain removal and edge enhancement algorithms represent the largest part of the pipelined hardware, operating at over 8 billion integer operations per second. The edge enhancement is performed by unsharp masking, and the grain removal is done using a collapsed Walsh-Hadamard transform filtering technique (U.S. Patent No. 4549212). These two algorithms can be realized using four basic processing elements, some of which have been implemented as VLSI semicustom integrated circuits. These circuits implement the algorithms with a high degree of efficiency, modularity, and testability. The digital processor is controlled by a Digital Equipment Corporation (DEC) PDP 11 minicomputer and can be interfaced to electronic printing and/or electronic scanning devices. The processor has been used to process over a thousand diagnostic images.
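
    The unsharp-masking step named in the record follows out = in + gain * (in - blur(in)): the low-pass residual is added back to steepen edges. A minimal 1-D Python sketch; the 3-tap box blur and the gain value are invented for illustration, not the Kodak hardware's filter.

```python
def box_blur(signal):
    """3-tap box blur with edge clamping -- a stand-in for the low-pass stage."""
    n = len(signal)
    return [(signal[max(i - 1, 0)] + signal[i] + signal[min(i + 1, n - 1)]) / 3
            for i in range(n)]

def unsharp_mask(signal, gain=1.0):
    """Sharpen by adding back the high-frequency residual: out = in + gain*(in - blur(in))."""
    return [s + gain * (s - b) for s, b in zip(signal, box_blur(signal))]

edge = [0, 0, 0, 10, 10, 10]   # an ideal edge
sharpened = unsharp_mask(edge)
# The edge is steepened, with the characteristic under/overshoot on either side.
print([round(v, 2) for v in sharpened])  # -> [0.0, 0.0, -3.33, 13.33, 10.0, 10.0]
```
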

  7. Parallel Computation on Multicore Processors Using Explicit Form of the Finite Element Method and C++ Standard Libraries

    Directory of Open Access Journals (Sweden)

    Rek Václav

    2016-11-01

    Full Text Available In this paper, modifications of existing sequential code written in the C or C++ programming language for the calculation of various kinds of structures using the explicit form of the Finite Element Method (Dynamic Relaxation Method, Explicit Dynamics) in the NEXX system are introduced. The NEXX system is the core of the engineering software NEXIS, Scia Engineer, RFEM and RENEX. It supports multithreaded execution, which can now be implemented at the level of the native C++ programming language using standard libraries. Thanks to the high degree of abstraction that the contemporary C++ programming language provides, a library created in this way can be generalized for other uses of parallelism in computational mechanics.

  8. OpenCL Performance Evaluation on Modern Multicore CPUs

    Directory of Open Access Journals (Sweden)

    Joo Hwan Lee

    2015-01-01

    Full Text Available Utilizing heterogeneous platforms for computation has become a general trend, making the portability issue important. OpenCL (Open Computing Language serves this purpose by enabling portable execution on heterogeneous architectures. However, unpredictable performance variation on different platforms has become a burden for programmers who write OpenCL applications. This is especially true for conventional multicore CPUs, since the performance of general OpenCL applications on CPUs lags behind the performance of their counterparts written in the conventional parallel programming model for CPUs. In this paper, we evaluate the performance of OpenCL applications on out-of-order multicore CPUs from the architectural perspective. We evaluate OpenCL applications on various aspects, including API overhead, scheduling overhead, instruction-level parallelism, address space, data location, data locality, and vectorization, comparing OpenCL to conventional parallel programming models for CPUs. Our evaluation indicates unique performance characteristics of OpenCL applications and also provides insight into the optimization metrics for better performance on CPUs.

  9. Neural networks within multi-core optic fibers.

    Science.gov (United States)

    Cohen, Eyal; Malka, Dror; Shemer, Amir; Shahmoon, Asaf; Zalevsky, Zeev; London, Michael

    2016-07-07

    Hardware implementation of artificial neural networks facilitates real-time parallel processing of massive data sets. Optical neural networks offer low-volume 3D connectivity together with large bandwidth and minimal heat production in contrast to electronic implementation. Here, we present a conceptual design for in-fiber optical neural networks. Neurons and synapses are realized as individual silica cores in a multi-core fiber. Optical signals are transferred transversely between cores by means of optical coupling. Pump driven amplification in erbium-doped cores mimics synaptic interactions. We simulated three-layered feed-forward neural networks and explored their capabilities. Simulations suggest that networks can differentiate between given inputs depending on specific configurations of amplification; this implies classification and learning capabilities. Finally, we tested experimentally our basic neuronal elements using fibers, couplers, and amplifiers, and demonstrated that this configuration implements a neuron-like function. Therefore, devices similar to our proposed multi-core fiber could potentially serve as building blocks for future large-scale small-volume optical artificial neural networks.
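
    The simulated feed-forward behavior can be sketched by treating pump-driven amplification as multiplicative gains feeding a saturating response. This is a hedged Python sketch only: the gain configurations are invented, and tanh stands in for the physical response of an erbium-doped core; the point is that the same input yields different outputs under different amplification patterns, which is the classification capability the authors describe.

```python
import math

def layer(inputs, gains):
    """Each output core sums amplified contributions from all input cores,
    then passes through a saturating response (here tanh)."""
    return [math.tanh(sum(g * s for g, s in zip(row, inputs))) for row in gains]

def feed_forward(inputs, gains_per_layer):
    """Propagate signals through successive layers of coupled, amplifying cores."""
    for gains in gains_per_layer:
        inputs = layer(inputs, gains)
    return inputs

# Two hypothetical pump/gain configurations; the same input yields outputs of
# opposite sign, i.e. the amplification pattern determines the classification.
config_a = [[[2.0, 0.0], [0.0, 2.0]], [[3.0, -3.0]]]
config_b = [[[0.0, 2.0], [2.0, 0.0]], [[3.0, -3.0]]]
x = [1.0, 0.0]
print(feed_forward(x, config_a)[0], feed_forward(x, config_b)[0])
```
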

  10. Neural networks within multi-core optic fibers

    Science.gov (United States)

    Cohen, Eyal; Malka, Dror; Shemer, Amir; Shahmoon, Asaf; Zalevsky, Zeev; London, Michael

    2016-07-01

    Hardware implementation of artificial neural networks facilitates real-time parallel processing of massive data sets. Optical neural networks offer low-volume 3D connectivity together with large bandwidth and minimal heat production in contrast to electronic implementation. Here, we present a conceptual design for in-fiber optical neural networks. Neurons and synapses are realized as individual silica cores in a multi-core fiber. Optical signals are transferred transversely between cores by means of optical coupling. Pump driven amplification in erbium-doped cores mimics synaptic interactions. We simulated three-layered feed-forward neural networks and explored their capabilities. Simulations suggest that networks can differentiate between given inputs depending on specific configurations of amplification; this implies classification and learning capabilities. Finally, we tested experimentally our basic neuronal elements using fibers, couplers, and amplifiers, and demonstrated that this configuration implements a neuron-like function. Therefore, devices similar to our proposed multi-core fiber could potentially serve as building blocks for future large-scale small-volume optical artificial neural networks.

  11. ASIC Design of Floating-Point FFT Processor

    Institute of Scientific and Technical Information of China (English)

    陈禾; 赵忠武

    2004-01-01

    An application specific integrated circuit (ASIC) design of a 1024-point floating-point fast Fourier transform (FFT) processor is presented. It satisfies the requirement for high-accuracy FFT results in related fields. Several novel design techniques for the floating-point adder and multiplier are introduced in detail to enhance the speed of the system while decreasing power consumption. The hardware area is effectively reduced as an improved butterfly processor is developed. There is a substantial increase in the performance of the design since a pipelined architecture is adopted, and very large scale integration (VLSI) implementation is easy to realize due to the design's regularity. A validation result using a field programmable gate array (FPGA) is shown at the end. When the system clock is set to 50 MHz, 204.8 μs is needed to complete the FFT computation.
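
    The butterfly at the core of such an FFT processor combines each pair of half-size results with a twiddle factor. The fixed-point/floating-point hardware details are not modeled here; this is a minimal radix-2 decimation-in-time sketch in Python, cross-checked against a direct DFT.

```python
import cmath

def fft(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle           # butterfly upper output
        out[k + n // 2] = even[k] - twiddle  # butterfly lower output
    return out

def dft(x):
    """Direct O(n^2) DFT for cross-checking the fast algorithm."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

x = [1, 2, 3, 4, 0, 0, 0, 0]
assert all(abs(a - b) < 1e-9 for a, b in zip(fft(x), dft(x)))
```

    The 1024-point hardware unrolls exactly these butterflies into a pipeline instead of recursing.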

  12. Parallel information transfer in a multinode quantum information processor.

    Science.gov (United States)

    Borneman, T W; Granade, C E; Cory, D G

    2012-04-06

    We describe a method for coupling disjoint quantum bits (qubits) in different local processing nodes of a distributed node quantum information processor. An effective channel for information transfer between nodes is obtained by moving the system into an interaction frame where all pairs of cross-node qubits are effectively coupled via an exchange interaction between actuator elements of each node. All control is achieved via actuator-only modulation, leading to fast implementations of a universal set of internode quantum gates. The method is expected to be nearly independent of actuator decoherence and may be made insensitive to experimental variations of system parameters by appropriate design of control sequences. We show, in particular, how the induced cross-node coupling channel may be used to swap the complete quantum states of the local processors in parallel.

  13. Parallel computation of a dam-break flow model using OpenMP on a multi-core computer

    Science.gov (United States)

    Zhang, Shanghong; Xia, Zhongxi; Yuan, Rui; Jiang, Xiaoming

    2014-05-01

    High-performance calculations are of great importance to the simulation of dam-break events, as discontinuous solutions and computational speed are key factors in dam-break flow modeling. In this study, Roe's approximate Riemann solution of the finite volume method is adopted to solve the interface flux of grid cells and accurately simulate the discontinuous flow, and shared memory technology (OpenMP) is used to realize parallel computing. Because an explicit discrete technique is used to solve the governing equations, and there is no correlation between grid calculations within a single time step, the parallel dam-break model can be easily realized by adding OpenMP directives to the loop structure of the grid calculations. The performance of the model is analyzed using six computing cores and four different grid division schemes for the Pangtoupao flood storage area in China. The results show that the parallel computing improves precision and increases the simulation speed of the dam-break flow; the simulation of a 320 h flood process can be completed within 1.6 h on a 16-core computer, achieving a speedup factor of 8.64×. Further analysis reveals that the models involving a larger number of calculations exhibit greater efficiency and a higher rate of acceleration. At the same time, the model has good extendibility, as the speedup increases with the number of processor cores. The parallel model based on OpenMP can make full use of multi-core processors, making it possible to simulate dam-break flows in large-scale watersheds on a single computer.
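
    The key property exploited above is that an explicit scheme has no dependence between cell updates within a time step, so the per-cell loop parallelizes directly. A toy sketch of that independence: the parallel and serial loops give identical results. Python threads stand in for OpenMP here, and the diffusion-like update rule is invented for illustration (it is not the Roe flux).

```python
from concurrent.futures import ThreadPoolExecutor

def step_cell(i, h):
    """Update of one cell from the *current* state only; cells within a time step
    are independent, which is what makes the loop parallelizable."""
    left, right = h[max(i - 1, 0)], h[min(i + 1, len(h) - 1)]
    return h[i] + 0.25 * (left - 2.0 * h[i] + right)

def step_serial(h):
    return [step_cell(i, h) for i in range(len(h))]

def step_parallel(h, workers=4):
    """Same update distributed over a thread pool, one task per cell."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda i: step_cell(i, h), range(len(h))))

h = [0.0] * 8 + [1.0] * 8   # toy initial water-depth step, as after a dam break
assert step_parallel(h) == step_serial(h)   # same result for any thread count
```

    In the paper's C/Fortran setting, the same structure is obtained by placing an OpenMP worksharing directive on the cell loop.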

  14. A large effective area multi-core fiber with an optimized cladding thickness

    OpenAIRE

    2011-01-01

    The cladding thickness of trench-assisted multi-core fibers was theoretically and experimentally investigated in terms of excess losses of outer cores. No significant micro-bending loss increase was observed on multi-core fibers with the cladding thickness of about 30 μm. The tolerance for the micro-bending loss of a multi-core fiber is larger than that of the single core fiber. However, the cladding thickness will be limited by the occurrence of the excess loss on outer cores. The reduction ...

  15. A Methodology for Optimizing Multithreaded System Scalability on Multi-cores

    CERN Document Server

    Gunther, Neil J; Parvu, Stefan

    2011-01-01

    We show how to quantify scalability with the Universal Scalability Law (USL) by applying it to performance measurements of memcached, J2EE, and Weblogic on multi-core platforms. Since commercial multicores are essentially black-boxes, the accessible performance gains are primarily available at the application level. We also demonstrate how our methodology can identify the most significant performance tuning opportunities to optimize application scalability, as well as providing an easy means for exploring other aspects of the multi-core system design space.
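
    The USL itself is a one-line formula: C(N) = N / (1 + σ(N−1) + κN(N−1)), where σ captures contention and κ coherency delay. A sketch with hypothetical coefficients (as might be fitted to throughput measurements), showing the capacity peak the methodology looks for:

```python
def usl_capacity(n, sigma, kappa):
    """Universal Scalability Law: relative capacity
    C(N) = N / (1 + sigma*(N-1) + kappa*N*(N-1)),
    where sigma models contention and kappa models coherency delay."""
    return n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

# Hypothetical coefficients, standing in for values fitted to measurements.
sigma, kappa = 0.05, 0.002

# Beyond some concurrency the coherency term dominates and capacity declines.
best_n = max(range(1, 65), key=lambda n: usl_capacity(n, sigma, kappa))
print(best_n)  # -> 22, near the analytic optimum sqrt((1 - sigma) / kappa)
```

    Fitting σ and κ to measured throughput, then reading off this peak, is what identifies the most significant tuning opportunities.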

  16. On Designing Multicore-Aware Simulators for Systems Biology Endowed with OnLine Statistics

    Directory of Open Access Journals (Sweden)

    Marco Aldinucci

    2014-01-01

    Full Text Available This paper discusses enabling methodologies for the design of a fully parallel, online, interactive tool aimed at supporting bioinformatics scientists. In particular, the features of these methodologies, supported by the FastFlow parallel programming framework, are shown on a simulation tool that performs the modeling, tuning, and sensitivity analysis of stochastic biological models. A stochastic simulation needs thousands of independent simulation trajectories, which turn into big data that should be analysed by statistical and data mining tools. In the considered approach the two stages are pipelined in such a way that the simulation stage streams out the partial results of all simulation trajectories to the analysis stage, which immediately produces a partial result. The simulation-analysis workflow is validated for performance and effectiveness of the online analysis in capturing biological systems behavior on a multicore platform and representative proof-of-concept biological systems. The exploited methodologies include pattern-based parallel programming and data streaming, which provide key features to software designers such as performance portability and efficient in-memory (big) data management and movement. Two paradigmatic classes of biological systems exhibiting multistable and oscillatory behavior are used as a testbed.
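
    The pipelined simulation-to-analysis idea can be sketched with a generator feeding an online statistic, so each trajectory's result is folded into the analysis as it arrives rather than stored. Python generators and Welford's online algorithm stand in for the FastFlow stages here; the Gaussian observable is invented for illustration.

```python
import random

def simulate(n_trajectories, seed=0):
    """Stage 1: stream one summary value per stochastic trajectory as it completes."""
    rng = random.Random(seed)
    for _ in range(n_trajectories):
        yield rng.gauss(10.0, 2.0)   # stand-in for one trajectory's observable

def online_stats(stream):
    """Stage 2: Welford's online algorithm -- mean/variance updated per arrival,
    so no trajectory needs to be kept in memory."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    return mean, m2 / (count - 1)

mean, var = online_stats(simulate(10_000))
print(round(mean, 1), round(var, 1))   # close to the true values 10.0 and 4.0
```

    In FastFlow the two stages would run concurrently on different cores; the streaming structure is the same.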

  17. On designing multicore-aware simulators for systems biology endowed with OnLine statistics.

    Science.gov (United States)

    Aldinucci, Marco; Calcagno, Cristina; Coppo, Mario; Damiani, Ferruccio; Drocco, Maurizio; Sciacca, Eva; Spinella, Salvatore; Torquati, Massimo; Troina, Angelo

    2014-01-01

    This paper discusses enabling methodologies for the design of a fully parallel, online, interactive tool aimed at supporting bioinformatics scientists. In particular, the features of these methodologies, supported by the FastFlow parallel programming framework, are shown on a simulation tool that performs the modeling, tuning, and sensitivity analysis of stochastic biological models. A stochastic simulation needs thousands of independent simulation trajectories, which turn into big data that should be analysed by statistical and data mining tools. In the considered approach the two stages are pipelined in such a way that the simulation stage streams out the partial results of all simulation trajectories to the analysis stage, which immediately produces a partial result. The simulation-analysis workflow is validated for performance and effectiveness of the online analysis in capturing biological systems behavior on a multicore platform and representative proof-of-concept biological systems. The exploited methodologies include pattern-based parallel programming and data streaming, which provide key features to software designers such as performance portability and efficient in-memory (big) data management and movement. Two paradigmatic classes of biological systems exhibiting multistable and oscillatory behavior are used as a testbed.

  18. Fast semivariogram computation using FPGA architectures

    Science.gov (United States)

    Lagadapati, Yamuna; Shirvaikar, Mukul; Dong, Xuanliang

    2015-02-01

    Computational speedup is measured with respect to a Matlab implementation on a personal computer with an Intel i7 multi-core processor. Preliminary simulation results indicate that a significant speed advantage can be attained by the FPGA architectures, making the algorithm viable for implementation in medical devices.
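
    The empirical semivariogram such architectures accelerate is γ(h) = (1 / 2N(h)) Σ (z_i − z_{i+h})², averaged over the N(h) sample pairs at lag h. A reference Python sketch on an invented 1-D transect (the FPGA pipelining itself is not modeled):

```python
def semivariance(z, lag):
    """Empirical semivariogram at integer lag h:
    gamma(h) = (1 / (2 * N(h))) * sum of (z[i] - z[i+h])**2 over the N(h) pairs."""
    diffs = [(z[i] - z[i + lag]) ** 2 for i in range(len(z) - lag)]
    return sum(diffs) / (2 * len(diffs))

# Hypothetical 1-D transect of pixel values; nearby samples are more alike,
# so semivariance grows with lag.
z = [2.0, 2.1, 2.4, 3.0, 3.1, 3.6, 4.0, 4.2]
for h in (1, 2, 3):
    print(h, round(semivariance(z, h), 3))
```

    The pair differences at a given lag are mutually independent, which is exactly what makes the computation amenable to the parallel FPGA pipelines the paper describes.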

  19. Functional Verification of Enhanced RISC Processor

    OpenAIRE

    SHANKER NILANGI; SOWMYA L

    2013-01-01

    This paper presents the design and verification of a 32-bit enhanced RISC processor core with floating-point computations integrated within the core, designed to reduce cost and complexity. The 3-stage pipelined 32-bit RISC processor is based on the ARM7 processor architecture, with a single-precision floating-point multiplier and a floating-point adder/subtractor for floating-point operations, and a 32 x 32 Booth multiplier added to the integer core of the ARM7. The binary representati...

  20. Digital Signal Processor For GPS Receivers

    Science.gov (United States)

    Thomas, J. B.; Meehan, T. K.; Srinivasan, J. M.

    1989-01-01

    Three innovative components combined to produce all-digital signal processor with superior characteristics: outstanding accuracy, high-dynamics tracking, versatile integration times, lower loss-of-lock signal strengths, and infrequent cycle slips. Three components are digital chip advancer, digital carrier downconverter and code correlator, and digital tracking processor. All-digital signal processor intended for use in receivers of Global Positioning System (GPS) for geodesy, geodynamics, high-dynamics tracking, and ionospheric calibration.

  1. Design Principles for Synthesizable Processor Cores

    DEFF Research Database (Denmark)

    Schleuniger, Pascal; McKee, Sally A.; Karlsson, Sven

    2012-01-01

    As FPGAs get more competitive, synthesizable processor cores become an attractive choice for embedded computing. Currently popular commercial processor cores do not fully exploit current FPGA architectures. In this paper, we propose general design principles to increase instruction throughput... We show through the use of micro-benchmarks that our principles guide the design of a processor core that improves performance by an average of 38% over a similar Xilinx MicroBlaze configuration.

  2. Multicore: Fallout From a Computing Evolution (LBNL Summer Lecture Series)

    Energy Technology Data Exchange (ETDEWEB)

    Yelick, Kathy [Director, NERSC]

    2008-07-22

    Summer Lecture Series 2008: Parallel computing used to be reserved for big science and engineering projects, but in two years that's all changed. Even laptops and hand-helds use parallel processors. Unfortunately, the software hasn't kept pace. Kathy Yelick, Director of the National Energy Research Scientific Computing Center at Berkeley Lab, describes the resulting chaos and the computing community's efforts to develop exciting applications that take advantage of tens or hundreds of processors on a single chip.

  3. The case for a generic implant processor.

    Science.gov (United States)

    Strydis, Christos; Gaydadjiev, Georgi N

    2008-01-01

    A more structured and streamlined design of implants is nowadays possible. In this paper we focus on implant processors located at the heart of implantable systems. We present a real and representative biomedical-application scenario where such a new processor can be employed. Based on a suitably selected processor simulator, various operational aspects of the application are monitored. Findings on performance, cache behavior, branch prediction, power consumption, energy expenditure and instruction mixes are presented and analyzed. The suitability of such an implant processor is assessed and directions for future work are given.

  4. Compilation Techniques Specific for a Hardware Cryptography-Embedded Multimedia Mobile Processor

    Directory of Open Access Journals (Sweden)

    Masa-aki FUKASE

    2007-12-01

    Full Text Available The development of single chip VLSI processors is the key technology of ever growing pervasive computing to answer overall demands for usability, mobility, speed, security, etc. We have so far developed a hardware cryptography-embedded multimedia mobile processor architecture, HCgorilla. Since HCgorilla integrates a wide range of techniques from architectures to applications and languages, a one-sided design approach is not always useful. HCgorilla needs a more complicated strategy, that is, hardware/software (H/S) codesign. Thus, we exploit the software support of HCgorilla, composed of a Java interface and parallelizing compilers. They are assumed to be installed in servers in order to reduce the load and increase the performance of HCgorilla-embedded clients. Since compilers are the essence of software's responsibility, we focus in this article on our recent results on the design, specifications, and prototyping of parallelizing compilers for HCgorilla. The parallelizing compilers are composed of a multicore compiler and a LIW compiler. They are specified to abstract parallelism from executable serial codes or the Java interface output, and to output codes executable in parallel by HCgorilla. The prototype compilers are written in Java. An evaluation using an arithmetic test program shows the reasonableness of the prototype compilers compared with hand compilation.

  6. Development and validation of a two-dimensional fast-response flood estimation model

    Energy Technology Data Exchange (ETDEWEB)

    Judi, David R [Los Alamos National Laboratory]; Mcpherson, Timothy N [Los Alamos National Laboratory]; Burian, Steven J [UNIV OF UTAH]

    2009-01-01

    A finite difference formulation of the shallow water equations using an upwind differencing method was developed, maintaining computational efficiency and accuracy such that it can be used as a fast-response flood estimation tool. The model was validated using both laboratory controlled experiments and an actual dam breach. Through the laboratory experiments, the model was shown to give good estimations of depth and velocity when compared to the measured data, as well as when compared to a more complex two-dimensional model. Additionally, the model was compared to high water mark data obtained from the failure of the Taum Sauk dam. The simulated inundation extent agreed well with the observed extent, with the most notable differences resulting from the inability to model sediment transport. The results of these validation studies show that a relatively simple numerical scheme used to solve the complete shallow water equations can be used to accurately estimate flood inundation. Future work will focus on further reducing the computation time needed to provide flood inundation estimates for fast-response analyses. This will be accomplished through the efficient use of multi-core, multi-processor computers coupled with an efficient domain-tracking algorithm, as well as an understanding of the impacts of grid resolution on model results.
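
    The upwind-differencing idea can be illustrated on the simplest hyperbolic model, 1-D linear advection, rather than the full shallow water system: each cell takes its information from the upstream side only. The Courant number and step profile below are invented for the demo.

```python
def upwind_step(u, speed, dx, dt):
    """One explicit first-order upwind step for the advection equation
    u_t + speed * u_x = 0 (speed > 0); stable for Courant number c <= 1."""
    c = speed * dt / dx
    return [u[0]] + [u[i] - c * (u[i] - u[i - 1]) for i in range(1, len(u))]

u = [1.0] * 5 + [0.0] * 10   # step profile, like an advancing flood front
for _ in range(5):
    u = upwind_step(u, speed=1.0, dx=1.0, dt=1.0)
print(u)  # with c = 1 the scheme shifts the front exactly 5 cells downstream
```

    Taking the difference from the upstream neighbor is what lets the scheme propagate a discontinuous front without spurious oscillations, the same property needed for flood estimation.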

  7. Alternative Water Processor Test Development

    Science.gov (United States)

    Pickering, Karen D.; Mitchell, Julie; Vega, Leticia; Adam, Niklas; Flynn, Michael; Wheeler, Ray; Lunn, Griffin; Jackson, Andrew

    2012-01-01

    The Next Generation Life Support Project is developing an Alternative Water Processor (AWP) as a candidate water recovery system for long duration exploration missions. The AWP consists of a biological water processor (BWP) integrated with a forward osmosis secondary treatment system (FOST). The basis of the BWP is a membrane aerated biological reactor (MABR), developed in concert with Texas Tech University. Bacteria located within the MABR metabolize organic material in wastewater, converting approximately 90% of the total organic carbon to carbon dioxide. In addition, bacteria convert a portion of the ammonia-nitrogen present in the wastewater to nitrogen gas, through a combination of nitrification and denitrification. The effluent from the BWP system is low in organic contaminants, but high in total dissolved solids. The FOST system, integrated downstream of the BWP, removes dissolved solids through a combination of concentration-driven forward osmosis and pressure-driven reverse osmosis. The integrated system is expected to produce water with a total organic carbon less than 50 mg/l and dissolved solids that meet potable water requirements for spaceflight. This paper describes the test definition, the design of the BWP and FOST subsystems, and plans for integrated testing.
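
    The stated numbers imply a simple mass balance: with roughly 90% of total organic carbon converted to carbon dioxide in the BWP, effluent TOC is about one tenth of influent TOC. A small sketch of that check; the influent value is hypothetical.

```python
def bwp_effluent_toc(influent_toc, conversion=0.90):
    """TOC (mg/l) remaining after the BWP converts `conversion` of the
    organic carbon to carbon dioxide."""
    return influent_toc * (1.0 - conversion)

influent = 400.0   # hypothetical wastewater TOC, mg/l
effluent = bwp_effluent_toc(influent)
print(round(effluent, 1), effluent < 50.0)  # ~40 mg/l, under the 50 mg/l target
```
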

  8. Alternative Water Processor Test Development

    Science.gov (United States)

    Pickering, Karen D.; Mitchell, Julie L.; Adam, Niklas M.; Barta, Daniel; Meyer, Caitlin E.; Pensinger, Stuart; Vega, Leticia M.; Callahan, Michael R.; Flynn, Michael; Wheeler, Ray; hide

    2013-01-01

    The Next Generation Life Support Project is developing an Alternative Water Processor (AWP) as a candidate water recovery system for long duration exploration missions. The AWP consists of a biological water processor (BWP) integrated with a forward osmosis secondary treatment system (FOST). The basis of the BWP is a membrane aerated biological reactor (MABR), developed in concert with Texas Tech University. Bacteria located within the MABR metabolize organic material in wastewater, converting approximately 90% of the total organic carbon to carbon dioxide. In addition, bacteria convert a portion of the ammonia-nitrogen present in the wastewater to nitrogen gas, through a combination of nitrification and denitrification. The effluent from the BWP system is low in organic contaminants, but high in total dissolved solids. The FOST system, integrated downstream of the BWP, removes dissolved solids through a combination of concentration-driven forward osmosis and pressure-driven reverse osmosis. The integrated system is expected to produce water with a total organic carbon less than 50 mg/l and dissolved solids that meet potable water requirements for spaceflight. This paper describes the test definition, the design of the BWP and FOST subsystems, and plans for integrated testing.

  9. Automata-based Optimization of Interaction Protocols for Scalable Multicore Platforms (Technical Report)

    NARCIS (Netherlands)

    Jongmans, S.-S.T.Q.; Halle, S.; Arbab, F.

    2014-01-01

    Multicore platforms offer the opportunity for utilizing massively parallel resources. However, programming them is challenging. We need good compilers that optimize commonly occurring synchronization/interaction patterns. To facilitate optimization, a programming language must convey what needs to be…

  10. A large effective area multi-core fiber with an optimized cladding thickness.

    Science.gov (United States)

    Takenaga, Katsuhiro; Arakawa, Yoko; Sasaki, Yusuke; Tanigawa, Shoji; Matsuo, Shoichiro; Saitoh, Kunimasa; Koshiba, Masanori

    2011-12-12

    The cladding thickness of trench-assisted multi-core fibers was theoretically and experimentally investigated in terms of excess losses of the outer cores. No significant micro-bending loss increase was observed on multi-core fibers with a cladding thickness of about 30 µm. The tolerance for micro-bending loss of a multi-core fiber is larger than that of a single-core fiber. However, the cladding thickness will be limited by the occurrence of excess loss on the outer cores. The reduction of cladding thickness is probably limited to around 40 µm in terms of the excess loss. A multi-core fiber with an effective area of 110 µm² at 1.55 µm and a 181-µm cladding diameter was realized without any excess loss.

  11. Fast Low Power ADC with Integrated Digital Data Processor Project

    Data.gov (United States)

    National Aeronautics and Space Administration — Innovative data measurement/acquisition systems are needed to support future Earth System Science measurements of the Earth's atmosphere and surface. An adequate...

  12. Efficient Thread Mapping for Heterogeneous Multicore IoT Systems

    Directory of Open Access Journals (Sweden)

    Thomas Mezmur Birhanu

    2017-01-01

    Full Text Available This paper proposes a thread scheduling mechanism primed for heterogeneously configured multicore systems. Our approach considers CPU utilization when mapping running threads to the core that can deliver the needed capacity. The paper also introduces a mapping algorithm that maps threads to cores in O(N log M) time, where N is the number of cores and M is the number of core types. We also introduce a method of profiling heterogeneous architectures based on the discrepancy between the performances of individual cores. Our heterogeneity-aware scheduler sped up processing by 52.62% and saved 2.22% of power compared to the CFS scheduler, the default in Linux systems.
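The abstract does not give the mapping algorithm itself; the sketch below shows one plausible shape, where a binary search over the M sorted core types supplies the log M factor in the stated complexity. The greedy rule, names, and capacity values are all hypothetical illustrations, not the paper's algorithm.

```python
import bisect

def map_threads(threads, core_types):
    """Map each thread (CPU demand in [0, 1]) to the least-capable core
    type that can still satisfy it. The per-thread binary search over
    the sorted core types costs O(log M)."""
    caps = sorted(core_types)                      # relative core capacities
    mapping = {}
    for tid, demand in threads.items():
        i = bisect.bisect_left(caps, demand)       # first type with cap >= demand
        mapping[tid] = caps[min(i, len(caps) - 1)] # fall back to the biggest core
    return mapping

mapping = map_threads({"t1": 0.2, "t2": 0.7, "t3": 0.95},
                      core_types=[0.25, 0.5, 1.0])
```

Here the light thread lands on the small core type while the heavier two are assigned to the big cores, avoiding over-provisioning.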

  13. I/O Strategies for Multicore Processing in ATLAS

    CERN Document Server

    van Gemmeren, P; The ATLAS collaboration; Calafiura, P; Lavrijsen, W; Malon, D; Tsulaia, V

    2012-01-01

    A critical component of any multicore/manycore application architecture is the handling of input and output. Even in the simplest of models, design decisions interact both in obvious and in subtle ways with persistence strategies. When multiple workers handle I/O independently using distinct instances of a serial I/O framework, for example, it may happen that because of the way data from consecutive events are compressed together, there may be serious inefficiencies, with workers redundantly reading the same buffers, or multiple instances thereof. With shared reader strategies, caching and buffer management by the persistence infrastructure and by the control framework may have decisive performance implications for a variety of design choices. Providing the next event may seem straightforward when all event data are contiguously stored in a block, but there may be performance penalties to such strategies when only a subset of a given event's data are needed; conversely, when event data are partitioned by type...

  14. Comparing and Optimising Parallel Haskell Implementations for Multicore Machines

    DEFF Research Database (Denmark)

    Berthold, Jost; Marlow, Simon; Hammond, Kevin

    2009-01-01

    The GpH implementation investigated here uses a physically-shared heap, which should be well-suited to multicore architectures. In contrast, the Eden implementation adopts an approach that has been designed for use on distributed-memory parallel machines: a system of multiple, independent heaps (one per core), with inter-core communication handled by message-passing rather than through shared heap cells. We report two main results. Firstly, we report on the effect of a number of optimisations that we applied to the shared-memory GpH implementation in order to address some performance issues that were revealed by our testing: for example, we implemented a work-stealing approach to task allocation. Our optimisations improved the performance of the shared-heap GpH implementation by as much as 30% on eight cores. Secondly, the shared heap approach is, rather surprisingly, not superior to a distributed heap…
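The work-stealing optimisation mentioned in the abstract can be sketched outside the GpH runtime. The toy scheduler below is a generic illustration (hypothetical class and task set), not the GHC implementation: each worker pops tasks from its own deque and steals from the back of a victim's deque when it runs dry.

```python
from collections import deque
import random

class WorkStealingScheduler:
    """Toy work-stealing scheduler: each worker owns a deque, pops from
    the front locally, and steals from the back of a random victim."""
    def __init__(self, n_workers, tasks):
        self.deques = [deque() for _ in range(n_workers)]
        for i, t in enumerate(tasks):            # round-robin initial spread
            self.deques[i % n_workers].append(t)

    def next_task(self, worker):
        if self.deques[worker]:
            return self.deques[worker].popleft()  # local pop (owner end)
        victims = [i for i, d in enumerate(self.deques) if i != worker and d]
        if victims:
            return self.deques[random.choice(victims)].pop()  # steal (thief end)
        return None

sched = WorkStealingScheduler(4, range(10))
done = []
while (t := sched.next_task(0)) is not None:     # worker 0 drains everything
    done.append(t)
```

Stealing from the opposite end of the victim's deque reduces contention with the owner, which is the usual rationale for this design.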

  15. The Photonic TIGER: a multicore fiber-fed spectrograph

    CERN Document Server

    Leon-Saval, Sergio G; Bland-Hawthorn, Joss

    2012-01-01

    We present a proof of concept compact diffraction limited high-resolution fiber-fed spectrograph by using a 2D multicore array input. This high resolution spectrograph is fed by a 2D pseudo-slit, the Photonic TIGER, a hexagonal array of near-diffraction limited single-mode cores. We study the feasibility of this new platform related to the core array separation and rotation with respect to the dispersion axis. A 7 core compact Photonic TIGER fiber-fed spectrograph with a resolving power of around R~31000 and 8 nm bandwidth in the IR centered on 1550 nm is demonstrated. We also describe possible architectures based on this concept for building small scale compact diffraction limited Integral Field Spectrographs (IFS).

  16. Scalable Multi-core Architectures Design Methodologies and Tools

    CERN Document Server

    Jantsch, Axel

    2012-01-01

    As Moore’s law continues to unfold, two important trends have recently emerged. First, the growth of chip capacity is translated into a corresponding increase in the number of cores. Second, the parallelization of computation and 3D integration technologies lead to distributed memory architectures. This book provides a current snapshot of industrial and academic research, conducted as part of the European FP7 MOSART project, addressing urgent challenges in many-core architectures and application mapping. It addresses the architectural design of many-core chips, memory and data management, power management, and design and programming methodologies. It also describes how new techniques have been applied in various industrial case studies. Describes trends towards distributed memory architectures and distributed power management; Integrates Network on Chip with distributed, shared memory architectures; Demonstrates novel design methodologies and frameworks for multi-core design space exploration; Shows how midll...

  17. Workload-aware VM Scheduling on Multicore Systems

    Directory of Open Access Journals (Sweden)

    Insoon Jo

    2011-11-01

    Full Text Available In virtualized environments, performance interference between virtual machines (VMs) is a key challenge. In order to mitigate resource contention, efficient VM scheduling is essential. In this paper, we propose a workload-aware VM scheduler for multi-core systems, which finds a system-wide mapping of VMs to physical cores. Our work aims not only at minimizing the number of used hosts, but also at maximizing system throughput. To achieve the first goal, our scheduler dynamically adjusts the set of used hosts. To achieve the second goal, it maps each VM to a physical core where the core and its host most sufficiently meet the resource requirements of the VM. Evaluation demonstrates that our scheduling ensures efficient use of data center resources.
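The first goal above (minimizing the number of used hosts) is a bin-packing problem. A first-fit-decreasing heuristic is one common way to approach it and is sketched below as an assumption for illustration; it is not the scheduler described in the paper, and the VM names and capacities are hypothetical.

```python
def place_vms(vm_demands, host_capacity):
    """First-fit-decreasing placement: sort VMs by demand and put each on
    the first host with enough remaining capacity, opening a new host
    only when none fits."""
    hosts = []                                   # remaining capacity per host
    placement = {}
    for vm, demand in sorted(vm_demands.items(), key=lambda kv: -kv[1]):
        for i, free in enumerate(hosts):
            if free >= demand:
                hosts[i] -= demand
                placement[vm] = i
                break
        else:                                    # no existing host fits
            hosts.append(host_capacity - demand)
            placement[vm] = len(hosts) - 1
    return placement, len(hosts)

placement, n_hosts = place_vms({"a": 0.6, "b": 0.5, "c": 0.4, "d": 0.3},
                               host_capacity=1.0)
```

Four VMs with 1.8 units of total demand pack onto two unit-capacity hosts here, whereas naive arrival-order placement could need three.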

  18. Single-shot polarimetry imaging of multicore fiber.

    Science.gov (United States)

    Sivankutty, Siddharth; Andresen, Esben Ravn; Bouwmans, Géraud; Brown, Thomas G; Alonso, Miguel A; Rigneault, Hervé

    2016-05-01

    We report an experimental test of single-shot polarimetry applied to the problem of real-time monitoring of the output polarization states in each core within a multicore fiber bundle. The technique uses a stress-engineered optical element, together with an analyzer, and provides a point spread function whose shape unambiguously reveals the polarization state of a point source. We implement this technique to monitor, simultaneously and in real time, the output polarization states of up to 180 single-mode fiber cores in both conventional and polarization-maintaining fiber bundles. We demonstrate also that the technique can be used to fully characterize the polarization properties of each individual fiber core, including eigen-polarization states, phase delay, and diattenuation.

  19. FPGA wavelet processor design using language for instruction-set architectures (LISA)

    Science.gov (United States)

    Meyer-Bäse, Uwe; Vera, Alonzo; Rao, Suhasini; Lenk, Karl; Pattichis, Marios

    2007-04-01

    The design of a microprocessor is a long, tedious, and error-prone task, typically consisting of four phases: architecture exploration; software design (assembler, linker, loader, profiler); architecture implementation (RTL generation for FPGA or cell-based ASIC); and verification. The Language for instruction-set architectures (LISA) allows a microprocessor to be modeled not only from the instruction-set view but also from an architecture description including pipelining behavior, which provides design and development tool consistency across all levels of the design. To explore the capability of the LISA processor design platform, a.k.a. CoWare Processor Designer, we present in this paper three microprocessor designs that implement an 8/8 wavelet transform processor of the kind used in today's FBI fingerprint compression scheme. We have designed a 3-stage pipelined 16-bit RISC processor (NanoBlaze). Although RISC μPs are usually considered "fast" processors due to design concepts like constant instruction word size, deep pipelines and many general-purpose registers, it turns out that DSP operations consume considerable processing time in a RISC processor. In a second step we have used design principles from programmable digital signal processors (PDSPs) to improve the throughput of the DWT processor. A multiply-accumulate operation along with an indirect addressing operation were the key to achieving higher throughput. A further improvement is possible with today's FPGA technology. Today's FPGAs offer a large number of embedded array multipliers, and it is now feasible to design a "true" vector processor (TVP). A multiplication of two vectors can be done in just one clock cycle with our TVP, a complete scalar product in two clock cycles. Code profiling and Xilinx FPGA ISE synthesis results are provided that demonstrate the essential improvement that a TVP has compared with traditional RISC or PDSP designs.

  20. I/O Strategies for Multicore Processing in ATLAS

    Science.gov (United States)

    van Gemmeren, P.; Binet, S.; Calafiura, P.; Lavrijsen, W.; Malon, D.; Tsulaia, V.

    2012-12-01

    A critical component of any multicore/manycore application architecture is the handling of input and output. Even in the simplest of models, design decisions interact both in obvious and in subtle ways with persistence strategies. When multiple workers handle I/O independently using distinct instances of a serial I/O framework, for example, it may happen that because of the way data from consecutive events are compressed together, there may be serious inefficiencies, with workers redundantly reading the same buffers, or multiple instances thereof. With shared reader strategies, caching and buffer management by the persistence infrastructure and by the control framework may have decisive performance implications for a variety of design choices. Providing the next event may seem straightforward when all event data are contiguously stored in a block, but there may be performance penalties to such strategies when only a subset of a given event's data are needed; conversely, when event data are partitioned by type in persistent storage, providing the next event becomes more complicated, requiring marshalling of data from many I/O buffers. Output strategies pose similarly subtle problems, with complications that may lead to significant serialization and the possibility of serial bottlenecks, either during writing or in post-processing, e.g., during data stream merging. In this paper we describe the I/O components of AthenaMP, the multicore implementation of the ATLAS control framework, and the considerations that have led to the current design, with attention to how these I/O components interact with ATLAS persistent data organization and infrastructure.
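The shared-reader strategy discussed in this abstract can be sketched in miniature. In the sketch below, threads and an in-process queue stand in for AthenaMP's worker processes and its actual inter-process communication, and the integer "events" and doubling "reconstruction" step are purely hypothetical; the point is only that each event is read once and handed to whichever worker asks next, instead of every worker redundantly re-reading the same buffers.

```python
import queue
import threading

def shared_reader(events, q, n_workers):
    """Single shared reader: each event is read (and, in a real system,
    decompressed) exactly once, then dispatched to the workers."""
    for ev in events:
        q.put(ev)
    for _ in range(n_workers):
        q.put(None)                 # one poison pill per worker

def worker(q, out):
    while (ev := q.get()) is not None:
        out.put(ev * 2)             # stand-in for per-event reconstruction

q, out = queue.Queue(), queue.Queue()
threads = [threading.Thread(target=worker, args=(q, out)) for _ in range(3)]
for t in threads:
    t.start()
shared_reader(range(12), q, n_workers=3)
for t in threads:
    t.join()
processed = sorted(out.get() for _ in range(12))
```

A real event store would add the caching and buffer management the abstract mentions; the queue here simply models the "provide the next event" interface.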

  1. Photonic bandgap fiber lasers and multicore fiber lasers for next generation high power lasers

    DEFF Research Database (Denmark)

    Shirakawa, A.; Chen, M.; Suzuki, Y.

    2014-01-01

    Photonic bandgap fiber lasers are realizing new laser spectra and nonlinearity mitigation that a conventional fiber laser cannot. Multicore fiber lasers are a promising tool for power scaling by coherent beam combination. © 2014 OSA.

  2. Nonlinear switching in multicore versus multimode waveguide junctions for mode-locked laser applications.

    Science.gov (United States)

    Nazemosadat, Elham; Mafi, Arash

    2013-12-16

    The main differences in nonlinear switching behavior between multicore versus multimode waveguide couplers are highlighted. By gradually decreasing the separation between the two cores of a dual-core waveguide and interpolating from a multicore to a multimode scenario, the role of the linear coupling, self-phase modulation, cross-phase modulation, and four-wave mixing terms are explored, and the key reasons are identified behind higher switching power requirements and lower switching quality in multimode nonlinear couplers.

  3. Multi-core Fibers in Submarine Networks for High-Capacity Undersea Transmission Systems

    DEFF Research Database (Denmark)

    Nooruzzaman, Md; Morioka, Toshio

    2017-01-01

    Application of multi-core fibers in undersea networks for high-capacity submarine transmission systems is studied. It is demonstrated how different architectures of submerged branching unit affect network component counts in long-haul undersea transmission systems.

  4. Low-Power and High Speed 128-Point Pipeline FFT/IFFT Processor for OFDM Applications

    Directory of Open Access Journals (Sweden)

    D. Rajaveerappa

    2012-03-01

    Full Text Available This paper presents a low-power and high-speed 128-point pipelined Fast Fourier Transform (FFT) and Inverse Fast Fourier Transform (IFFT) processor for OFDM. The modified architecture also provides a ROM module and variable-length support from 128~2048 points for FFT/IFFT for OFDM applications such as digital audio broadcasting (DAB), digital video broadcasting-terrestrial (DVB-T), asymmetric digital subscriber loop (ADSL) and very-high-speed digital subscriber loop (VDSL). The 128-point architecture consists of an optimized pipeline implementation based on a Radix-2 butterfly processing element. To reduce power consumption and chip area, special current-mode SRAMs are adopted to replace the shift registers in the delay lines. In low-power operation, when the supply voltage is scaled down to 2.3 V, the processor consumes 176 mW while running at 17.8 MHz.
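The radix-2 butterfly at the heart of such a pipeline can be shown in software form. The textbook decimation-in-time FFT below computes the same butterflies that a 128-point pipeline unrolls across hardware stages; it is a generic sketch, not the paper's architecture, and the 8-point input is a hypothetical example.

```python
import cmath

def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT. Each level of the
    recursion corresponds to one butterfly stage of a pipelined FFT."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor
        out[k] = even[k] + tw                            # butterfly: sum
        out[k + n // 2] = even[k] - tw                   # butterfly: difference
    return out

spectrum = fft_radix2([1, 1, 1, 1, 0, 0, 0, 0])
```

An IFFT can reuse the same structure with conjugated twiddle factors and a 1/N scaling, which is why FFT/IFFT processors share one datapath.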

  5. The Associative Memory Boards for the FTK Processor at ATLAS

    CERN Document Server

    Calabro, D; The ATLAS collaboration; Citraro, S; Donati, S; Giannetti, P; Lanza, A; Luciano, P; Magalotti, D; Piendibene, M

    2013-01-01

    The Associative Memory (AM) system, the main part of the FastTracker (FTK) processor, is designed to perform pattern matching using the information of the silicon tracking detectors. It finds track candidates at low resolution that are seeds for the following step performing precise track fitting. The system has to support challenging data traffic, handled by a group of modern low-cost FPGAs, the Xilinx Spartan6 chips, which have Low-Power Gigabit Transceivers (GTP). Each GTP transceiver is a combined transmitter and receiver capable of operating at data rates up to 3.2 Gb/s.

  6. On-board neural processor design for intelligent multisensor microspacecraft

    Science.gov (United States)

    Fang, Wai-Chi; Sheu, Bing J.; Wall, James

    1996-03-01

    A compact VLSI neural processor based on the Optimization Cellular Neural Network (OCNN) has been under development to provide a wide range of support for an intelligent remote sensing microspacecraft which requires both high-bandwidth communication and high-performance computing for on-board data analysis, thematic data reduction, synergy of multiple types of sensors, and other advanced smart-sensor functions. The OCNN is developed with emphasis on its capability to find globally optimal solutions by using a hardware annealing method. The hardware annealing function is embedded in the network. It is a parallel version of fast mean-field annealing in analog networks, and is highly efficient in finding globally optimal solutions for cellular neural networks. The OCNN is designed to perform programmable functions for fine-grained processing with annealing control to enhance the output quality. The OCNN architecture is a programmable multi-dimensional array of neurons, each locally connected to its neighboring neurons. Major design features of the OCNN neural processor include massively parallel neural processing, hardware annealing capability, a winner-take-all mechanism, digitally programmable synaptic weights, and a multisensor parallel interface. The current-mode VLSI design feasibility of a compact OCNN neural processor is demonstrated by a prototype 5 x 5 neuroprocessor array chip in a 2-micrometer CMOS technology. The OCNN operation theory, architecture, design and implementation, prototype chip, and system applications have been investigated in detail and are presented in this paper.

  7. Ultrafast Fourier-transform parallel processor

    Energy Technology Data Exchange (ETDEWEB)

    Greenberg, W.L.

    1980-04-01

    A new, flexible, parallel-processing architecture is developed for a high-speed, high-precision Fourier transform processor. The processor is intended for use in 2-D signal processing including spatial filtering, matched filtering and image reconstruction from projections.

  8. Adapting implicit methods to parallel processors

    Energy Technology Data Exchange (ETDEWEB)

    Reeves, L.; McMillin, B.; Okunbor, D.; Riggins, D. [Univ. of Missouri, Rolla, MO (United States)

    1994-12-31

    When numerically solving many types of partial differential equations, it is advantageous to use implicit methods because of their better stability and more flexible parameter choice (e.g., larger time steps). However, since implicit methods usually require simultaneous knowledge of the entire computational domain, these methods are difficult to implement directly on distributed-memory parallel processors. This leads to infrequent use of implicit methods on parallel/distributed systems. The usual implementation of implicit methods is inefficient due to the nature of parallel systems, where it is common to take the computational domain and distribute the grid points over the processors so as to maintain a relatively even workload per processor. This creates a problem at the locations in the domain where adjacent points are not on the same processor. In order for the values at these points to be calculated, messages have to be exchanged between the corresponding processors. Without special adaptation, this will result in idle processors during part of the computation, and as the number of idle processors increases, the effective speed improvement from using a parallel processor decreases.
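The boundary-exchange problem described above can be sketched with a toy 1D Laplace solver: the domain is split into parts ("processors"), and before every sweep each part receives its neighbours' edge values. Plain arrays and a Jacobi sweep stand in for real message passing and for one iteration of an implicit solve; the domain split, boundary values, and iteration count are all illustrative assumptions.

```python
import numpy as np

def jacobi_decomposed(u_parts, steps):
    """Jacobi sweeps on a decomposed 1D domain with an explicit halo
    exchange before every sweep - the message-exchange step the
    abstract identifies as the parallel bottleneck."""
    for _ in range(steps):
        # halo exchange: each part receives its neighbours' edge values
        halos = []
        for i, part in enumerate(u_parts):
            left = u_parts[i - 1][-1] if i > 0 else 0.0               # u(0) = 0
            right = u_parts[i + 1][0] if i < len(u_parts) - 1 else 1.0  # u(L) = 1
            halos.append((left, right))
        # local sweep on each part using the received halo values
        new_parts = []
        for part, (left, right) in zip(u_parts, halos):
            ext = np.concatenate([[left], part, [right]])
            new_parts.append(0.5 * (ext[:-2] + ext[2:]))
        u_parts = new_parts
    return u_parts

parts = jacobi_decomposed([np.zeros(4), np.zeros(4)], steps=2000)
solution = np.concatenate(parts)
```

The solver converges to the linear profile of the 1D Laplace equation; on a real machine each halo line would be an actual message, and a processor sits idle whenever it waits for one.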

  9. The TM3270 Media-processor

    NARCIS (Netherlands)

    van de Waerdt, J.W.

    2006-01-01

    In this thesis, we present the TM3270 VLIW media-processor, the latest of the TriMedia processors, and describe the innovations with respect to its predecessor, the TM3260. We describe enhancements to the load/store unit design, such as a new data prefetching technique, and architectural enhancements.

  10. Multi-output programmable quantum processor

    OpenAIRE

    Yu, Yafei; Feng, Jian; Zhan, Mingsheng

    2002-01-01

    By combining telecloning and the programmable quantum gate array presented by Nielsen and Chuang [Phys. Rev. Lett. 79:321 (1997)], we propose a programmable quantum processor which can be programmed to implement a restricted set of operations with several identical data outputs. The outputs are approximately-transformed versions of the input data. The processor succeeds with a certain probability.

  11. 7 CFR 1215.14 - Processor.

    Science.gov (United States)

    2010-01-01

    ... AND ORDERS; MISCELLANEOUS COMMODITIES), DEPARTMENT OF AGRICULTURE POPCORN PROMOTION, RESEARCH, AND CONSUMER INFORMATION Popcorn Promotion, Research, and Consumer Information Order Definitions § 1215.14 Processor. Processor means a person engaged in the preparation of unpopped popcorn for the market who...

  12. The TM3270 Media-processor

    NARCIS (Netherlands)

    van de Waerdt, J.W.

    2006-01-01

    In this thesis, we present the TM3270 VLIW media-processor, the latest of the TriMedia processors, and describe the innovations with respect to its predecessor, the TM3260. We describe enhancements to the load/store unit design, such as a new data prefetching technique, and architectural enhancements.

  13. Advanced Multiple Processor Configuration Study. Final Report.

    Science.gov (United States)

    Clymer, S. J.

    This summary of a study on multiple processor configurations includes the objectives, background, approach, and results of research undertaken to provide the Air Force with a generalized model of computer processor combinations for use in the evaluation of proposed flight training simulator computational designs. An analysis of a real-time flight…

  14. The Case for a Generic Implant Processor

    NARCIS (Netherlands)

    Strydis, C.; Gaydadjiev, G.N.

    2008-01-01

    A more structured and streamlined design of implants is nowadays possible. In this paper we focus on implant processors located in the heart of implantable systems. We present a real and representative biomedical-application scenario where such a new processor can be employed. Based on a suitably se

  15. An Empirical Evaluation of XQuery Processors

    NARCIS (Netherlands)

    Manegold, S.

    2008-01-01

    This paper presents an extensive and detailed experimental evaluation of XQuery processors. The study consists of running five publicly available XQuery benchmarks --- the Michigan benchmark (MBench), XBench, XMach-1, XMark and X007 --- on six XQuery processors, three stand-alone (file-based) XQuery

  16. The Case for a Generic Implant Processor

    NARCIS (Netherlands)

    Strydis, C.; Gaydadjiev, G.N.

    2008-01-01

    A more structured and streamlined design of implants is nowadays possible. In this paper we focus on implant processors located in the heart of implantable systems. We present a real and representative biomedical-application scenario where such a new processor can be employed. Based on a suitably

  17. Towards a Process Algebra for Shared Processors

    DEFF Research Database (Denmark)

    Buchholtz, Mikael; Andersen, Jacob; Løvengreen, Hans Henrik

    2002-01-01

    We present initial work on a timed process algebra that models sharing of processor resources allowing preemption at arbitrary points in time. This enables us to model both the functional and the timely behaviour of concurrent processes executed on a single processor. We give a refinement relation...

  18. Verilog Implementation of 32-Bit CISC Processor

    Directory of Open Access Journals (Sweden)

    P.Kanaka Sirisha

    2016-04-01

    Full Text Available The project deals with the design of a 32-bit CISC processor and the modeling of its components using the Verilog language. The entire processor uses a 32-bit bus to deal with all the registers and the memories. This processor implements various arithmetic, logical, and data transfer operations using variable-length instructions, which is the core property of the CISC architecture. The processor also supports various addressing modes to perform a 32-bit instruction. Our processor uses the Harvard architecture (i.e., separate program and data memories) and hence has different buses to negotiate with the program memory and data memory individually. This feature enhances the speed of our processor. Hence it has two different program counters to point to the memory locations of the program memory and data memory. Our processor has 'instruction queuing', which saves the time needed to fetch instructions and hence increases the speed of operation. An 'interrupt service routine' is provided in our processor to let it address interrupts.

  19. Automated and Assistive Tools for Accelerated Code migration of Scientific Computing on to Heterogeneous MultiCore Systems

    Science.gov (United States)

    2017-04-13

    AFRL-AFOSR-UK-TR-2017-0029: Automated and Assistive Tools for Accelerated Code migration of Scientific Computing on to Heterogeneous MultiCore Systems (contract FA8655-12-1-2021, grant 12-2021, program element 61102F). The approach was based on the OmpSs programming model and the performance tools that constitute two strategic…

  20. Affinity-Based Network Interfaces for Efficient Communication on Multicore Architectures

    Institute of Scientific and Technical Information of China (English)

    Andrés Ortiz; Julio Ortega; Antonio F.Díaz; Alberto Prieto

    2013-01-01

    Improving network interface performance is demanded by applications with high communication requirements (for example, some multimedia, real-time, and high-performance computing applications), and by the availability of network links providing multiple gigabits per second of bandwidth, which could require many processor cycles for communication tasks. Multicore architectures, the current trend in microprocessor development to cope with the difficulties of further increasing clock frequencies and microarchitecture efficiencies, provide new opportunities to exploit the parallelism available in the nodes for designing efficient communication architectures. Nevertheless, although present OS network stacks include multiple threads that make it possible to execute network tasks concurrently in the kernel, implementations of packet-based or connection-based parallelism are not trivial, as they have to take into account issues related to the cost of synchronization in the access to shared resources and the efficient use of caches. Therefore, a common trend in many recent studies on this topic is to assign network interrupts and the corresponding protocol and network application processing to the same core, as with this affinity scheduling it would be possible to reduce contention for shared resources and cache misses. In this paper we propose and analyze several configurations to distribute the network interface among the different cores available in the server. These alternatives have been devised according to the affinity of the corresponding communication tasks with the location (proximity to the memories where the different data structures are stored) and characteristics of the processing core. As this approach uses several cores to accelerate the communication path of a given connection, it can be seen as complementary to those that consider several cores to simultaneously process packets belonging to either the same or different connections.
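The affinity-scheduling idea of co-locating a connection's interrupt, protocol, and application processing can be sketched as a pure placement computation. Everything below (the `assign_affinity` helper, the two-node topology, the connection names) is hypothetical illustration, not the paper's configurations.

```python
def assign_affinity(connections, topology):
    """Place each connection's interrupt, protocol, and application
    processing on cores of the same node, so they share caches and sit
    near the memory holding the connection's data structures."""
    assignment = {}
    nodes = list(topology.items())               # node -> list of core ids
    for i, conn in enumerate(connections):
        node, cores = nodes[i % len(nodes)]      # spread connections over nodes
        assignment[conn] = {
            "irq": cores[0],                     # NIC interrupt handling
            "protocol": cores[0],                # kernel network-stack work
            "app": cores[1 % len(cores)],        # application thread, same node
        }
    return assignment

plan = assign_affinity(["conn0", "conn1"],
                       topology={"node0": [0, 1], "node1": [2, 3]})
```

On Linux, such a plan could then be applied with `os.sched_setaffinity` for the application threads and the IRQ affinity interfaces for the interrupts; the sketch stops at computing the placement.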

  1. Neurovision processor for designing intelligent sensors

    Science.gov (United States)

    Gupta, Madan M.; Knopf, George K.

    1992-03-01

    A programmable multi-task neuro-vision processor, called the Positive-Negative (PN) neural processor, is proposed as a plausible hardware mechanism for constructing robust multi-task vision sensors. The computational operations performed by the PN neural processor are loosely based on the neural activity fields exhibited by certain nervous tissue layers situated in the brain. The neuro-vision processor can be programmed to generate diverse dynamic behavior that may be used for spatio-temporal stabilization (STS), short-term visual memory (STVM), spatio-temporal filtering (STF) and pulse frequency modulation (PFM). A multi-functional vision sensor that performs a variety of information processing operations on time-varying two-dimensional sensory images can be constructed from a parallel and hierarchical structure of numerous individually programmed PN neural processors.

  2. Cooperative Computing Techniques for a Deeply Fused and Heterogeneous Many-Core Processor Architecture

    Institute of Scientific and Technical Information of China (English)

    郑方; 李宏亮; 吕晖; 过锋; 许晓红; 谢向辉

    2015-01-01

    Due to advances in semiconductor techniques, many-core processors have been widely used in high performance computing. However, many applications still cannot be carried out efficiently due to the memory wall, which has become a bottleneck in many-core processors. In this paper, we present a novel heterogeneous many-core processor architecture named deeply fused many-core (DFMC) for high performance computing systems. DFMC integrates management processing elements (MPEs) and computing processing elements (CPEs), which are heterogeneous processor cores for different application features with a unified ISA (instruction set architecture), a unified execution model, and shared memory that supports cache coherence. The DFMC processor can alleviate the memory wall problem by combining a series of cooperative computing techniques of CPEs, such as multi-pattern data stream transfer, an efficient register-level communication mechanism, and a fast hardware synchronization technique. These techniques are able to improve on-chip data reuse and optimize memory access performance. This paper illustrates an implementation of a full system prototype based on FPGA with four MPEs and 256 CPEs. Our experimental results show that the effect of the cooperative computing techniques of CPEs is significant, with DGEMM (double-precision matrix multiplication) achieving an efficiency of 94%, FFT (fast Fourier transform) obtaining a performance of 207 GFLOPS and FDTD (finite-difference time-domain) obtaining a performance of 27 GFLOPS.

  3. Computing trends using graphic processor in high energy physics

    CERN Document Server

    Niculescu, Mihai

    2011-01-01

    One of the main challenges in High Energy Physics is to analyze quickly the large amount of experimental and simulated data. At LHC-CERN one p-p event is approximately 1 MB in size. The time taken to analyze the data and obtain fast results depends on high computational power. The main advantage of GPU (Graphics Processing Unit) programming over traditional CPU programming is that graphics cards bring a lot of computing power at a very low price. Today a huge number of applications (scientific, financial, etc.) are being ported to or developed for GPUs, including Monte Carlo tools and data analysis tools for High Energy Physics. In this paper, we present the current status and trends in HEP computing using GPUs.

  4. Design of Heterogeneous Multi-core SoC Based on System Generator

    Institute of Scientific and Technical Information of China (English)

    杨宏来; 黄旻

    2012-01-01

    Heterogeneous multi-core processors can assign different types of tasks to different types of processor cores for parallel processing, and can therefore provide a flexible and efficient mechanism for diverse application requirements. This paper proposes a design method for SoC-oriented heterogeneous multi-core systems with which image processing algorithms can be realized conveniently and efficiently. The basic methods of image degradation and restoration are described, a basic model of the algorithm is given, and system-level modeling and simulation are performed with the digital signal processing development tool System Generator. The coprocessor Pcore for image degradation and restoration is then generated automatically via the EDK Processor, and is combined with the Xilinx MicroBlaze soft core to build a heterogeneous multi-core SoC. The method eliminates most of the work of writing HDL manually and simplifies the design process.

  5. Application of LBM on Multi-Core Parallel Programming Model

    Institute of Scientific and Technical Information of China (English)

    李彬彬; 李青

    2011-01-01

    The LBGK (Lattice Bhatnagar-Gross-Krook) model is not only a new development in the theory and application of the LBM (Lattice Boltzmann Method) but also a novel numerical method well suited to massively parallel processing. Through its thread management, the MTI (Multi-Thread Interface) library provides two main methods for parallel coding on multicore processors: data parallelism based on cache blocking, and task scheduling with work stealing. MTI makes full use of the resources of multicore processors and offers a convenient and efficient interface for development in multicore environments, greatly reducing the burden on developers. An LBGK model for pattern formation is realized with MTI, and the numerical results show that MTI is efficient and easy to use.

  6. Making CSB+-Tree Processor Conscious

    DEFF Research Database (Denmark)

    Samuel, Michael; Pedersen, Anders Uhl; Bonnet, Philippe

    2005-01-01

    Cache-conscious indexes, such as CSB+-tree, are sensitive to the underlying processor architecture. In this paper, we focus on how to adapt the CSB+-tree so that it performs well on a range of different processor architectures. Previous work has focused on the impact of node size on the performance...... of the CSB+-tree. We argue that it is necessary to consider a larger group of parameters in order to adapt CSB+-tree to processor architectures as different as Pentium and Itanium. We identify this group of parameters and study how it impacts the performance of CSB+-tree on Itanium 2. Finally, we propose...

  7. Processor arrays with asynchronous TDM optical buses

    Science.gov (United States)

    Li, Y.; Zheng, S. Q.

    1997-04-01

    We propose a pipelined asynchronous time division multiplexing optical bus. Such a bus can use one of two hardwired priority schemes, the linear priority scheme and the round-robin priority scheme. Our simulation results show that the performance of our proposed buses is significantly better than that of known pipelined synchronous time division multiplexing optical buses. We also propose a class of processor arrays connected by pipelined asynchronous time division multiplexing optical buses. We claim that our proposed processor arrays not only have better performance but also better scalability than the existing processor arrays connected by pipelined synchronous time division multiplexing optical buses.

  8. Next generation Associative Memory devices for the FTK tracking processor of the ATLAS experiment

    CERN Document Server

    Andreani, A; The ATLAS collaboration; Beccherle, B; Beretta, M; Citterio, M; Crescioli, F; Colombo, A; Giannetti, P; Liberali, V; Shojaii, J; Stabile, A

    2013-01-01

    The AMchip is a VLSI device that implements the associative memory function, a special content addressable memory specifically designed for high energy physics applications and first used in the CDF experiment at the Tevatron. The 4th generation of AMchip has been developed for the core pattern recognition stage of the Fast TracKer (FTK) processor: a hardware processor for online reconstruction of particle trajectories at the ATLAS experiment at the LHC. We present the architecture, design considerations, power consumption and performance measurements of the 4th generation of AMchip. We also present the design innovations toward the 5th generation and the first prototype results.

  9. Feature detection and SLAM on embedded processors for micro-robot navigation

    Science.gov (United States)

    Robinette, Paul; Collins, Thomas R.

    2013-05-01

    We have developed software that allows a micro-robot to localize itself at a 1 Hz rate using only onboard hardware. The Surveyor SRV-1 robot and its Blackfin processors were used to perform FAST feature detection on images. Good features selected from these images were then described using the SURF descriptor algorithm. An onboard Gumstix then correlated the features reported by the two processors and used GTSAM to develop an estimate of robot localization and landmark positions. Localization errors in this system were on the same order of magnitude as the size of the robot itself, giving the robot the potential to autonomously operate in a real-world environment.

  10. The MIDAS processor. [Multivariate Interactive Digital Analysis System for multispectral scanner data

    Science.gov (United States)

    Kriegler, F. J.; Gordon, M. F.; Mclaughlin, R. H.; Marshall, R. E.

    1975-01-01

    The MIDAS (Multivariate Interactive Digital Analysis System) processor is a high-speed processor designed to process multispectral scanner data (from Landsat, EOS, aircraft, etc.) quickly and cost-effectively to meet the requirements of users of remote sensor data, especially from very large areas. MIDAS consists of a fast multipipeline preprocessor and classifier, an interactive color display and color printer, and a medium scale computer system for analysis and control. The system is designed to process data having as many as 16 spectral bands per picture element at rates of 200,000 picture elements per second into as many as 17 classes using a maximum likelihood decision rule.
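The maximum likelihood decision rule at the heart of the MIDAS classifier can be sketched for the Gaussian case; the class statistics below are illustrative placeholders, not MIDAS parameters:

```python
import numpy as np

def ml_classify(pixels, means, covs):
    """Assign each multispectral pixel vector to the class whose
    Gaussian log-likelihood is highest (maximum likelihood rule)."""
    scores = np.empty((pixels.shape[0], len(means)))
    for k, (mu, cov) in enumerate(zip(means, covs)):
        diff = pixels - mu
        # log N(x; mu, cov) up to a class-independent constant:
        # -0.5 * (log|cov| + (x - mu)^T cov^-1 (x - mu))
        mahal = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
        scores[:, k] = -0.5 * (np.log(np.linalg.det(cov)) + mahal)
    return np.argmax(scores, axis=1)
```

In a hardware pipeline such as MIDAS, the per-class score computation is what gets replicated across pipelines; here it is simply a loop over classes.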

  11. Quantization analysis of the real-time SAR digital image formation processor

    Energy Technology Data Exchange (ETDEWEB)

    Magotra, N.

    1988-12-01

    This report presents a quantization analysis of the digital image formation processor (IFP) of a linear-FM synthetic aperture radar (SAR). The IFP is configured as a patch processor and forms the final image by performing a two-dimensional Fast Fourier Transform (FFT). The quantization analysis examines the effects of using fixed precision arithmetic in the image formation process. Theoretical bounds for the worst-case errors introduced by using fixed point arithmetic and experimental results verifying the theoretical bounds are presented. 34 refs., 23 figs., 7 tabs.
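The effect the report analyses, error introduced by fixed-point arithmetic in an FFT-based processing chain, can be illustrated by quantizing the FFT input to a hypothetical fixed-point word length and comparing against the floating-point result (the 12-bit choice below is arbitrary, not the IFP's format):

```python
import numpy as np

def quantize(x, frac_bits):
    """Round to a fixed-point grid with `frac_bits` fractional bits."""
    scale = 2.0 ** frac_bits
    return np.round(x * scale) / scale

rng = np.random.default_rng(0)
signal = 0.25 * rng.standard_normal(256)    # keep |x| < 1 (Q-format range)
exact = np.fft.fft(signal)
approx = np.fft.fft(quantize(signal, 12))   # 12 fractional bits
worst_case_error = np.max(np.abs(exact - approx))
```

This only models input quantization; the report's bounds also cover rounding inside the FFT butterflies themselves.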

  12. Unified UDispatch: A User Dispatching Tool for Multicore Systems

    Institute of Scientific and Technical Information of China (English)

    Tang-Hsun Tu; Chih-Wen Hsueh

    2011-01-01

    In multicore environments, multithreading is often used to improve application performance. However, even in many simple applications, performance might degrade as the number of threads increases. Users usually attribute this phenomenon to the overhead of creating or terminating threads. In our observation, how the threads are dispatched to the multiple cores might have a more significant effect. We formally defined the problems of using threads as multithreading anomalies, and presented a novel user dispatching mechanism (UDispatch) which provides controllability in user space to improve application performance. Through modification of application source code with the UDispatch application programming interface (API), application performance can be improved significantly. However, since application source code might not be available or might be too complicated to modify, we provided an extension, called UDispatch+, to dispatch threads without any modification of application source code. In this paper, UDispatch and UDispatch+ are integrated and wrapped for more portability and introduced as a tool called Unified UDispatch (UUD), with more detailed experiments and description. It can dispatch application threads to specific cores at the discretion of users, with up to 171.8% performance improvement on a 4-core machine.
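The core idea, choosing explicitly which core each thread runs on instead of leaving it to the OS scheduler, can be sketched with a toy round-robin dispatch plan (an illustrative policy only; UDispatch's actual API and policies are richer):

```python
def plan_dispatch(n_threads, cores):
    """Map each thread index to a core id, round-robin.
    On Linux, a worker thread could then pin itself with
    os.sched_setaffinity(0, {plan[my_index]})."""
    return {t: cores[t % len(cores)] for t in range(n_threads)}

# Six threads over a hypothetical 4-core machine.
plan = plan_dispatch(6, [0, 1, 2, 3])
```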

  13. Variations on Multi-Core Nested Depth-First Search

    Directory of Open Access Journals (Sweden)

    Alfons Laarman

    2011-10-01

    Full Text Available Recently, two new parallel algorithms for on-the-fly model checking of LTL properties were presented at the same conference: Automated Technology for Verification and Analysis, 2011. Both approaches extend Swarmed NDFS, which runs several sequential NDFS instances in parallel. While parallel random search already speeds up detection of bugs, the workers must share some global information in order to speed up full verification of correct models. The two algorithms differ considerably in the global information shared between workers, and in the way they synchronize. Here, we provide a thorough experimental comparison between the two algorithms, by measuring the runtime of their implementations on a multi-core machine. Both algorithms were implemented in the same framework of the model checker LTSmin, using similar optimizations, and have been subjected to the full BEEM model database. Because both algorithms have complementary advantages, we constructed an algorithm that combines both ideas. This combination clearly has an improved speedup. We also compare the results with the alternative parallel algorithm for accepting cycle detection OWCTY-MAP. Finally, we study a simple statistical model for input models that do contain accepting cycles. The goal is to distinguish the speedup due to parallel random search from the speedup that can be attributed to clever work sharing schemes.
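For reference, the sequential NDFS that both parallel algorithms extend looks for an accepting cycle with a blue (outer) DFS and, in post-order from each accepting state, a red (inner) DFS; a minimal sketch on an explicit successor map (not the LTSmin implementation):

```python
def ndfs(succ, accepting, init):
    """Return True iff an accepting cycle is reachable from `init`."""
    blue, red = set(), set()

    def red_dfs(s, seed):
        if s == seed:
            return True  # closed a cycle through the accepting seed
        red.add(s)
        return any(t not in red and red_dfs(t, seed) for t in succ.get(s, ()))

    def blue_dfs(s):
        blue.add(s)
        if any(t not in blue and blue_dfs(t) for t in succ.get(s, ())):
            return True
        # Post-order: launch the red search from accepting states.
        return s in accepting and any(
            t not in red and red_dfs(t, s) for t in succ.get(s, ()))

    return blue_dfs(init)
```

The parallel variants differ precisely in how the `blue` and `red` sets are shared between workers.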

  14. Scalable Multicore Motion Planning Using Lock-Free Concurrency

    Science.gov (United States)

    Ichnowski, Jeffrey; Alterovitz, Ron

    2015-01-01

    We present PRRT (Parallel RRT) and PRRT* (Parallel RRT*), sampling-based methods for feasible and optimal motion planning designed for modern multicore CPUs. We parallelize RRT and RRT* such that all threads concurrently build a single motion planning tree. Parallelization in this manner requires that data structures, such as the nearest neighbor search tree and the motion planning tree, are safely shared across multiple threads. Rather than rely on traditional locks which can result in slowdowns due to lock contention, we introduce algorithms based on lock-free concurrency using atomic operations. We further improve scalability by using partition-based sampling (which shrinks each core’s working data set to improve cache efficiency) and parallel work-saving (in reducing the number of rewiring steps performed in PRRT*). Because PRRT and PRRT* are CPU-based, they can be directly integrated with existing libraries. We demonstrate that PRRT and PRRT* scale well as core counts increase, in some cases exhibiting superlinear speedup, for scenarios such as the Alpha Puzzle and Cubicles scenarios and the Aldebaran Nao robot performing a 2-handed task. PMID:26167135

  15. Annular-cladding erbium doped multicore fiber for SDM amplification.

    Science.gov (United States)

    Jin, Cang; Ung, Bora; Messaddeq, Younès; LaRochelle, Sophie

    2015-11-16

    We propose and numerically investigate annular-cladding erbium doped multicore fibers (AC-EDMCF) with either solid or air hole inner cladding to enhance the pump power efficiency in optical amplifiers for spatial division multiplexing (SDM) transmission links. We first propose an all-glass fiber in which a central inner cladding region with a depressed refractive index is introduced to confine the pump inside a ring-shaped region overlapping the multiple signal cores. Through numerical simulations, we determine signal core and annular pump cladding parameters respecting fabrication constraints. We also propose and examine a multi-spot injection scheme for launching the pump in the annular cladding. With this all-glass fiber with annular cladding, our results predict 10 dB increase in gain and 21% pump power savings compared to the standard double cladding design. We also investigate a fiber with an air hole inner cladding to further enhance the pump power confinement and minimize power leaking into the inner cladding. The results are compared to the all-glass AC-EDMCF.

  16. Mobile Computing Clouds Interactive Model and Algorithm Based On Multi-core Grids

    Directory of Open Access Journals (Sweden)

    Liu Lizhao

    2013-09-01

    Full Text Available Multi-core technology is a key technology of mobile cloud computing. With the booming development of cloud technology, the authors focus on the problem of how target code compiled by a mobile cloud terminal's multi-core compiler can make use of the cloud multi-core system structure while ensuring synchronization of cross-validated compilation data, and propose the concepts of indirect synchronization and direct synchronization of mobile cloud compilation entities. Using waveform information energy conversion, a method is given to calculate indirect and direct synchronization values according to the crossing experience and crossing time of compilation entities; a function relative level algorithm is constructed with the Hellinger distance, and a method for computing a comprehensive synchronization value is given. Through experimental statistics and analysis, taking the threshold limit value as the average and the self-synchronization value as the deviation, an update function for the indirect synchronization value is constructed; an inter-domain multi-core synchronization flow chart is given; and an inter-domain compilation data synchronization update experiment is carried out with more than 3000 mobile cloud multi-core compilation environments. Analysis of the compilation process and results shows the synchronization algorithm to be reasonable and effective.

  17. The optimization of improved mean shift object tracking in embedded multicore DSP parallel system

    Science.gov (United States)

    Tian, Li; Zhou, Fugen; Meng, Cai; Hu, Congliang

    2014-11-01

    This paper proposes a more robust and efficient Mean Shift object tracking algorithm which is optimized for embedded multicore DSP Parallel system. Firstly, the RGB image is transformed into HSV image which is robust in many aspects such as lighting changes. Then, the color histogram model is used in the back projection process to generate the color probability distribution. Secondly, the size and position of search window are initialized in the first frame, and Mean Shift algorithm calculates the center position of the target and adjusts the search window automatically both in size and location, according to the result of the previous frame. Finally, since the multicore DSP system is commonly adopted in the embedded application such as seeker and an optical scout system, we implement the proposed algorithm in the TI multicore DSP system to meet the need of large amount computation. For multicore parallel computing, the explicit IPC based multicore framework is designed which outperforms OpenMP standard. Moreover, the parallelisms of 8 functional units and cross path data fetch capability of C66 core are utilized to accelerate the computation of iteration in Mean Shift algorithm. The experimental results show that the algorithm has good performance in complex scenes such as deformation, scale change and occlusion, simultaneously the proposed optimization method can significantly reduce the computation time.
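The tracking loop described above, shift the search window to the centroid of the back-projected probability image and repeat until convergence, can be sketched as follows (fixed-size window and no scale adaptation, so a simplification of the paper's algorithm):

```python
import numpy as np

def mean_shift_track(prob, window, n_iter=10):
    """Shift a (row, col, height, width) window toward the centroid
    of the back-projected probability image `prob`."""
    r, c, h, w = window
    for _ in range(n_iter):
        patch = prob[r:r + h, c:c + w]
        total = patch.sum()
        if total == 0:
            break  # no target mass inside the window
        ys, xs = np.mgrid[0:h, 0:w]
        # Offset from window center to the probability-mass centroid.
        dy = (ys * patch).sum() / total - (h - 1) / 2
        dx = (xs * patch).sum() / total - (w - 1) / 2
        nr = min(max(int(round(r + dy)), 0), prob.shape[0] - h)
        nc = min(max(int(round(c + dx)), 0), prob.shape[1] - w)
        if (nr, nc) == (r, c):
            break  # converged
        r, c = nr, nc
    return r, c
```

Each iteration is independent per pixel inside the window, which is what makes the inner sums amenable to the multicore DSP parallelization the paper describes.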

  18. Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms

    Energy Technology Data Exchange (ETDEWEB)

    Williams, Samuel; Oliker, Leonid; Vuduc, Richard; Shalf, John; Yelick, Katherine; Demmel, James

    2008-10-16

    We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as every electronic device from cell phones to supercomputers confronts parallelism of unprecedented scale. To fully unleash the potential of these systems, the HPC community must develop multicore specific-optimization methodologies for important scientific computations. In this work, we examine sparse matrix-vector multiply (SpMV) - one of the most heavily used kernels in scientific computing - across a broad spectrum of multicore designs. Our experimental platform includes the homogeneous AMD quad-core, AMD dual-core, and Intel quad-core designs, the heterogeneous STI Cell, as well as one of the first scientific studies of the highly multithreaded Sun Victoria Falls (a Niagara2 SMP). We present several optimization strategies especially effective for the multicore environment, and demonstrate significant performance improvements compared to existing state-of-the-art serial and parallel SpMV implementations. Additionally, we present key insights into the architectural trade-offs of leading multicore design strategies, in the context of demanding memory-bound numerical algorithms.
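The baseline kernel being optimized is a row-wise multiply over a compressed sparse row (CSR) matrix; a plain Python sketch of its data-access pattern (real SpMV implementations use compiled code plus the blocking and prefetching strategies the paper studies):

```python
import numpy as np

def spmv_csr(n_rows, indptr, indices, data, x):
    """y = A @ x with A stored in CSR form: row i's nonzeros live in
    data[indptr[i]:indptr[i+1]] at columns indices[indptr[i]:indptr[i+1]]."""
    y = np.zeros(n_rows)
    for i in range(n_rows):
        # Each row touches only its stored nonzeros; x is accessed
        # irregularly through `indices`, the memory-bound part.
        for k in range(indptr[i], indptr[i + 1]):
            y[i] += data[k] * x[indices[k]]
    return y
```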

  19. Performance modeling and analysis of parallel Gaussian elimination on multi-core computers

    Directory of Open Access Journals (Sweden)

    Fadi N. Sibai

    2014-01-01

    Full Text Available Gaussian elimination is used in many applications and in particular in the solution of systems of linear equations. This paper presents mathematical performance models and analysis of four parallel Gaussian elimination methods (precisely the Original method and the new Meet in the Middle –MiM– algorithms and their variants with SIMD vectorization) on multi-core systems. Analytical performance models of the four methods are formulated and presented, followed by evaluations of these models with modern multi-core systems' operation latencies. Our results reveal that the four methods generally exhibit good performance scaling with increasing matrix size and number of cores. SIMD vectorization only makes a large difference in performance for low numbers of cores. For a large matrix size (n ⩾ 16 K), the performance difference between the MiM and Original methods falls from 16× with four cores to 4× with 16 K cores. The efficiencies of all four methods are low with 1 K cores or more, stressing a major problem of multi-core systems where the network-on-chip and memory latencies are too high in relation to basic arithmetic operations. Thus Gaussian elimination can greatly benefit from the resources of multi-core systems, but higher performance gains can be achieved if multi-core systems can be designed with lower memory operation, synchronization, and interconnect communication latencies, requirements of utmost importance and challenge in the exascale computing age.
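For concreteness, the sequential computation behind the modeled parallel variants is standard Gaussian elimination with partial pivoting; a compact sketch (the paper's methods reorganize this work across cores rather than change its result):

```python
import numpy as np

def gauss_solve(A, b):
    """Solve Ax = b by Gaussian elimination with partial pivoting."""
    A = A.astype(float).copy()
    b = b.astype(float).copy()
    n = len(b)
    for k in range(n):
        # Partial pivoting for numerical stability.
        p = k + np.argmax(np.abs(A[k:, k]))
        A[[k, p]], b[[k, p]] = A[[p, k]], b[[p, k]]
        for i in range(k + 1, n):
            m = A[i, k] / A[k, k]
            A[i, k:] -= m * A[k, k:]     # row updates: the parallel bulk
            b[i] -= m * b[k]
    # Back substitution.
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x
```

The inner row updates at each elimination step are independent, which is the parallelism the performance models quantify.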

  20. Optimization of sparse matrix-vector multiplication on emerging multicore platforms

    Energy Technology Data Exchange (ETDEWEB)

    Williams, S; Oliker, L; Vuduc, R; Shalf, J; Yelick, K; Demmel, J

    2007-04-16

    We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as every electronic device from cell phones to supercomputers confronts parallelism of unprecedented scale. To fully unleash the potential of these systems, the HPC community must develop multicore specific optimization methodologies for important scientific computations. In this work, we examine sparse matrix-vector multiply (SpMV)--one of the most heavily used kernels in scientific computing--across a broad spectrum of multicore designs. Our experimental platform includes the homogeneous AMD dual-core and Intel quad-core designs, as well as the highly multithreaded Sun Niagara and heterogeneous STI Cell. We present several optimization strategies especially effective for the multicore environment, and demonstrate significant performance improvements compared to existing state-of-the-art serial and parallel SpMV implementations. Additionally, we present key insights into the architectural tradeoffs of leading multicore design strategies, in the context of demanding memory-bound numerical algorithms.

  1. Concept of a Supervector Processor: A Vector Approach to Superscalar Processor, Design and Performance Analysis

    Directory of Open Access Journals (Sweden)

    Deepak Kumar, Ranjan Kumar Behera, K. S. Pandey

    2013-07-01

    Full Text Available Maximizing the available performance is always a goal in microprocessor design. In this paper a new technique is implemented which exploits the advantages of both superscalar and vector processing in a proposed processor called the Supervector processor. A vector processor operates on arrays of data called vectors and can greatly improve certain tasks such as numerical simulation and tasks that require heavy number crunching. On the other hand, a superscalar processor issues multiple instructions per cycle, which can enhance throughput. To exploit parallelism, multiple vector instructions are issued and executed per cycle in superscalar fashion. Case studies have been done on various benchmarks to compare the performance of the proposed supervector processor architecture with superscalar and vector processor architectures. The Trimaran framework has been used to evaluate the performance of the proposed supervector processor scheme.

  2. Photonics and Fiber Optics Processor Lab

    Data.gov (United States)

    Federal Laboratory Consortium — The Photonics and Fiber Optics Processor Lab develops, tests and evaluates high speed fiber optic network components as well as network protocols. In addition, this...

  3. Radiation Tolerant Software Defined Video Processor Project

    Data.gov (United States)

    National Aeronautics and Space Administration — MaXentric is proposing a radiation-tolerant Software Defined Video Processor, codenamed SDVP, for the problem of advanced motion imaging in the space environment....

  4. Processor-Dependent Malware... and codes

    CERN Document Server

    Desnos, Anthony; Filiol, Eric

    2010-01-01

    Malware usually targets computers according to their operating system. Thus we have Windows malware, Linux malware, and so on. In this paper, we consider a different approach and show on a technical basis how easily malware can recognize and target systems selectively, according to the onboard processor chip. This technique is very easy to build since it does not rely on deep analysis of the chip's logical gate architecture. Floating Point Arithmetic (FPA) looks promising for defining a set of tests to identify the processor or, more precisely, a subset of possible processors. We give results for different families of processors: AMD, Intel (Dual Core, Atom), Sparc, Digital Alpha, Cell, Atom ... As a conclusion, we propose two open problems that are new, to the authors' knowledge.

  5. Critical review of programmable media processor architectures

    Science.gov (United States)

    Berg, Stefan G.; Sun, Weiyun; Kim, Donglok; Kim, Yongmin

    1998-12-01

    In the past several years, there has been a surge of new programmable mediaprocessors introduced to provide an alternative solution to ASICs and dedicated hardware circuitries in the multimedia PC and embedded consumer electronics markets. These processors attempt to combine the programmability of multimedia-enhanced general purpose processors with the performance and low cost of dedicated hardware. We have reviewed five current multimedia architectures and evaluated their strengths and weaknesses.

  6. First Cluster Algorithm Special Purpose Processor

    Science.gov (United States)

    Talapov, A. L.; Andreichenko, V. B.; Dotsenko S., Vi.; Shchur, L. N.

    We describe the architecture of a special purpose processor built to realize in hardware the Wolff cluster algorithm, which is not hampered by critical slowing down. The processor simulates two-dimensional Ising-like spin systems. With minor changes the same very effective architecture, which can be defined as a Memory Machine, can be used to study phase transitions in a wide range of models in two or three dimensions.

  7. A New Echeloned Poisson Series Processor (EPSP)

    Science.gov (United States)

    Ivanova, Tamara

    2001-07-01

    A specialized Echeloned Poisson Series Processor (EPSP) is proposed. It is a typical software for the implementation of analytical algorithms of Celestial Mechanics. EPSP is designed for manipulating long polynomial-trigonometric series with literal divisors. The coefficients of these echeloned series are the rational or floating-point numbers. The Keplerian processor and analytical generator of special celestial mechanics functions based on the EPSP are also developed.

  8. Evaluating current processors performance and machines stability

    CERN Document Server

    Esposito, R; Tortone, G; Taurino, F M

    2003-01-01

    Accurately estimating the performance of currently available processors is becoming a key activity, particularly in the HENP environment, where high computing power is crucial. This document describes the methods and programs, open source or freeware, used to benchmark processors, memory and disk subsystems, and network connection architectures. These tools are also useful for stress testing new machines, before their acquisition or before their introduction into a production environment, where high uptimes are requested.

  9. A fast PC algorithm for high dimensional causal discovery with multi-core PCs

    OpenAIRE

    Le, Thuc Duy; Hoang, Tao; Li, Jiuyong; Liu, Lin; Liu, Huawen

    2015-01-01

    Discovering causal relationships from observational data is a crucial problem with applications in many research areas. The PC algorithm is the state-of-the-art constraint-based method for causal discovery. However, the runtime of the PC algorithm is, in the worst case, exponential in the number of nodes (variables), and thus it is inefficient when applied to high dimensional data, e.g. gene expression datasets. On another note, the advancement of computer hardware in the last decade ...
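One reason the PC algorithm parallelizes well on multi-core hardware is that the conditional-independence tests within each level are independent of one another. The order-0 (marginal) step of the skeleton search can be sketched as below, where a simple correlation threshold stands in for a proper statistical independence test:

```python
import numpy as np
from itertools import combinations

def skeleton_order0(data, threshold=0.1):
    """Order-0 step of the PC skeleton search: keep an edge (i, j)
    only if variables i and j appear marginally dependent.
    Each edge's test is independent, so the tests of one level
    can run concurrently across cores."""
    n_vars = data.shape[1]
    corr = np.corrcoef(data, rowvar=False)
    return {(i, j) for i, j in combinations(range(n_vars), 2)
            if abs(corr[i, j]) >= threshold}
```

Higher-order steps then condition on growing separating sets; they follow the same per-edge independence structure.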

  10. SMART AS A CRYPTOGRAPHIC PROCESSOR

    Directory of Open Access Journals (Sweden)

    Saroja Kanchi

    2016-05-01

    Full Text Available SMaRT is a 16-bit 2.5-address RISC-type single-cycle processor, which was recently designed and successfully mapped into an FPGA chip in our ECE department. In this paper, we use SMaRT to run the well-known encryption algorithm, the Data Encryption Standard. For information security purposes, encryption is a must in today's sophisticated and ever-increasing computer communications such as ATM machines and SIM cards. For comparison and evaluation purposes, we also map the same algorithm on the HC12, a same-size but CISC-type off-the-shelf microcontroller. Our results show that compared to the HC12, SMaRT code is only 14% longer in terms of the static number of instructions but about 10 times faster in terms of the number of clock cycles, and 7% smaller in terms of code size. Our results also show that 2.5-address instructions, a SMaRT selling point, amount to 45% of all R-type instructions, resulting in a significant improvement in the static number of instructions and hence in code size as well as performance. Additionally, we see that the SMaRT short-branch range is sufficiently wide in 90% of cases in the SMaRT code. Our results also reveal that SMaRT's novel concept of locality of reference in using the MSBs of the registers in non-subroutine branch instructions stays valid with a remarkable hit rate of 95%!

  11. Wavelet-Based Adaptive Solvers on Multi-core Architectures for the Simulation of Complex Systems

    Science.gov (United States)

    Rossinelli, Diego; Bergdorf, Michael; Hejazialhosseini, Babak; Koumoutsakos, Petros

    We build wavelet-based adaptive numerical methods for the simulation of advection dominated flows that develop multiple spatial scales, with an emphasis on fluid mechanics problems. Wavelet based adaptivity is inherently sequential and in this work we demonstrate that these numerical methods can be implemented in software that is capable of harnessing the capabilities of multi-core architectures while maintaining their computational efficiency. Recent designs in frameworks for multi-core software development allow us to rethink parallelism as task-based, where parallel tasks are specified and automatically mapped into physical threads. This way of exposing parallelism enables the parallelization of algorithms that were considered inherently sequential, such as wavelet-based adaptive simulations. In this paper we present a framework that combines wavelet-based adaptivity with the task-based parallelism. We demonstrate good scaling performance obtained by simulating diverse physical systems on different multi-core and SMP architectures using up to 16 cores.

  12. Capacity of Space-Division Multiplexing with Heterogeneous Multi-Core Fibers

    DEFF Research Database (Denmark)

    Ye, Feihong; Peucheret, Christophe; Morioka, Toshio

    2013-01-01

    The capacity of heterogeneous multi-core fibers is explored, taking into account intra-core nonlinearities and inter-core crosstalk. Over 10 Pb/s transmission capacity can be anticipated for a densely-packed 93-core fiber with a 220 μm cladding diameter.

  13. Multicore optical fibre and fibre-optic delay line based on it

    Science.gov (United States)

    Egorova, O. N.; Astapovich, M. S.; Belkin, M. E.; Semjonov, S. L.

    2016-12-01

    The first switchable fibre-optic delay line based on a 1300-m-long multicore optical fibre has been fabricated and investigated. We have obtained signal delay times of up to 45 μs at 6.43-μs intervals. Sequential signal propagation through the cores of the multicore optical fibre makes it possible to reduce the fibre length necessary for obtaining a predetermined delay time, which is important for reducing the weight and dimensions of devices based on the use of fibre-optic delay lines.

  14. Multiwatt octave-spanning supercontinuum generation in multicore photonic-crystal fiber.

    Science.gov (United States)

    Fang, Xiao-hui; Hu, Ming-lie; Huang, Li-li; Chai, Lu; Dai, Neng-li; Li, Jin-yan; Tashchilina, A Yu; Zheltikov, Aleksei M; Wang, Ching-yue

    2012-06-15

    High-power supercontinuum spanning more than an octave was generated using a high-power femtosecond fiber laser amplifier and a multicore nonlinear photonic crystal fiber (PCF). Long multicore PCFs (as long as 20 m in our experiments) are shown to enable supercontinuum generation in an isolated fundamental supermode, with the manifold of other PCF modes suppressed due to the strong evanescent-field coupling between the cores, providing a robust 5.4 W coherent supercontinuum output with high spatial and spectral quality within the wavelength range from 500 to 1700 nm.

  15. Resource management in shared-memory multi-core nodes

    OpenAIRE

    Ferreira, Tharso de Souza

    2010-01-01

    Resource management in multi-core processors has gained importance with the evolution of applications and architectures, but this management is very complex. For example, the same parallel application executed multiple times with the same input data on a single multi-core node can have highly variable execution times. There are multiple hardware and software factors that affect performance. The way in which the hardware resources (compute and memory) are assigned to the proc...

  16. Ultrasound phase rotation beamforming on multi-core DSP.

    Science.gov (United States)

    Ma, Jieming; Karadayi, Kerem; Ali, Murtaza; Kim, Yongmin

    2014-01-01

    Phase rotation beamforming (PRBF) is a commonly used digital receive beamforming technique. However, due to its high computational requirement, it has traditionally been supported by hardwired architectures, e.g., application-specific integrated circuits (ASICs) or, more recently, field-programmable gate arrays (FPGAs). In this study, we investigated the feasibility of supporting software-based PRBF on a multi-core DSP. To alleviate the high computing requirement, analog front-end (AFE) chips integrating quadrature demodulation in addition to analog-to-digital conversion were defined and used. With these new AFE chips, only delay alignment and phase rotation need to be performed by the DSP, substantially reducing the computational load. We implemented the delay alignment and phase rotation modules on a Texas Instruments C6678 DSP with 8 cores. We found it takes 200 μs to beamform 2048 samples from 64 channels using 2 cores. With 4 cores, 20 million samples can be beamformed in one second. Therefore, ADC frequencies up to 40 MHz with 2:1 decimation in the AFE chips, or up to 20 MHz with no decimation, can be supported as long as the ADC-to-DSP I/O requirement can be met. The remaining 4 cores can work on back-end processing tasks and applications, e.g., color Doppler or ultrasound elastography. One DSP being able to handle both beamforming and back-end processing could lead to low-power and low-cost ultrasound machines, benefiting ultrasound imaging in general and portable ultrasound machines in particular.
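
The two stages left to the DSP in this scheme, coarse delay alignment and fine phase rotation, can be sketched on quadrature-demodulated (complex baseband) samples. The function name and data layout below are illustrative, not TI's API:

```python
# Hypothetical sketch of phase rotation beamforming after quadrature
# demodulation: per channel, pick the delay-aligned sample (coarse) and
# rotate its phase (fine), then sum across channels.
import cmath

def beamform(channels, delays, phases):
    """channels: per-channel lists of complex baseband samples;
    delays: integer sample delays; phases: fine phase corrections (radians)."""
    n = min(len(ch) - d for ch, d in zip(channels, delays))
    out = []
    for i in range(n):
        acc = 0j
        for ch, d, ph in zip(channels, delays, phases):
            # Coarse delay alignment: index the delayed sample;
            # fine phase rotation: multiply by exp(j*phi).
            acc += ch[i + d] * cmath.exp(1j * ph)
        out.append(acc)
    return out

# Two channels carrying the same signal, the second delayed by one sample and
# phase-shifted by 0.5 rad; after correction the channels add coherently.
sig = [1 + 0j, 0 + 1j]
ch0 = sig
ch1 = [0j] + [x * cmath.exp(-0.5j) for x in sig]
out = beamform([ch0, ch1], delays=[0, 1], phases=[0.0, 0.5])
```

With the corrections applied, each output sample is the coherent sum of the two channels.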

  17. IDSP- INTERACTIVE DIGITAL SIGNAL PROCESSOR

    Science.gov (United States)

    Mish, W. H.

    1994-01-01

    The Interactive Digital Signal Processor, IDSP, consists of a set of time series analysis "operators" based on the various algorithms commonly used for digital signal analysis work. The processing of a digital time series to extract information is usually achieved by the application of a number of fairly standard operations. However, it is often desirable to "experiment" with various operations and combinations of operations to explore their effect on the results. IDSP is designed to provide an interactive and easy-to-use system for this type of digital time series analysis. The IDSP operators can be applied in any sensible order (even recursively), and can be applied to single time series or to simultaneous time series. IDSP is being used extensively to process data obtained from scientific instruments onboard spacecraft. It is also an excellent teaching tool for demonstrating the application of time series operators to artificially-generated signals. IDSP currently includes over 43 standard operators. Processing operators provide for Fourier transformation operations, design and application of digital filters, and Eigenvalue analysis. Additional support operators provide for data editing, display of information, graphical output, and batch operation. User-developed operators can be easily interfaced with the system to provide for expansion and experimentation. Each operator application generates one or more output files from an input file. The processing of a file can involve many operators in a complex application. IDSP maintains historical information as an integral part of each file so that the user can display the operator history of the file at any time during an interactive analysis. IDSP is written in VAX FORTRAN 77 for interactive or batch execution and has been implemented on a DEC VAX-11/780 operating under VMS. The IDSP system generates graphics output for a variety of graphics systems. The program requires the use of Versaplot and Template plotting

  18. Conversion of an 8-bit to a 16-bit Soft-core RISC Processor

    Directory of Open Access Journals (Sweden)

    Ahmad Jamal Salim

    2013-03-01

    The demand for 8-bit processors remains strong despite manufacturers' efforts to push higher-end microcontroller solutions to the mass market. A low-end processor offers a simple, low-cost and fast solution, especially for I/O application development in embedded systems. However, due to architectural constraints, complex calculations cannot be performed efficiently on an 8-bit processor. This paper presents a method for converting an 8-bit to a 16-bit Reduced Instruction Set Computer (RISC) processor on a soft-core reconfigurable platform, extending its capability to handle larger data sets and thus enabling calculation-intensive processing. While the conversion expands the data bus width to 16 bits, it maintains the simple architecture of the 8-bit design, and the expansion provides more room for improving the processor's performance. The modified architecture is successfully simulated in CPUSim together with its new instruction set architecture (ISA), and a Xilinx Virtex-6 platform is used to execute and verify it. Results show that the modified 16-bit RISC architecture requires only 17% more register slices in the Field Programmable Gate Array (FPGA) implementation, a slight increase compared to the original 8-bit RISC architecture. A test program containing instructions that handle 16-bit data is also simulated and verified. As the 16-bit architecture is described as a soft core, further modifications can be made to customize it for specific applications.

  19. PERFORMANCE EVALUATION OF OR1200 PROCESSOR WITH EVOLUTIONARY PARALLEL HPRC USING GEP

    Directory of Open Access Journals (Sweden)

    R. Maheswari

    2012-04-01

    In this era of fast computing, most embedded systems require more computing power to complete complex functions or tasks in less time. One way to achieve this is to boost processor performance, allowing the processor core to run faster. This paper presents a novel technique for increasing performance through parallel HPRC (High Performance Reconfigurable Computing) in the CPU/DSP (Digital Signal Processor) unit of the OR1200 (Open Reduced Instruction Set Computer (RISC) 1200), using Gene Expression Programming (GEP), an evolutionary programming model. OR1200 is a soft-core RISC processor from the Intellectual Property cores that can efficiently run any modern operating system. A parallel HPRC unit is placed inside the integer execution pipeline of the CPU/DSP core to increase performance. The GEP parallel HPRC is activated/deactivated by triggering the signals (i) HPRC_Gene_Start and (ii) HPRC_Gene_End. A Verilog HDL (Hardware Description Language) functional model of the GEP parallel HPRC is developed and synthesised using Xilinx ISE in the first part of the work, and the CoreMark processor benchmark is used to test the performance of the OR1200 soft core in the second part. The results show an overall speed-up of 20.59% from the GEP-based parallel HPRC in the execution unit of the OR1200.

  20. Study and application of a new kind of event-driven processor

    Institute of Scientific and Technical Information of China (English)

    韩琳; 潘登

    2012-01-01

    A new kind of processor, an event-driven multi-core processor, was studied to meet the requirements of low cost, high efficiency and high flexibility in modern electronics design. From a study of the processor's basic architecture, and a comparison between designs using the new processor and those using traditional controllers, it is concluded that the processor offers high performance, strong real-time behaviour and easy programming. A new design method, in which hardware is designed in a software-like way, is proposed to provide new ideas and a reference for electronic system design.

  1. FFT PROCESSOR IMPLEMENTATION & THROUGHPUT OPTIMIZATION USING DMA & C2H COMPILER

    Directory of Open Access Journals (Sweden)

    Varsha Adhangale

    2013-07-01

    The Discrete Fourier Transform (DFT) is an important transform in signal analysis and processing, but its O(N²) time complexity is unacceptable in many situations, so making the DFT fast and efficient is an important problem. Exploiting the structure of the DFT, the FFT reduces the time complexity to O(N log N). This paper presents an 8-point Fast Fourier Transform (FFT) processor built with Altera tools and devices: a Nios II soft processor on a DE0 board, the C2H Compiler, and DMA. The Nios II is a soft-core processor implemented on the FPGA available on the Altera DE0 board. The C2H Compiler is a powerful tool that generates hardware accelerators for software functions; it enhances design productivity by allowing a compiler to accelerate software algorithms in hardware, making it possible to quickly prototype hardware functional changes in C and explore hardware/software design trade-offs in an efficient, iterative process. Performance is also increased by accelerating only part of the software program in hardware. The C2H Compiler is well suited to improving computational bandwidth as well as memory throughput, and it provides a simpler way of computing complex multiplications while decreasing latency. DMA (Direct Memory Access) is another important mechanism applied to increase the efficiency of the implemented system. The performance of the implemented FFT processor is therefore observed under three different approaches, showing how the system can be useful in various signal processing applications.
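
As a reminder of why the FFT replaces the direct DFT, here is a minimal pure-Python radix-2 decimation-in-time FFT. This is illustrative only; the paper's processor realizes the computation with Nios II software and C2H-generated hardware:

```python
# Recursive radix-2 decimation-in-time FFT: O(N log N) instead of the
# O(N^2) direct DFT. len(x) must be a power of two (e.g. the 8-point case).
import cmath

def fft(x):
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    # Split into even- and odd-indexed halves and transform each.
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out
```

For example, an 8-point impulse `fft([1, 0, 0, 0, 0, 0, 0, 0])` yields a flat spectrum of ones, and a constant input concentrates all energy in the DC bin.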

  2. Computing effective properties of random heterogeneous materials on heterogeneous parallel processors

    Science.gov (United States)

    Leidi, Tiziano; Scocchi, Giulio; Grossi, Loris; Pusterla, Simone; D'Angelo, Claudio; Thiran, Jean-Philippe; Ortona, Alberto

    2012-11-01

    In recent decades, finite element (FE) techniques have been extensively used for predicting the effective properties of random heterogeneous materials. In the case of very complex microstructures, numerical methods offer some advantages over classical analytical approaches, and they allow the use of digital images obtained from real material samples (e.g., using computed tomography). On the other hand, a large number of elements is often necessary to properly describe complex microstructures, ultimately leading to extremely time-consuming computations and high memory requirements. With the final objective of reducing these limitations, we improved an existing freely available FE code for the computation of the effective conductivity (electrical and thermal) of microstructure digital models. To allow execution on hardware combining multi-core CPUs and a GPU, we first translated the original algorithm from Fortran to C and subdivided it into software components. Then, we enhanced the C version of the algorithm for parallel processing with heterogeneous processors. To maximize performance and limit resource consumption, we used a software architecture based on stream processing, event-driven scheduling, and dynamic load balancing. The parallel version of the algorithm has been validated using a simple microstructure consisting of a single sphere located at the centre of a cubic box, yielding consistent results. Finally, the code was used to calculate the effective thermal conductivity of a digital model of a real sample (a ceramic foam imaged using X-ray computed tomography). On a computer equipped with dual hexa-core Intel Xeon X5670 processors and an NVIDIA Tesla C2050, the parallel version shows near-linear speed-up when using only the CPU cores, and executes more than 20 times faster when the GPU is also used.

  3. High performance in silico virtual drug screening on many-core processors.

    Science.gov (United States)

    McIntosh-Smith, Simon; Price, James; Sessions, Richard B; Ibarra, Amaurys A

    2015-05-01

    Drug screening is an important part of the drug development pipeline for the pharmaceutical industry. Traditional, lab-based methods are increasingly being augmented with computational methods, ranging from simple molecular similarity searches through more complex pharmacophore matching to more computationally intensive approaches, such as molecular docking. The latter simulates the binding of drug molecules to their targets, typically protein molecules. In this work, we describe BUDE, the Bristol University Docking Engine, which has been ported to the OpenCL industry standard parallel programming language in order to exploit the performance of modern many-core processors. Our highly optimized OpenCL implementation of BUDE sustains 1.43 TFLOP/s on a single Nvidia GTX 680 GPU, or 46% of peak performance. BUDE also exploits OpenCL to deliver effective performance portability across a broad spectrum of different computer architectures from different vendors, including GPUs from Nvidia and AMD, Intel's Xeon Phi and multi-core CPUs with SIMD instruction sets.

  4. RTEMS SMP for LEON3/LEON4 Multi-Processor Devices

    Science.gov (United States)

    Cederman, Daniel; Hellstrom, Daniel; Sherrill, Joel; Bloom, Gedare; Patte, Mathieu; Zulianello, Marco

    2014-08-01

    When multi-core processors are used in the space industry, they are mostly used in AMP configurations. The cost of the increased complexity and difficulty of analyzing SMP systems has been deemed too high in comparison with the benefit of more processing power. One reason for this is the lack of easy-to-analyze operating systems capable of SMP configurations. In this paper we present a European Space Agency (ESA) activity aimed at bringing easily accessible SMP support to the GR712RC and ESA's future Next Generation Microprocessor (NGMP). This will be achieved by extending the RTEMS operating system with SMP capabilities and by providing parallel programming models and related libraries to exploit the intrinsic parallelism of space applications. The work will be validated by porting the single-core Gaia Video Processing Unit space application used in ESA's Gaia satellite project to RTEMS SMP running on the GR712RC and NGMP. The paper describes the ongoing effort and gives an overview of the challenges faced in extending a real-time OS to the SMP domain. The activity is funded by ESA under contract 4000108560/13/NL/JK. Gedare Bloom is supported in part by NSF CNS-0934725.

  5. Making CSB+-Tree Processor Conscious

    DEFF Research Database (Denmark)

    Samuel, Michael; Pedersen, Anders Uhl; Bonnet, Philippe

    2005-01-01

    Cache-conscious indexes, such as the CSB+-tree, are sensitive to the underlying processor architecture. In this paper, we focus on how to adapt the CSB+-tree so that it performs well on a range of different processor architectures. Previous work has focused on the impact of node size on the performance of the CSB+-tree. We argue that it is necessary to consider a larger group of parameters in order to adapt the CSB+-tree to processor architectures as different as Pentium and Itanium. We identify this group of parameters and study how it impacts the performance of the CSB+-tree on Itanium 2. Finally, we propose a systematic method for adapting the CSB+-tree to new platforms. This work is a first step towards integrating the CSB+-tree in MySQL's heap storage manager.

  6. Dynamic Load Balancing using Graphics Processors

    Directory of Open Access Journals (Sweden)

    R Mohan

    2014-04-01

    To get maximum performance on many-core graphics processors, it is important to balance the workload evenly so that all processing units contribute equally to the task at hand. This can be hard to achieve when the cost of a task is not known beforehand and when new sub-tasks are created dynamically during execution. Two load-balancing methods, static task assignment and dynamic work stealing using deques, are compared to see which is better suited to the highly parallel world of graphics processors. They have been evaluated on the task of computing the machine's move against a human player in the well-known Four-in-a-Row game. The experiments showed that synchronization can be very expensive, and that new methods which use graphics processor features wisely may be required.
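
The work-stealing scheme compared above can be sketched as follows. This is a simplified, single-threaded illustration (class and method names are mine); a real GPU implementation needs explicit synchronization between owner and thieves, which is exactly the cost the experiments measure:

```python
# Work-stealing sketch: each worker owns a deque of tasks. The owner pushes
# and pops new sub-tasks at one end (LIFO, good locality); idle workers steal
# the oldest task from the opposite end (FIFO).
from collections import deque

class Worker:
    def __init__(self, tasks=()):
        self.tasks = deque(tasks)

    def push(self, task):
        self.tasks.append(task)

    def pop(self):
        # Owner takes its most recently pushed sub-task.
        return self.tasks.pop() if self.tasks else None

    def steal_from(self, victim):
        # A thief takes the oldest task from the victim's other end.
        if victim.tasks:
            self.push(victim.tasks.popleft())
            return True
        return False

a = Worker([1, 2, 3])
b = Worker()
b.steal_from(a)  # b takes a's oldest task
```

After the steal, worker `b` holds task 1 while `a` keeps working LIFO through 3 and 2.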

  7. Intrusion Detection Architecture Utilizing Graphics Processors

    Directory of Open Access Journals (Sweden)

    Branislav Madoš

    2012-12-01

    With thriving technology and the great increase in the usage of computer networks, the risk of these networks coming under attack has increased. A number of techniques have been created and designed to help detect and/or prevent such attacks. One common technique is the use of Intrusion Detection Systems (IDS). Today, a number of open-source and commercial IDSs are available to match enterprise requirements, but the performance of these systems is still the main concern. This paper examines perceptions of intrusion detection architecture implementation resulting from the use of graphics processors. It discusses recent research activities, developments and problems of operating system security. Some exploratory evidence is presented that shows the capabilities of combining graphics processors with intrusion detection systems. The focus is on how knowledge gained through the inclusion of graphics processors has played out in the design of an intrusion detection architecture, seen as an opportunity to strengthen research expertise.

  8. ETHERNET PACKET PROCESSOR FOR SOC APPLICATION

    Directory of Open Access Journals (Sweden)

    Raja Jitendra Nayaka

    2012-07-01

    As demand for the Internet expands significantly in numbers of users, servers, IP addresses, switches and routers, the IP-based network architecture must evolve and change. The design of domain-specific processors that require high performance, low power and a high degree of programmability is the bottleneck in many processor-based applications. This paper describes the design of an Ethernet packet processor for system-on-chip (SoC) which performs all core packet-processing functions, including segmentation and reassembly, packetization, classification, and route and queue management, which speeds up switching/routing performance. Our design has been configured for use with multiple projects targeted at a commercial configurable logic device, and the system is designed to support 10/100/1000 links with a speed advantage. VHDL has been used to implement and simulate the required functions on an FPGA.

  9. Programmable DNA-mediated multitasking processor

    CERN Document Server

    Shu, Jian-Jun; Yong, Kian-Yan; Shao, Fangwei; Lee, Kee Jin

    2015-01-01

    Because of DNA's appealing features as a material, including its minuscule size, defined structural repeat and rigidity, programmable DNA-mediated processing is a promising computing paradigm, which employs DNA as an information-storage and processing substrate to tackle computational problems. The massive parallelism of DNA hybridization has transcendent potential to improve multitasking capabilities and yield a tremendous speed-up over conventional electronic processors with their stepwise signal cascades. As an example of this multitasking capability, we present an in vitro programmable DNA-mediated optimal-route-planning processor, a functional unit that could be embedded in contemporary navigation systems. The programmable DNA-mediated processor has several advantages over existing silicon-mediated methods, such as massive data storage and simultaneous processing using far less material than conventional silicon devices.

  10. SWIFT Privacy: Data Processor Becomes Data Controller

    Directory of Open Access Journals (Sweden)

    Edwin Jacobs

    2007-04-01

    Last month, SWIFT emphasised the urgent need for a solution to compliance with US Treasury subpoenas that provides legal certainty for the financial industry as well as for SWIFT. SWIFT will continue its activities to adhere to the Safe Harbor framework of European data privacy legislation. Safe Harbor is a framework negotiated by the EU and US in 2000 to provide a way for companies in Europe, with operations in the US, to conform to EU data privacy regulations. This seems to conclude a complex privacy case, widely covered by the US and European media. A fundamental question in this case was who is a data controller and who is a mere data processor. Both the Belgian and the European privacy authorities considered SWIFT, jointly with the banks, to be a data controller, whereas SWIFT had considered itself a mere data processor that processed financial data for banks. The difference between controller and processor has far-reaching consequences.

  11. The Associative Memory Boards for the FTK Processor at ATLAS

    CERN Document Server

    Calabro', D; The ATLAS collaboration; Citraro, S; Donati, S; Giannetti, P; Lanza, A; Luciano, P; Magalotti, D; Piendibena, M

    2013-01-01

    The Associative Memory (AM) system, the main part of the FastTracker (FTK) processor, is designed to perform pattern matching using the information of the silicon tracking detectors of the ATLAS experiment. It finds track candidates at low resolution that are seeds for the following step, which performs precise track fitting. The system has to support challenging data traffic, handled by a group of modern low-cost FPGAs, the Xilinx Artix-7 chips, which have Low-Power Gigabit Transceivers (GTPs). Each GTP is a combined transmitter and receiver capable of operating at data rates up to 7 Gb/s. The paper reports on the design and the initial tests of the most recent version of the AM system, based on the new AM chip design provided with serialized I/O. An estimate of the power consumption of the final system is also given, and the cooling system design is described. The first cooling test results are reported.

  12. The Associative Memory system for the FTK processor at ATLAS

    CERN Document Server

    Cipriani, R; The ATLAS collaboration; Donati, S; Giannetti, P; Lanza, A; Luciano, P; Magalotti, D; Piendibene, M

    2013-01-01

    Experiments at the LHC hadron collider search for extremely rare processes hidden in much larger background levels. As the experiment complexity, the accelerator backgrounds and the instantaneous luminosity increase, increasingly complex and exclusive selections are necessary. We present results and performances of a new prototype of the Associative Memory (AM) system, the core of the Fast Tracker processor (FTK). FTK is a real-time tracking device for the ATLAS experiment trigger upgrade. The AM system provides massive computing power to minimize the online execution time of complex tracking algorithms. The time-consuming pattern recognition problem, generally referred to as the "combinatorial challenge", is beaten by the AM technology, which exploits parallelism to the maximum level. The Associative Memory compares the event to pre-calculated "expectations" or "patterns" (pattern matching) at once and looks for candidate tracks called "roads". The problem is solved by the time the data are loaded into the AM devices. We report ...
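
The matching rule the AM chip evaluates can be sketched sequentially (hypothetical data layout, for illustration only): each pre-computed pattern lists one coarse hit per detector layer, and a pattern fires as a "road" when every one of its hits is present in the event. The real AM is a content-addressable memory that checks all patterns simultaneously; this loop only shows the logic.

```python
# Sequential sketch of AM-style pattern matching. The hardware performs the
# same comparison for all patterns in parallel (CAM); here we iterate.
def find_roads(patterns, event_hits):
    """patterns: list of tuples, one coarse hit id per detector layer;
    event_hits: per-layer sets of hit ids recorded in the event."""
    roads = []
    for pid, pattern in enumerate(patterns):
        # A pattern fires when every one of its per-layer hits is present.
        if all(hit in layer for hit, layer in zip(pattern, event_hits)):
            roads.append(pid)
    return roads

patterns = [(1, 4, 7), (1, 5, 7), (2, 4, 8)]
event = [{1, 2}, {4}, {7, 8}]
roads = find_roads(patterns, event)
```

Here patterns 0 and 2 match the event and become road candidates for the precise track-fitting stage.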

  13. The Associative Memory system for the FTK processor at ATLAS

    CERN Document Server

    Cipriani, R; The ATLAS collaboration; Donati, S; Giannetti, P; Lanza, A; Luciano, P; Magalotti, D; Piendibene, M

    2013-01-01

    Modern experiments search for extremely rare processes hidden in much larger background levels. As the experiment complexity, the accelerator backgrounds and the luminosity increase, we need increasingly complex and exclusive selections. We present results and performances of a new prototype of the Associative Memory system, the core of the Fast Tracker processor (FTK). FTK is a real-time tracking device for the ATLAS experiment trigger upgrade. The AM system provides massive computing power to minimize the online execution time of complex tracking algorithms. The time-consuming pattern recognition problem, generally referred to as the "combinatorial challenge", is beaten by the Associative Memory (AM) technology, which exploits parallelism to the maximum level: it compares the event to pre-calculated "expectations" or "patterns" (pattern matching) at once, looking for candidate tracks called "roads". The problem is solved by the time the data are loaded into the AM devices. We report on the tests of the integrate...

  15. Reconfigurable FFT Processor – A Broader Perspective Survey

    Directory of Open Access Journals (Sweden)

    V.Sarada

    2013-04-01

    FFT (Fast Fourier Transform) processing is one of the key procedures in popular orthogonal frequency division multiplexing (OFDM) based communication systems such as Digital Audio Broadcasting (DAB), Digital Video Broadcasting - Terrestrial (DVB-T), and Asymmetric Digital Subscriber Loop (ADSL). These application domains require FFTs of various sizes, from 64 to 8192 points. Implementing each FFT on a dedicated IP presents a great overhead in the silicon area of the chip, and supporting the FFT sizes of each new wireless telecommunication standard with dedicated hardware may increase time to market. These considerations make the FFT an ideal candidate for reconfigurable implementation. Efficient implementation of an FFT processor with small area, low power and high speed is very important. This survey paper presents a study of efficient algorithms and architectures for reconfigurable FFT design and observes common traits of the good contributions.

  16. Interleaved Core Assignment for Bidirectional Transmission in Multi-Core Fibers

    DEFF Research Database (Denmark)

    Ye, Feihong; Morioka, Toshio

    2013-01-01

    We study interleaved core assignment for bidirectional transmission in multi-core fibers. By combining it with a heterogeneous core structure in an 18-core fiber, the transmission distance is extended by 10 times compared to a homogeneous core structure with unidirectional transmission, achieving...

  17. Reconfiguration in FPGA-Based Multi-Core Platforms for Hard Real-Time Applications

    DEFF Research Database (Denmark)

    Pezzarossa, Luca; Schoeberl, Martin; Sparsø, Jens

    2016-01-01

    In general-purpose computing multi-core platforms, hardware accelerators and reconfiguration are means to improve performance, i.e., the average-case execution time of a software application. In hard real-time systems, such average-case speed-up is not in itself relevant; it is the worst-case execution time that matters. ... partial reconfiguration capabilities found in modern FPGAs.

  18. Multi-Core Emptiness Checking of Timed Büchi Automata using Inclusion Abstraction

    DEFF Research Database (Denmark)

    Laarman, Alfons; Olesen, Mads Chr.; Dalsgaard, Andreas

    2013-01-01

    ... is not preserved under this abstraction, but some other structural properties are preserved. Based on those, we propose a variation of the classical nested depth-first search (NDFS) algorithm that exploits subsumption. In addition, we extend the multi-core cndfs algorithm with subsumption, providing the first...

  19. Exploiting Vector and Multicore Parallelism for Recursive, Data- and Task-Parallel Programs

    Energy Technology Data Exchange (ETDEWEB)

    Ren, Bin; Krishnamoorthy, Sriram; Agrawal, Kunal; Kulkarni, Milind

    2017-01-26

    Modern hardware contains parallel execution resources that are well suited for data parallelism (vector units) and for task parallelism (multicores). However, most work on parallel scheduling focuses on one type of hardware or the other. In this work, we present a scheduling framework that allows a unified treatment of task and data parallelism. Our key insight is an abstraction, task blocks, that uniformly handles data-parallel iterations and task-parallel tasks, allowing them to be scheduled on vector units or executed independently on cores. Our framework allows us to define schedulers that can dynamically select between executing task blocks on vector units or on multicores. We show that these schedulers are asymptotically optimal, and deliver the maximum amount of parallelism available in computation trees. To evaluate our schedulers, we develop program transformations that can convert mixed data- and task-parallel programs into task-block-based programs. Using a prototype instantiation of our scheduling framework, we show that, on an 8-core system, we can simultaneously exploit vector and multicore parallelism to achieve 14×-108× speedup over sequential baselines.
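
The task-block idea can be sketched as follows (class and method names are mine, not the authors' API): one uniform unit of work either runs its items together, vector-style, or splits into independent units that separate cores could execute.

```python
# Hypothetical sketch of the "task block" abstraction: a function plus the
# items it applies to. A scheduler can run the block as one vectorized unit
# or split it into independent single-item blocks for multicore execution.
class TaskBlock:
    def __init__(self, fn, items):
        self.fn, self.items = fn, items

    def run_vectorized(self):
        # Stand-in for SIMD execution: process the whole block together.
        return [self.fn(x) for x in self.items]

    def split(self):
        # Stand-in for multicore execution: break into independent
        # single-item blocks that separate workers could run.
        return [TaskBlock(self.fn, [x]) for x in self.items]

block = TaskBlock(lambda x: x * x, [1, 2, 3])
```

Both execution strategies produce the same results, which is what lets a dynamic scheduler pick between them at run time.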

  20. MulticoreBSP for C : A high-performance library for shared-memory parallel programming

    NARCIS (Netherlands)

    Yzelman, A. N.; Bisseling, R. H.; Roose, D.; Meerbergen, K.

    2014-01-01

    The bulk synchronous parallel (BSP) model, as well as parallel programming interfaces based on BSP, classically target distributed-memory parallel architectures. In earlier work, Yzelman and Bisseling designed a MulticoreBSP for Java library specifically for shared-memory architectures. In the prese