WorldWideScience

Sample records for fast multicore processor

  1. Periodic activity migration for fast sequential execution in future heterogeneous multicore processors

    OpenAIRE

    Michaud, Pierre

    2008-01-01

    On each new technology generation, miniaturization permits putting twice as many computing cores on the same silicon area, potentially doubling the processor performance. However, if sequential execution is not accelerated at the same time, Amdahl's law will eventually limit the actual performance. Hence it will be beneficial to have asymmetric multicores where some cores are specialized for fast sequential execution. This specialization may be achieved by architectural means, but it may also...
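
The sequential-bottleneck argument above can be made concrete with Amdahl's law: if a fraction p of a program parallelizes perfectly, adding cores can never buy more than 1/(1-p) overall speedup. A quick illustrative calculation (not from the paper):

```python
def amdahl_speedup(parallel_fraction, cores):
    """Overall speedup when only parallel_fraction of the work scales with cores."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

# With 5% of the work strictly sequential, speedup saturates below 20x no matter
# how many cores are added, which is why accelerating sequential execution on
# some (asymmetric) cores matters.
saturation = [amdahl_speedup(0.95, n) for n in (2, 4, 16, 1024)]
```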

  2. A Fast Scheme to Investigate Thermal-Aware Scheduling Policy for Multicore Processors

    Science.gov (United States)

    He, Liqiang; Narisu, Cha

    With more cores integrated into a single chip, the overall power consumption of the multiple concurrently running programs increases dramatically in a CMP, which makes the thermal problem more severe than in a traditional superscalar processor. To mitigate the thermal problem of a multicore processor, two orthogonal kinds of technique can be exploited. One is the commonly used Dynamic Thermal Management technique. The other is a thermal-aware thread scheduling policy. For the latter, some general ideas have been proposed by academic and industrial researchers. The difficulty in investigating the effectiveness of a thread scheduling policy is the huge search space arising from the different possible mapping combinations for a given multi-program workload. In this paper, we extend a simple thermal model originally used in a single-core processor to a multicore environment and propose a fast scheme to search or compare the thermal effectiveness of different scheduling policies using the new model. The experimental results show that the proposed scheme can predict the thermal characteristics of the different scheduling policies with reasonable accuracy and help researchers to quickly investigate the performance of the policies without detailed, time-consuming simulations.
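
To illustrate why thread placement affects temperature at all, here is a toy steady-state thermal model (constants and coupling structure are assumptions for illustration, not the model extended in the paper): each core heats up with its own power and, more weakly, with its neighbours' power, so spreading hot threads apart lowers the peak temperature.

```python
AMBIENT, R_SELF, R_NEIGHBOUR = 45.0, 0.8, 0.2   # illustrative constants (C, C/W)

def core_temps(power, neighbours):
    """Steady-state temperature per core: ambient + own heating + neighbour coupling."""
    return [AMBIENT + R_SELF * p + R_NEIGHBOUR * sum(power[j] for j in neighbours[i])
            for i, p in enumerate(power)]

# 2x2 chip; cores 0-1 and 2-3 share a row, 0-2 and 1-3 share a column.
NEIGHBOURS = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}

adjacent = core_temps([30, 30, 5, 5], NEIGHBOURS)  # hot threads side by side
spread = core_temps([30, 5, 5, 30], NEIGHBOURS)    # hot threads on the diagonal
```

In this toy model the adjacent mapping peaks at 76 degrees while the diagonal mapping peaks at 71, which is the kind of gap a thermal-aware scheduling policy tries to exploit.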

  3. Tiled Multicore Processors

    Science.gov (United States)

    Taylor, Michael B.; Lee, Walter; Miller, Jason E.; Wentzlaff, David; Bratt, Ian; Greenwald, Ben; Hoffmann, Henry; Johnson, Paul R.; Kim, Jason S.; Psota, James; Saraf, Arvind; Shnidman, Nathan; Strumpen, Volker; Frank, Matthew I.; Amarasinghe, Saman; Agarwal, Anant

    For the last few decades Moore’s Law has continually provided exponential growth in the number of transistors on a single chip. This chapter describes a class of architectures, called tiled multicore architectures, that are designed to exploit massive quantities of on-chip resources in an efficient, scalable manner. Tiled multicore architectures combine each processor core with a switch to create a modular element called a tile. Tiles are replicated on a chip as needed to create multicores with any number of tiles. The Raw processor, a pioneering example of a tiled multicore processor, is examined in detail to explain the philosophy, design, and strengths of such architectures. Raw addresses the challenge of building a general-purpose architecture that performs well on a larger class of stream and embedded computing applications than existing microprocessors, while still running existing ILP-based sequential programs with reasonable performance. Central to achieving this goal is Raw’s ability to exploit all forms of parallelism, including ILP, DLP, TLP, and Stream parallelism. Raw approaches this challenge by implementing plenty of on-chip resources - including logic, wires, and pins - in a tiled arrangement, and exposing them through a new ISA, so that the software can take advantage of these resources for parallel applications. Compared to a traditional superscalar processor, Raw performs within a factor of 2x for sequential applications with a very low degree of ILP, about 2x-9x better for higher levels of ILP, and 10x-100x better when highly parallel applications are coded in a stream language or optimized by hand.

  4. Enabling Future Robotic Missions with Multicore Processors

    Science.gov (United States)

    Powell, Wesley A.; Johnson, Michael A.; Wilmot, Jonathan; Some, Raphael; Gostelow, Kim P.; Reeves, Glenn; Doyle, Richard J.

    2011-01-01

    Recent commercial developments in multicore processors (e.g. Tilera, Clearspeed, HyperX) have provided an option for high performance embedded computing that rivals the performance attainable with FPGA-based reconfigurable computing architectures. Furthermore, these processors offer more straightforward and streamlined application development by allowing the use of conventional programming languages and software tools in lieu of hardware design languages such as VHDL and Verilog. With these advantages, multicore processors can significantly enhance the capabilities of future robotic space missions. This paper will discuss these benefits, along with onboard processing applications where multicore processing can offer advantages over existing or competing approaches. This paper will also discuss the key architectural features of current commercial multicore processors. In comparison to the current art, the features and advancements necessary for spaceflight multicore processors will be identified. These include power reduction, radiation hardening, inherent fault tolerance, and support for common spacecraft bus interfaces. Lastly, this paper will explore how multicore processors might evolve with advances in electronics technology and how avionics architectures might evolve once multicore processors are inserted into NASA robotic spacecraft.

  5. Taxonomy of Data Prefetching for Multicore Processors

    Institute of Scientific and Technical Information of China (English)

    Surendra Byna; Yong Chen; Xian-He Sun

    2009-01-01

    Data prefetching is an effective data access latency hiding technique to mask the CPU stall caused by cache misses and to bridge the performance gap between processor and memory. With hardware and/or software support, data prefetching brings data closer to a processor before it is actually needed. Many prefetching techniques have been developed for single-core processors. Recent developments in processor technology have brought multicore processors into the mainstream. While some of the single-core prefetching techniques are directly applicable to multicore processors, numerous novel strategies have been proposed in the past few years to take advantage of multiple cores. This paper aims to provide a comprehensive review of the state-of-the-art prefetching techniques, and proposes a taxonomy that classifies various design concerns in developing a prefetching strategy, especially for multicore processors. We compare various existing methods through analysis as well.
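
As an example of one leaf in such a taxonomy, a hardware-style stride prefetcher tracks the last address and stride per load instruction and issues a prefetch once the stride repeats. A minimal single-core sketch (illustrative, not an algorithm from the paper):

```python
class StridePrefetcher:
    """Per-PC stride prefetcher: remembers (last address, last stride) for each
    load instruction and prefetches the next block when the stride repeats."""

    def __init__(self):
        self.table = {}   # pc -> (last_addr, last_stride)

    def access(self, pc, addr):
        prefetch = None
        if pc in self.table:
            last_addr, last_stride = self.table[pc]
            stride = addr - last_addr
            if stride == last_stride and stride != 0:
                prefetch = addr + stride   # confident pattern: fetch ahead
            self.table[pc] = (addr, stride)
        else:
            self.table[pc] = (addr, 0)
        return prefetch
```

Three accesses by the same load at addresses 100, 108, 116 establish a stride of 8, so the third access triggers a prefetch of address 124.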

  6. Multi-core processors - An overview

    CERN Document Server

    Venu, Balaji

    2011-01-01

    Microprocessors have revolutionized the world we live in, and continuous efforts are being made to manufacture not only faster chips but also smarter ones. A number of techniques such as data-level parallelism, instruction-level parallelism and hyper-threading (Intel's HT) already exist and have dramatically improved the performance of microprocessor cores. This paper briefly reviews the evolution of multi-core processors, then introduces the technology and its advantages in today's world. The paper concludes by detailing the challenges currently faced by multi-core processors and how the industry is trying to address these issues.

  7. A fast band–Krylov eigensolver for macromolecular functional motion simulation on multicore architectures and graphics processors

    Energy Technology Data Exchange (ETDEWEB)

    Aliaga, José I., E-mail: aliaga@uji.es [Depto. Ingeniería y Ciencia de Computadores, Universitat Jaume I, Castellón (Spain); Alonso, Pedro [Departamento de Sistemas Informáticos y Computación, Universitat Politècnica de València (Spain); Badía, José M. [Depto. Ingeniería y Ciencia de Computadores, Universitat Jaume I, Castellón (Spain); Chacón, Pablo [Dept. Biological Chemical Physics, Rocasolano Physics and Chemistry Institute, CSIC, Madrid (Spain); Davidović, Davor [Rudjer Bošković Institute, Centar za Informatiku i Računarstvo – CIR, Zagreb (Croatia); López-Blanco, José R. [Dept. Biological Chemical Physics, Rocasolano Physics and Chemistry Institute, CSIC, Madrid (Spain); Quintana-Ortí, Enrique S. [Depto. Ingeniería y Ciencia de Computadores, Universitat Jaume I, Castellón (Spain)

    2016-03-15

    We introduce a new iterative Krylov subspace-based eigensolver for the simulation of macromolecular motions on desktop multithreaded platforms equipped with multicore processors and, possibly, a graphics accelerator (GPU). The method consists of two stages, with the original problem first reduced into a simpler band-structured form by means of a high-performance compute-intensive procedure. This is followed by a memory-intensive but low-cost Krylov iteration, which is off-loaded to be computed on the GPU by means of an efficient data-parallel kernel. The experimental results reveal the performance of the new eigensolver. Concretely, when applied to the simulation of macromolecules with a few thousand degrees of freedom, and when the number of eigenpairs to be computed is small to moderate, the new solver outperforms other methods implemented as part of high-performance numerical linear algebra packages for multithreaded architectures.

  8. Models of Communication for Multicore Processors

    DEFF Research Database (Denmark)

    Schoeberl, Martin; Sørensen, Rasmus Bo; Sparsø, Jens

    2015-01-01

    To efficiently use multicore processors we need to ensure that almost all data communication stays on chip, i.e., the bits moved between tasks executing on different processor cores do not leave the chip. Different forms of on-chip communication are supported by different hardware mechanisms, e.g., shared caches with cache coherency protocols, core-to-core networks-on-chip, and shared scratchpad memories. In this paper we explore the different hardware mechanisms for on-chip communication and how they support or favor different models of communication. Furthermore, we discuss the usability of the different models of communication for real-time systems.

  10. Multi-Core Processor Memory Contention Benchmark Analysis Case Study

    Science.gov (United States)

    Simon, Tyler; McGalliard, James

    2009-01-01

    Multi-core processors dominate current mainframe, server, and high performance computing (HPC) systems. This paper provides synthetic kernel and natural benchmark results from an HPC system at the NASA Goddard Space Flight Center that illustrate the performance impacts of multi-core (dual- and quad-core) vs. single core processor systems. Analysis of processor design, application source code, and synthetic and natural test results all indicate that multi-core processors can suffer from significant memory subsystem contention compared to similar single-core processors.

  11. A High Performance Multi-Core FPGA Implementation for 2D Pixel Clustering for the ATLAS Fast TracKer (FTK) Processor

    CERN Document Server

    Sotiropoulou, C-L; The ATLAS collaboration; Beretta, M; Gkaitatzis, S; Kordas, K; Nikolaidis, S; Petridou, C; Volpi, G

    2014-01-01

    The high-performance multi-core 2D pixel clustering FPGA implementation used for the input system of the ATLAS Fast TracKer (FTK) processor is presented. The input system for the FTK processor will receive data from the Pixel and micro-strip detectors' readout drivers (RODs) at 760 Gbps, the full rate of level-1 triggers. Clustering is required as a method to reduce the high rate of the received data before further processing, as well as to determine the cluster centroid for obtaining the best spatial measurement. Our implementation targets the pixel detectors and uses a 2D-clustering algorithm that takes advantage of a moving window technique to minimize the logic required for cluster identification. The design is fully generic, and the cluster detection window size can be adjusted to optimize the cluster identification process. The implementation can be parallelized by instantiating multiple cores to identify different clusters independently, thus exploiting more FPGA resources. This flexibility mak...
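
The essence of 2D pixel clustering, independent of the firmware's moving-window realization, is grouping adjacent hit pixels and computing each group's centroid. A software reference sketch (the FTK hardware implements a streaming, window-bounded variant of this idea):

```python
from collections import deque

def cluster_pixels(hits):
    """Group 8-connected pixel hits and return each cluster's centroid."""
    hits = set(hits)
    centroids = []
    while hits:
        queue = deque([hits.pop()])      # seed a new cluster from any hit
        cluster = []
        while queue:                     # flood-fill its connected neighbours
            x, y = queue.popleft()
            cluster.append((x, y))
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    n = (x + dx, y + dy)
                    if n in hits:
                        hits.remove(n)
                        queue.append(n)
        cx = sum(x for x, _ in cluster) / len(cluster)
        cy = sum(y for _, y in cluster) / len(cluster)
        centroids.append((cx, cy))
    return centroids
```

Because clusters are independent, multiple instances of this loop can run in parallel on disjoint data, mirroring the multi-core FPGA parallelization described above.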

  12. Message Passing on a Time-predictable Multicore Processor

    DEFF Research Database (Denmark)

    Sørensen, Rasmus Bo; Puffitsch, Wolfgang; Schoeberl, Martin

    2015-01-01

    Real-time systems need time-predictable computing platforms. For a multicore processor to be time-predictable, communication between processor cores needs to be time-predictable as well. This paper presents a time-predictable message-passing library for such a platform. We show how to build up...

  13. PERFORMANCE OF PRIVATE CACHE REPLACEMENT POLICIES FOR MULTICORE PROCESSORS

    Directory of Open Access Journals (Sweden)

    Matthew Lentz

    2014-07-01

    Full Text Available Multicore processors have become ubiquitous, both in general-purpose and special-purpose applications. With the number of transistors in a chip continuing to increase, the number of cores in a processor is also expected to increase. Cache replacement policy is an important design parameter of a cache hierarchy. As most processor designs have become multicore, there is a need to study cache replacement policies for multi-core systems. Previous studies have focused on the shared levels of the multicore cache hierarchy. In this study, we focus on the top level of the hierarchy, which bears the brunt of the memory requests emanating from each processor core. We measure the miss rates of various cache replacement policies, as the number of cores is steadily increased from 1 to 16. The study was done by modifying the publicly available SESC simulator, which models in detail a multicore processor with a multilevel cache hierarchy. Our experimental results show that for the private L1 caches, the LRU (Least Recently Used replacement policy outperforms all of the other replacement policies. This is in contrast to what was observed in previous studies for the shared L2 cache. The results presented in this paper are useful for hardware designers seeking to optimize their cache designs or program code.
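
The miss-rate comparison at the heart of such a study can be reproduced in miniature with a fully associative cache model. The trace below is made up for illustration; it shows LRU beating FIFO by keeping the recently reused block resident:

```python
from collections import OrderedDict, deque

def lru_misses(trace, capacity):
    """Miss count for a fully associative cache with LRU replacement."""
    cache, misses = OrderedDict(), 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)      # refresh recency on a hit
        else:
            misses += 1
            if len(cache) == capacity:
                cache.popitem(last=False) # evict the least recently used block
            cache[block] = None
    return misses

def fifo_misses(trace, capacity):
    """Miss count for the same cache with FIFO replacement."""
    cache, order, misses = set(), deque(), 0
    for block in trace:
        if block not in cache:
            misses += 1
            if len(cache) == capacity:
                cache.remove(order.popleft())  # evict the oldest insertion
            cache.add(block)
            order.append(block)
    return misses

TRACE = [1, 2, 3, 1, 4, 1, 2, 5, 1]   # block 1 is reused repeatedly
```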

  14. COTS Multicore Processors in Avionics Systems: Challenges and Solutions

    Science.gov (United States)

    2015-01-06

    COTS Multicore Processors in Avionics Systems: Challenges and Solutions. By Dionisio de Niz, Bjorn Andersson, and Lutz Wrage (dionisio@sei.cmu.edu). Platforms discussed include the NVIDIA Tegra K1 and, for avionics and defense, rugged Intel i7 single-board computers and the Freescale P4080 8-core CPU; shared hardware, in particular multicore memory, is a central concern.

  15. Concurrent and Accurate Short Read Mapping on Multicore Processors.

    Science.gov (United States)

    Martínez, Héctor; Tárraga, Joaquín; Medina, Ignacio; Barrachina, Sergio; Castillo, Maribel; Dopazo, Joaquín; Quintana-Ortí, Enrique S

    2015-01-01

    We introduce a parallel aligner with a work-flow organization for fast and accurate mapping of RNA sequences on servers equipped with multicore processors. Our software, HPG Aligner SA (an open-source application, available at http://www.opencb.org), exploits a suffix array to rapidly map a large fraction of the RNA fragments (reads), and leverages the accuracy of the Smith-Waterman algorithm to deal with conflictive reads. The aligner is enhanced with a careful strategy to detect splice junctions based on an adaptive division of RNA reads into small segments (or seeds), which are then mapped onto a number of candidate alignment locations, providing crucial information for the successful alignment of the complete reads. The experimental results on a platform with Intel multicore technology report the parallel performance of HPG Aligner SA on RNA reads of 100-400 nucleotides, which excels in execution time/sensitivity compared with state-of-the-art aligners such as TopHat 2+Bowtie 2, MapSplice, and STAR.
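
The Smith-Waterman step used for conflictive reads is a local-alignment dynamic program. A minimal score-only version (illustrative scoring parameters, no traceback, linear gap penalty):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between strings a and b."""
    score = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: a cell never goes below zero.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + s,   # extend a match/mismatch
                              score[i - 1][j] + gap,     # gap in b
                              score[i][j - 1] + gap)     # gap in a
            best = max(best, score[i][j])
    return best
```

The quadratic cost of this kernel is exactly why the aligner reserves it for the minority of reads the fast suffix-array path cannot place.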

  16. Speedup bioinformatics applications on multicore-based processor using vectorizing and multithreading strategies.

    Science.gov (United States)

    Chaichoompu, Kridsadakorn; Kittitornkun, Surin; Tongsima, Sissades

    2007-12-30

    Many computationally intensive bioinformatics software packages, such as multiple sequence alignment and population structure analysis, written in C/C++ are not multicore-aware. A multicore processor is an emerging CPU technology that combines two or more independent processors into a single package. The Single Instruction Multiple Data-stream (SIMD) paradigm is heavily utilized in this class of processors. Nevertheless, most popular compilers, including Microsoft Visual C/C++ 6.0 and the x86 GNU C compiler gcc, do not automatically create SIMD code that can fully utilize the advances of these processors. To harness the power of the new multicore architecture, certain compiler techniques must be considered. This paper presents a generic compiling strategy to assist the compiler in improving the performance of bioinformatics applications written in C/C++. The proposed framework contains two main steps: multithreading and vectorizing strategies. After following the strategies, an application can achieve higher speedup by taking advantage of the multicore architecture. Because the interconnection network among multiple cores is extremely fast, the proposed optimization may be more appropriate than parallelization on a small cluster computer, which has larger network latency and lower bandwidth.
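
The multithreading half of such a strategy reduces to splitting an array across worker threads and combining partial results. A Python sketch of that decomposition (in C/C++ each chunk would additionally be vectorized with SIMD; CPython threads illustrate the structure rather than the speedup):

```python
from concurrent.futures import ThreadPoolExecutor

def chunked_sum(data, workers=4):
    """Split data into per-thread chunks and reduce the partial sums."""
    n = len(data)
    bounds = [(i * n // workers, (i + 1) * n // workers) for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker reduces one contiguous chunk; the main thread combines them.
        partials = pool.map(lambda se: sum(data[se[0]:se[1]]), bounds)
    return sum(partials)
```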

  17. Multi Microkernel Operating Systems for Multi-Core Processors

    Directory of Open Access Journals (Sweden)

    Rami Matarneh

    2009-01-01

    Full Text Available Problem statement: Amid the huge development in the processor industry, manufacturers responding to the increasing demand for high-speed processors achieved the required speeds, but the industry then ran into problems with no easy solution, such as complexity, hard management and large energy consumption. These problems forced manufacturers to stop focusing on increasing the speed of processors and to turn toward parallel processing to increase performance. This eventually produced multi-core processors with high performance, if used properly. Unfortunately, these processors are still not used as they should be, because of a lack of support from operating systems and software applications. Approach: Based on the assumption that a single-kernel operating system is not enough to manage multi-core processors, we rethink the construction of the operating system as a multi-kernel design, in which one kernel serves as the master and the others serve as slaves. Results: Theoretically, the proposed model showed that it can do much better than existing models, because it supports single-threaded and multi-threaded processing at the same time; in addition, it can make better use of multi-core processors because it divides the load almost equally between the cores and the kernels, which leads to a significant improvement in the performance of the operating system. Conclusion: The software industry needs to step outside the classical framework to keep pace with hardware development. This can be achieved by rethinking how operating systems and software are built, with new and innovative methodologies, since current operating system theory is no longer capable of meeting future aspirations.

  18. Interactive high-resolution isosurface ray casting on multicore processors.

    Science.gov (United States)

    Wang, Qin; JaJa, Joseph

    2008-01-01

    We present a new method for the interactive rendering of isosurfaces using ray casting on multi-core processors. This method consists of a combination of an object-order traversal that coarsely identifies possible candidate 3D data blocks for each small set of contiguous pixels, and an isosurface ray casting strategy tailored for the resulting limited-size lists of candidate 3D data blocks. While static screen partitioning is widely used in the literature, our scheme performs dynamic allocation of groups of ray casting tasks to ensure almost equal loads among the different threads running on multi-cores while maintaining spatial locality. We also make careful use of the memory management environment commonly present in multi-core processors. We test our system on a two-processor Clovertown platform, each processor being a quad-core 1.86-GHz Intel Xeon, for a number of widely different benchmarks. The detailed experimental results show that our system is efficient and scalable, and achieves high cache performance and excellent load balancing, resulting in an overall performance that is superior to any of the previous algorithms. In fact, we achieve interactive isosurface rendering on a 1024 × 1024 screen for all the datasets tested, up to the maximum size of the main memory of our platform.

  19. Hardware Synchronization for Embedded Multi-Core Processors

    DEFF Research Database (Denmark)

    Stoif, Christian; Schoeberl, Martin; Liccardi, Benito

    2011-01-01

    Multi-core processors are about to conquer embedded systems — it is not the question of whether they are coming but how the architectures of the microcontrollers should look with respect to the strict requirements in the field. We present the step from one to multiple cores in this paper, establi...

  20. Effective Utilization of Multicore Processor for Unified Threat Management Functions

    Directory of Open Access Journals (Sweden)

    Radhakrishnan Shanmugasundaram

    2012-01-01

    Full Text Available Problem statement: Multicore and multithreaded CPUs have become the new approach to increasing the performance of processor-based systems. Numerous applications benefit from the use of multiple cores. Unified threat management is one such application, with multiple functions to be implemented at high speed. Increasing system performance by knowing the nature of the functionality, and effectively allocating multiple processors to each of the functions, warrants detailed experimentation. In this study, some of the functions of unified threat management are implemented using multiple processors for each function. Approach: The evaluation was conducted on a SunFire T1000 server with a Sun UltraSPARC T1 multicore processor. OpenMP parallelization methods are used to schedule the logical CPUs for the parallelized application. Results: The execution time of the implemented UTM functions was analyzed to arrive at an effective allocation and parallelization methodology that depends on the hardware and the workload. Conclusion/Recommendations: Based on the analysis, parallelization methods for the implemented UTM functions are suggested.
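
The allocation idea can be sketched as independent UTM stages handed to separate workers. The function names and packet format below are hypothetical stand-ins, and Python threads stand in for the OpenMP-scheduled logical CPUs:

```python
from concurrent.futures import ThreadPoolExecutor

def firewall_filter(packets):
    """Hypothetical stage: drop packets aimed at blocked ports."""
    return [p for p in packets if p["port"] not in {23, 445}]

def signature_scan(packets):
    """Hypothetical stage: flag payloads containing a known signature."""
    return [p for p in packets if b"evil" in p["payload"]]

def run_utm(packets):
    """Run the independent UTM functions concurrently, one worker per function,
    mirroring the idea of dedicating processors to functions."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        allowed = pool.submit(firewall_filter, packets)
        alerts = pool.submit(signature_scan, packets)
        return allowed.result(), alerts.result()
```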

  1. State-based Communication on Time-predictable Multicore Processors

    DEFF Research Database (Denmark)

    Sørensen, Rasmus Bo; Schoeberl, Martin; Sparsø, Jens

    2016-01-01

    Some real-time systems use a form of task-to-task communication called state-based or sample-based communication that does not impose any flow control among the communicating tasks. The concept is similar to a shared variable, where a reader may read the same value multiple times or may not read a given value at all. This paper explores time-predictable implementations of state-based communication in network-on-chip based multicore platforms through five algorithms. With the presented analysis of the implemented algorithms, the communicating tasks of one core can be scheduled independently of tasks on other cores. Assuming a specific time-predictable multicore processor, we evaluate how the read and write primitives of the five algorithms contribute to the worst-case execution time of the communicating tasks. Each of the five algorithms has specific capabilities that make them suitable...
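
A common way to realize such sample-based semantics is a double buffer: the writer fills the buffer the reader is not indexing and then flips a pointer, so neither side ever blocks. A CPython sketch (atomicity here leans on the GIL; this is not one of the paper's five algorithms, and a time-predictable implementation must reason explicitly about memory ordering and reader/writer overlap):

```python
class StateVariable:
    """Sample-based shared variable: a reader always gets the latest published
    sample; a value may be read twice or skipped entirely (no flow control)."""

    def __init__(self, initial):
        self._buffers = [initial, initial]
        self._latest = 0            # index of the newest complete sample

    def write(self, value):
        idx = 1 - self._latest      # fill the buffer the reader is not using
        self._buffers[idx] = value
        self._latest = idx          # publish with a single index update

    def read(self):
        return self._buffers[self._latest]
```

A reader polling `read()` between two writes sees the same value twice; if two writes happen between reads, the earlier value is skipped, which is exactly the shared-variable behaviour described above.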

  2. A fast location mechanism for memory data errors in multi-core processor verification

    Institute of Scientific and Technical Information of China (English)

    周宏伟; 邓让钰; 李永进; 晏小波; 窦强

    2012-01-01

    A fast fault location mechanism (FFLM) for memory data errors, targeting the functional verification of a self-designed CMP-16 multi-core processor's memory system, is proposed and implemented. FFLM builds a multi-port golden memory model on a hardware emulation accelerator. It monitors the memory-access packets exchanged between the memory system and the processor cores during emulation, compares in real time the data read from the memory system under test with the corresponding data in the golden model, and detects an error at the very moment wrong data is sent from the memory system to a processor core. Compared with traditional approaches, FFLM offers fast emulation speed, low hardware cost, and short error-location time. Statistics from emulating the self-designed CMP-16 multi-core processor show that FFLM speeds up the location of data errors in the memory system by 6.5 times on average.

  3. Real-Time Adaptive Lossless Hyperspectral Image Compression using CCSDS on Parallel GPGPU and Multicore Processor Systems

    Science.gov (United States)

    Hopson, Ben; Benkrid, Khaled; Keymeulen, Didier; Aranki, Nazeeh; Klimesh, Matt; Kiely, Aaron

    2012-01-01

    The proposed CCSDS (Consultative Committee for Space Data Systems) Lossless Hyperspectral Image Compression Algorithm was designed to facilitate a fast hardware implementation. This paper analyses that algorithm with regard to available parallelism and describes fast parallel implementations in software for GPGPU and Multicore CPU architectures. We show that careful software implementation, using hardware acceleration in the form of GPGPUs or even just multicore processors, can exceed the performance of existing hardware and software implementations by up to 11x and break the real-time barrier for the first time for a typical test application.

  5. Photonic-Networks-on-Chip for High Performance Radiation Survivable Multi-Core Processor Systems

    Science.gov (United States)

    2013-12-01

    Photonic-Networks-on-Chip for High Performance Radiation Survivable Multi-Core Processor Systems (report TR-14-7, contract DTRA01-03-D-0026; Prof. Luke Lester and Prof. Ganesh). Approved for public release; distribution is unlimited. The University of New Mexico has undertaken a study to determine the effects of radiation on Quantum Dot Photonic

  6. Energy Efficiency of a Multi-Core Processor by Tag Reduction

    Institute of Scientific and Technical Information of China (English)

    Long Zheng; Mian-Xiong Dong; Kaoru Ota; Hai Jin; Song Guo; Jun Ma

    2011-01-01

    We consider the energy saving problem for caches on a multi-core processor. In previous research on low-power processors, there are various methods to reduce power dissipation; tag reduction is one of them. This paper extends the tag reduction technique on a single-core processor to a multi-core processor and investigates the potential of energy saving for multi-core processors. We formulate our approach as an equivalent problem, which is to find an assignment of the whole set of instruction pages in the physical memory to a set of cores such that the tag-reduction conflicts for each core can be mostly avoided or reduced. We then propose three algorithms using different heuristics for this assignment problem. We provide convincing experimental results by collecting experimental data from a real operating system instead of the traditional way of using a processor simulator that cannot simulate operating system functions and the full memory hierarchy. Experimental results show that our proposed algorithms can save total energy up to 83.93% on an 8-core processor and 76.16% on a 4-core processor on average, compared to the case where tag reduction is not used. They also significantly outperform the tag-reduction-based algorithm on a single-core processor.
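
The flavour of this assignment problem can be shown with a toy greedy heuristic (a simplification invented here, not one of the paper's three algorithms): tag reduction pays off when pages sharing a reduced tag land on the same core, so each page goes to a core that already holds its tag, and otherwise to the core currently tracking the fewest tags.

```python
def assign_pages(page_tags, n_cores):
    """page_tags: the reduced tag of each instruction page, in layout order.
    Returns a per-page core assignment and the total tag count across cores
    (fewer distinct tags per core means fewer tag-reduction conflicts)."""
    core_tags = [set() for _ in range(n_cores)]
    assignment = []
    for tag in page_tags:
        owners = [i for i, tags in enumerate(core_tags) if tag in tags]
        if owners:
            core = owners[0]                  # reuse a core that holds this tag
        else:                                 # otherwise pick the least-loaded core
            core = min(range(n_cores), key=lambda i: len(core_tags[i]))
            core_tags[core].add(tag)
        assignment.append(core)
    return assignment, sum(len(tags) for tags in core_tags)
```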

  7. Fast, Massively Parallel Data Processors

    Science.gov (United States)

    Heaton, Robert A.; Blevins, Donald W.; Davis, ED

    1994-01-01

    Proposed fast, massively parallel data processor contains an 8 × 16 array of processing elements with an efficient interconnection scheme and options for flexible local control. Processing elements communicate with each other on an "X" interconnection grid and with external memory via a high-capacity input/output bus. This approach to conditional operation nearly doubles the speed of various arithmetic operations.

  8. LOW COMPLEXITY CONSTRAINTS FOR ENERGY AND PERFORMANCE MANAGEMENT OF HETEROGENEOUS MULTICORE PROCESSORS USING DYNAMIC OPTIMIZATION

    Directory of Open Access Journals (Sweden)

    A. S. Radhamani

    2014-01-01

    Full Text Available Optimization in a multicore processor environment is significant for real-world dynamic applications, as it is crucial to find and track changes effectively over time, which requires an optimization algorithm. On massively parallel multicore architectures, Constraint-based Bacterial Foraging Particle Swarm Optimization (CBFPSO) scheduling can, like other population-based metaheuristics, be implemented effectively. In this study we discuss possible approaches to parallelizing CBFPSO on a multicore system; different constraints for exploiting parallelism are explored and evaluated. Owing to their ability to keep a good balance between convergence and maintenance, such algorithms have attracted more and more attention in recent years for parallel-architecture optimization in real-world applications. To tackle the challenges of parallel-architecture optimization, several strategies have been proposed to enhance the performance of Particle Swarm Optimization (PSO), with success on various multicore parallel-architecture optimization problems; still, some issues in multicore architectures require careful analysis. In this study, a new Constraint-based Bacterial Foraging Particle Swarm Optimization (CBFPSO) scheduling for multicore architectures is proposed, which updates the velocity and position using two bacterial behaviours, i.e., reproduction and elimination-dispersal. The performance of CBFPSO is compared with simulation results for a GA, and the comparison shows that the proposed algorithm performs well on almost all types of cores relative to the GA with respect to completion time and energy consumption.
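
For reference, the canonical PSO velocity/position update that CBFPSO builds on (the bacterial reproduction and elimination-dispersal behaviours would replace or augment this step) can be written, for one-dimensional particles, as:

```python
import random

def pso_step(positions, velocities, pbest, gbest, w=0.7, c1=1.5, c2=1.5):
    """One PSO iteration: inertia plus attraction toward each particle's
    personal best (pbest) and the swarm's global best (gbest)."""
    for i in range(len(positions)):
        r1, r2 = random.random(), random.random()
        velocities[i] = (w * velocities[i]
                         + c1 * r1 * (pbest[i] - positions[i])
                         + c2 * r2 * (gbest - positions[i]))
        positions[i] += velocities[i]
    return positions, velocities
```

With a particle already sitting at both its personal and the global best, the attraction terms vanish and only inertia remains, so its velocity simply decays by the factor w.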

  9. Multi-core Processors based Network Intrusion Detection Method

    Directory of Open Access Journals (Sweden)

    Ziqian Wan

    2012-09-01

    It is becoming increasingly hard to build an intrusion detection system (IDS) because of higher traffic throughput and the rising sophistication of attacks. Scale will be an important issue to address in the intrusion detection area. For hardware, tomorrow's performance gains will come from multi-core architectures in which a number of CPUs execute concurrently. In this work we take advantage of multi-core processors' full power for intrusion detection. We present an intrusion detection system based on the Snort open-source IDS that exploits the computational power of a MIPS multi-core architecture to offload the costly pattern-matching operations from the CPU and thus increase the system's processing throughput. A preliminary experiment demonstrates the potential of this system; the results indicate that this method can effectively speed up intrusion detection systems.

  10. A lock circuit for a multi-core processor

    DEFF Research Database (Denmark)

    2015-01-01

    An integrated circuit comprising multiple processor cores and a lock circuit that comprises a queue register with respective bits set or reset via respective connections dedicated to respective processor cores, whereby the queue register identifies those among the multiple processor cores that...
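A software model of such a queue-register lock might look as follows. The class name, the method names, and the grant policy (lowest-numbered waiting core wins) are illustrative assumptions for the sketch, not details of the patented circuit.

```python
class QueueRegisterLock:
    """Toy software model of a hardware lock with a queue register:
    each core sets or resets its own bit via a dedicated connection,
    and the register identifies which cores are currently waiting."""

    def __init__(self, n_cores):
        self.n_cores = n_cores
        self.queue = 0  # bit i set => core i is requesting the lock

    def request(self, core):
        self.queue |= (1 << core)        # set this core's bit

    def release(self, core):
        self.queue &= ~(1 << core)       # clear this core's bit

    def waiting(self):
        """Cores whose request bit is set."""
        return [i for i in range(self.n_cores) if self.queue >> i & 1]

    def owner(self):
        """Assumed grant policy: lowest set bit wins; -1 if free."""
        return (self.queue & -self.queue).bit_length() - 1 if self.queue else -1

lock = QueueRegisterLock(4)
lock.request(2)
lock.request(0)
# owner() -> 0 (lowest waiting core), waiting() -> [0, 2]
lock.release(0)
# owner() -> 2
```

The point of the hardware design is that each core touches only its own bit, so no shared read-modify-write is needed; the model above mimics that by giving each core a dedicated bit position.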

  11. Scalable Parallelization of Skyline Computation for Multi-core Processors

    DEFF Research Database (Denmark)

    Chester, Sean; Sidlauskas, Darius; Assent, Ira

    2015-01-01

    The skyline is an important query operator for multi-criteria decision making. It reduces a dataset to only those points that offer optimal trade-offs of dimensions. In general, it is very expensive to compute. Recently, multi-core CPU algorithms have been proposed to accelerate the computation o...
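The dominance test at the heart of skyline computation is compact. A minimal sequential sketch (assuming smaller-is-better in every dimension; the paper's contribution is parallelising this dominance-testing work across cores, which is not shown here) is:

```python
def dominates(p, q):
    """p dominates q if p is at least as good in every dimension and
    strictly better in at least one (smaller is better here)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """Naive O(n^2) skyline: keep the points no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

pts = [(1, 4), (2, 2), (4, 1), (3, 3), (5, 5)]
# (3, 3) is dominated by (2, 2); (5, 5) is dominated by every other point
result = skyline(pts)   # -> [(1, 4), (2, 2), (4, 1)]
```

Each point's membership test is independent of the others, which is why the computation parallelises naturally over a multi-core CPU.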

  12. Adaptive data-driven parallelization of multi-view video coding on multi-core processor

    Institute of Scientific and Technical Information of China (English)

    PANG Yi; HU WeiDong; SUN LiFeng; YANG ShiQiang

    2009-01-01

    Multi-view video coding (MVC) comprises rich 3D information and is widely used in new visual media such as 3DTV and free-viewpoint TV (FTV). However, even with mainstream computer manufacturers migrating to multi-core processors, the huge computational requirement of MVC currently prohibits its wide use in consumer markets. In this paper, we demonstrate the design and implementation of the first parallel MVC system on the Cell Broadband Engine™ processor, a state-of-the-art multi-core processor. We propose a task-dispatching algorithm that is adaptive and data-driven at the frame level for MVC, and implement a parallel multi-view video decoder with a modified H.264/AVC codec on a real machine. This approach provides scalable speedup (up to 16 times on sixteen cores) through proper local store management, utilization of code locality and SIMD improvements. Decoding speed, speedup and utilization rate of cores are given in the experimental results.

  13. Mixed-mode implementation of PETSc for scalable linear algebra on multi-core processors

    CERN Document Server

    Weiland, Michele; Gorman, Gerard; Kramer, Stephan; Parsons, Mark; Southern, James

    2012-01-01

    With multi-core processors a ubiquitous building block of modern supercomputers, it is now past time to enable applications to embrace these developments in processor design. To achieve exascale performance, applications will need ways of exploiting the new levels of parallelism that are exposed in modern high-performance computers. A typical approach to this is to use shared-memory programming techniques to best exploit multi-core nodes along with inter-node message passing. In this paper, we describe the addition of OpenMP threaded functionality to the PETSc library. We highlight some issues that hinder good performance of threaded applications on modern processors and describe how to negate them. The OpenMP branch of PETSc was benchmarked using matrices extracted from the Fluidity CFD application, which uses the library as its linear solver engine. The overall performance of the mixed-mode implementation is shown to be superior to that of the pure-MPI version.

  14. Tinuso: A processor architecture for a multi-core hardware simulation platform

    DEFF Research Database (Denmark)

    Schleuniger, Pascal; Karlsson, Sven

    2010-01-01

    Multi-core systems have the potential to improve performance, energy and cost properties of embedded systems, but also require new design methods and tools to take advantage of the new architectures. Due to the limited accuracy and performance of pure software simulators, we are working on a cycle-accurate hardware simulation platform. We have developed the Tinuso processor architecture for this platform. Tinuso is a processor architecture optimized for FPGA implementation. The instruction set makes use of predicated instructions and supports C/C++ and assembly language programming. It is designed to be easily extendable, to maintain the flexibility required for research on multi-core systems. Tinuso contains a co-processor interface to connect to a network interface; this interface allows for communication over an on-chip network. A clock frequency estimation study on a deeply pipelined Tinuso...

  15. Support for the Logical Execution Time Model on a Time-predictable Multicore Processor

    DEFF Research Database (Denmark)

    Kluge, Florian; Schoeberl, Martin; Ungerer, Theo

    2016-01-01

    The logical execution time (LET) model increases the compositionality of real-time task sets: removal or addition of tasks does not influence the communication behavior of other tasks. In this work, we extend a multicore operating system running on a time-predictable multicore processor to support the LET model. For communication between tasks we use message passing on a time-predictable network-on-chip to avoid the bottleneck of shared memory. We report our experiences and present results on the costs in terms of memory and execution time.
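The LET rule itself is simple to state: a task logically reads its inputs at release time, and its output becomes visible to other tasks exactly at the end of its LET interval, regardless of when the task actually finishes. The sketch below is an illustrative simulation of that rule only (class and parameter names are invented; it is not the paper's OS code).

```python
class LetTask:
    """Minimal simulation of the logical-execution-time rule: inputs are
    sampled at release, and the output becomes visible exactly one
    period later, independent of actual completion time."""

    def __init__(self, period, func):
        self.period = period
        self.func = func
        self.pending = None   # (visible_at, value) awaiting publication
        self.output = None    # value other tasks are allowed to read

    def tick(self, t, inp):
        # publish a result whose LET interval has elapsed
        if self.pending and t >= self.pending[0]:
            self.output = self.pending[1]
            self.pending = None
        # at each release instant, sample the input and schedule the output
        if t % self.period == 0:
            self.pending = (t + self.period, self.func(inp))

task = LetTask(period=4, func=lambda x: x * 10)
trace = []
for t in range(9):
    task.tick(t, inp=t)
    trace.append(task.output)
# The job released at t=0 (input 0) becomes visible only at t=4;
# the job released at t=4 (input 4) becomes visible at t=8.
```

This fixed input/output timing is what makes task sets compositional: adding or removing other tasks cannot change when a task's values are read or published.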

  16. Early Student Support for Application of Advanced Multi-Core Processor Technologies to Oceanographic Research

    Science.gov (United States)

    2016-05-07

    [Fragmentary report extraction; recoverable text follows.] Grant number N00014-12-1-0298, "Early Student Support for Application of Advanced Multi-Core Processor Technologies to Oceanographic Research." Surviving passages discuss vehicle changes that can impact a deployment, survey the current state-of-the-art research in the areas described above, and describe reconfiguration based on resource availability, in which a device that is discovered to have failed or to be underperforming is removed from service.

  17. MGSim - simulation tools for multi-core processor architectures

    NARCIS (Netherlands)

    Lankamp, M.; Poss, R.; Yang, Q.; Fu, J.; Uddin, I.; Jesshope, C.R.

    2013-01-01

    MGSim is an open source discrete event simulator for on-chip hardware components, developed at the University of Amsterdam. It is intended to be a research and teaching vehicle to study the fine-grained hardware/software interactions on many-core and hardware multithreaded processors. It includes su

  18. Schedule refinement for homogeneous multi-core processors in the presence of manufacturing-caused heterogeneity

    Institute of Scientific and Technical Information of China (English)

    Zhi-xiang CHEN; Zhao-lin LI; Shan CAO; Fang WANG; Jie ZHOU

    2015-01-01

    Multi-core homogeneous processors have been widely used to deal with computation-intensive embedded applications. However, with the continuous downscaling of CMOS technology, within-die variations in the manufacturing process lead to a significant spread in the operating speeds of cores within homogeneous multi-core processors. Task scheduling approaches that do not consider such heterogeneity caused by within-die variations can lead to overly pessimistic results in terms of performance. To realize optimal performance according to the actual maximum clock frequencies at which cores can run, we present a heterogeneity-aware schedule refining (HASR) scheme that fully exploits the heterogeneities of homogeneous multi-core processors in embedded domains. We analyze and show how the actual maximum frequencies of cores are used to guide the scheduling. In the scheme, representative chip operating points are selected and the corresponding optimal schedules are generated as candidate schedules. During the booting of each chip, according to the actual maximum clock frequencies of the cores, one of the candidate schedules is bound to the chip to maximize performance. A set of applications was designed to evaluate the proposed scheme. Experimental results show that the proposed scheme improves performance by an average of 22.2% compared with the baseline schedule based on worst-case timing analysis. Compared with the conventional task scheduling approach based on the actual maximum clock frequencies, the proposed scheme also improves performance by up to 12%.
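The boot-time binding step can be pictured with a small model: a schedule's completion time is the slowest core's total work divided by that core's measured frequency, and the chip binds whichever candidate minimises it. The schedules, cycle counts, and frequencies below are invented for illustration; HASR's candidate generation is not reproduced.

```python
def makespan(assignment, task_cycles, core_freqs):
    """Completion time (s) of a static assignment {core: [task, ...]}
    when core c runs at its measured maximum frequency core_freqs[c]."""
    return max(sum(task_cycles[t] for t in tasks) / core_freqs[c]
               for c, tasks in assignment.items())

def bind_schedule(candidates, task_cycles, core_freqs):
    """At boot, bind the candidate schedule with the smallest makespan
    for this particular chip's actual core frequencies."""
    return min(candidates, key=lambda a: makespan(a, task_cycles, core_freqs))

cycles = {"t1": 8e6, "t2": 8e6, "t3": 4e6}          # cycles per task
cand_a = {0: ["t1", "t2"], 1: ["t3"]}               # loads core 0 heavily
cand_b = {0: ["t1"], 1: ["t2", "t3"]}               # loads core 1 heavily
fast0 = {0: 2.0e9, 1: 1.0e9}                        # within-die variation
# With core 0 the fast one, cand_a finishes in 8 ms versus 12 ms for cand_b,
# so a chip with these measured frequencies binds cand_a.
```

A chip whose variation went the other way (core 1 faster) would bind cand_b instead, which is the whole point of deferring the choice to boot time.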

  19. Dynamic Allocation of CPUs in Multicore Processor for Performance Improvement in Network Security Applications

    Directory of Open Access Journals (Sweden)

    Sudhakar Gummadi

    2011-01-01

    Problem statement: Multicore and multithreaded CPUs have become the new approach to increasing the performance of processor-based systems. Numerous applications benefit from the use of multiple cores. Increasing system performance by increasing the number of CPUs of the multicore processor for a given application warrants detailed experimentation. In this study, the results of experiments on dynamic allocation/deallocation of CPUs, based on workload conditions, for packet processing in a security application are analyzed and presented. Approach: The evaluation was conducted on a Sun Fire T1000 server with a Sun UltraSPARC T1 multicore processor. The OpenMP tasking feature is used for scheduling the logical CPUs for the parallelized application. Dynamic allocation of a CPU to a process is done depending on the workload characterization. Results: Execution time for packet processing was analyzed to arrive at an effective dynamic allocation methodology that is dependent on the hardware and the workload. Conclusion/Recommendations: Based on the analysis, a methodology for allocating the number of CPUs to the parallelized application is suggested.
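A workload-driven allocation rule of the kind the study experiments with can be sketched in a few lines: grow the number of CPUs given to packet processing when the queued work cannot be drained within a target latency, and shrink it otherwise. The function name, the queue model, and all numbers below are illustrative assumptions, not taken from the paper.

```python
import math

def workers_needed(backlog, service_rate, max_workers, target_latency=1.0):
    """Toy dynamic-allocation rule: number of CPUs needed to drain
    `backlog` queued packets within `target_latency` seconds, given
    that one CPU processes `service_rate` packets per second.
    Clamped to [1, max_workers]."""
    need = math.ceil(backlog / (service_rate * target_latency))
    return max(1, min(max_workers, need))

# e.g. 900 queued packets at 200 packets/s per CPU -> 5 CPUs (cap of 8)
cpus = workers_needed(900, 200, max_workers=8)
```

In an OpenMP-tasking implementation this value would steer how many worker tasks are kept active, with idle logical CPUs returned to the system.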

  20. Fast Forwarding with Network Processors

    OpenAIRE

    Lefèvre, Laurent; Lemoine, E.; Pham, C; Tourancheau, B.

    2003-01-01

    Forwarding is a mechanism found in many network operations. Although a regular workstation is able to perform forwarding operations, it still suffers from poor performance when compared to dedicated hardware machines. In this paper we study the possibility of using Network Processors (NPs) to improve the capability of regular workstations to forward data. We present a simple model and an experimental study demonstrating that even though NPs are less powerful than Host Processors (HPs) they ca...

  1. Real-time 3-D ultrasound scan conversion using a multicore processor.

    Science.gov (United States)

    Zhuang, Bo; Shamdasani, Vijay; Sikdar, Siddhartha; Managuli, Ravi; Kim, Yongmin

    2009-07-01

    Real-time 3-D ultrasound scan conversion (SC) in software has not been practical due to its high computation and I/O data handling requirements. In this paper, we describe software-based 3-D SC with high volume rates using a multicore processor, Cell. We have implemented both 3-D SC approaches: 1) the separable 3-D SC where two 2-D coordinate transformations in orthogonal planes are performed in sequence and 2) the direct 3-D SC where the coordinate transformation is directly handled in 3-D. One Cell processor can scan-convert a 192 x 192 x 192 16-bit volume at 87.8 volumes/s with the separable 3-D SC algorithm and 28 volumes/s with the direct 3-D SC algorithm.

  2. Parallel computing of discrete element method on multi-core processors

    Institute of Scientific and Technical Information of China (English)

    Yusuke Shigeto; Mikio Sakai

    2011-01-01

    This paper describes parallel simulation techniques for the discrete element method (DEM) on multi-core processors. Recently, multi-core CPU and GPU processors have attracted much attention for accelerating computer simulations in various fields. We propose a new algorithm for multi-thread parallel computation of DEM, which makes effective use of the available memory and accelerates the computation. This study shows that memory usage is drastically reduced by using this algorithm. To show the practical use of DEM in industry, a large-scale powder system is simulated with a complicated drive unit. We compared the performance of the simulation between the latest GPU and CPU processors with optimized programs for each processor. The results show that the difference in performance is not substantial when using either GPUs or CPUs with a multi-thread parallel algorithm. In addition, the DEM algorithm is shown to have high scalability in multi-thread parallel computation on a CPU.

  3. Energy Efficient Image/Video Data Transmission on Commercial Multi-Core Processors

    Directory of Open Access Journals (Sweden)

    Daihee Park

    2012-11-01

    In transmitting image/video data over Video Sensor Networks (VSNs), energy consumption must be minimized while maintaining high image/video quality. Although image/video compression is well known for its efficiency and usefulness in VSNs, the excessive costs associated with encoding computation and complexity still hinder its adoption for practical use. However, it is anticipated that high-performance handheld multi-core devices will be used as VSN processing nodes in the near future. In this paper, we propose a way to improve the energy efficiency of image and video compression with multi-core processors while maintaining image/video quality. We improve the compression efficiency at the algorithmic level, or derive the optimal parameters for a given machine-compression combination based on the trade-off between energy consumption and image/video quality. Based on experimental results, we confirm that the proposed approach can improve the energy efficiency of the straightforward approach by a factor of 2-5 without compromising image/video quality.

  4. Leakage-Aware Reallocation for Periodic Real-Time Tasks on Multicore Processors

    CERN Document Server

    Huang, Hongtao; Wang, Jijie; Lei, Siyu; Wu, Guowei

    2010-01-01

    It is an increasingly important issue to reduce the energy consumption of computing systems. In this paper, we consider partition based energy-aware scheduling of periodic real-time tasks on multicore processors. The scheduling exploits dynamic voltage scaling (DVS) and core sleep scheduling to reduce both dynamic and leakage energy consumption. If the overhead of core state switching is non-negligible, however, the performance of this scheduling strategy in terms of energy efficiency might degrade. To achieve further energy saving, we extend the static task scheduling with run-time task reallocation. The basic idea is to aggregate idle time among cores so that as many cores as possible could be put into sleep in a way that the overall energy consumption is reduced. Simulation results show that the proposed approach results in up to 20% energy saving over traditional leakage-aware DVS.
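The core reallocation idea, aggregating idle time onto as few cores as possible so the rest can sleep, can be illustrated with a bin-packing sketch. First-fit-decreasing is used here as an illustrative stand-in for the paper's run-time reallocation policy; the utilisation numbers are invented.

```python
def pack_tasks(utilizations, capacity=1.0):
    """Pack task utilisations onto as few unit-capacity cores as
    possible (first-fit decreasing), so the remaining cores can be put
    to sleep to cut leakage energy. Returns the number of active cores."""
    cores = []  # remaining capacity of each active core
    for u in sorted(utilizations, reverse=True):
        for i, free in enumerate(cores):
            if u <= free + 1e-9:      # fits on an already-active core
                cores[i] = free - u
                break
        else:
            cores.append(capacity - u)  # wake up one more core
    return len(cores)

# Eight tasks of 25% utilisation would leave eight cores mostly idle if
# spread out; packed, they fit on two cores and the rest can sleep.
active = pack_tasks([0.25] * 8)   # -> 2
```

In the paper's setting this packing must also weigh the core state-switching overhead, which a static count like this ignores; that is precisely why the authors make reallocation a run-time decision.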

  5. ESAIR: A Behavior-Based Robotic Software Architecture on Multi-Core Processor Platforms

    Directory of Open Access Journals (Sweden)

    Chin-Yuan Tseng

    2013-03-01

    This paper introduces an Embedded Software Architecture for Intelligent Robot systems (ESAIR) that addresses the issues of parallel thread execution on multi-core processor platforms. ESAIR provides a thread-scheduling interface that improves the execution performance of a robot system by assigning a dedicated core to a running thread on the fly and dynamically rescheduling the thread's priority. In the paper, we describe the object-oriented design and the control functions of ESAIR. The modular design of ESAIR helps improve software quality, reliability and scalability in research and real practice. We demonstrate the improvement by realizing ESAIR on an autonomous robot named AVATAR. AVATAR implements various human-robot interactions, including speech recognition, human following, face recognition, speaker identification, etc. With the support of ESAIR, AVATAR can integrate a comprehensive set of behaviors and peripherals with better resource utilization.

  6. Improving the performance of heterogeneous multi-core processors by modifying the cache coherence protocol

    Science.gov (United States)

    Fang, Juan; Hao, Xiaoting; Fan, Qingwen; Chang, Zeqing; Song, Shuying

    2017-05-01

    In a heterogeneous multi-core architecture, CPU and GPU processors are integrated on the same chip, which poses a new challenge for last-level cache (LLC) management. In this architecture, CPU and GPU applications execute concurrently and access the last-level cache. CPU and GPU have different memory-access characteristics and therefore differ in their sensitivity to LLC capacity. For many CPU applications, a reduced share of the LLC can lead to significant performance degradation; GPU applications, by contrast, can tolerate increased memory-access latency when there is sufficient thread-level parallelism. Taking the memory-latency tolerance of GPU programs into account, this paper presents a method that lets GPU applications access memory directly, leaving most of the LLC space for CPU applications; this improves the performance of CPU applications without affecting the performance of GPU applications. When the CPU application is cache-sensitive and the GPU application is insensitive to the cache, the overall performance of the system improves significantly.
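The routing decision the modified protocol makes can be stated as a tiny policy function. This is a toy restatement of the idea in the abstract, not the authors' protocol logic; the function and argument names are invented.

```python
def route_request(origin, cpu_cache_sensitive, gpu_latency_tolerant):
    """Toy cache-management policy: when the CPU workload is
    cache-sensitive and the GPU workload tolerates latency (thanks to
    thread-level parallelism), GPU requests bypass the shared LLC and
    go straight to memory, reserving LLC capacity for the CPU."""
    if origin == "gpu" and cpu_cache_sensitive and gpu_latency_tolerant:
        return "memory"   # bypass the LLC entirely
    return "llc"          # default: allocate in the shared LLC

# GPU traffic is diverted only when both conditions hold:
assert_route = route_request("gpu", cpu_cache_sensitive=True,
                             gpu_latency_tolerant=True)   # -> "memory"
```

Everything else, including CPU requests and GPU requests from latency-sensitive GPU workloads, still uses the LLC, so GPU performance is left intact.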

  7. MULTI-CORE AND OPTICAL PROCESSOR RELATED APPLICATIONS RESEARCH AT OAK RIDGE NATIONAL LABORATORY

    Energy Technology Data Exchange (ETDEWEB)

    Barhen, Jacob [ORNL; Kerekes, Ryan A [ORNL; ST Charles, Jesse Lee [ORNL; Buckner, Mark A [ORNL

    2008-01-01

    High-speed parallelization of common tasks holds great promise as a low-risk approach to achieving the significant increases in signal processing and computational performance required for next-generation innovations in reconfigurable radio systems. Researchers at the Oak Ridge National Laboratory have been exploiting the parallelization offered by this emerging technology and applying it to a variety of problems. This paper highlights recent experience with four different parallel processors applied to signal processing tasks directly relevant to SDR/CR waveforms. The first is the EnLight Optical Core Processor applied to matched filter (MF) correlation processing via fast Fourier transform (FFT) of broadband Doppler-sensitive waveforms (DSW) using active sonar arrays for target tracking. The second is the IBM CELL Broadband Engine applied to a 2-D discrete Fourier transform (DFT) kernel for image processing and frequency-domain processing. The third is the NVIDIA graphics processor applied to document feature clustering. EnLight Optical Core Processor: optical processing is inherently capable of high parallelism that can be translated to very high performance, low-power-dissipation computing. The EnLight 256 is a small form factor signal processing chip (5×5 cm²) with a digital optical core that is being developed by an Israeli startup company. As part of its evaluation of foreign technology, ORNL's Center for Engineering Science Advanced Research (CESAR) had access to precursor EnLight 64 Alpha hardware for a preliminary assessment of capabilities in terms of large Fourier transforms for matched filter banks and of applications related to Doppler-sensitive waveforms. This processor is optimized for array operations, which it performs in fixed-point arithmetic at a rate of 16 TeraOPS at 8-bit precision, approximately 1000 times faster than the fastest DSP available today. The optical core

  8. Computational Particle Dynamic Simulations on Multicore Processors (CPDMu) Final Report Phase I

    Energy Technology Data Exchange (ETDEWEB)

    Schmalz, Mark S

    2011-07-24

    Statement of Problem - Department of Energy has many legacy codes for simulation of computational particle dynamics and computational fluid dynamics applications that are designed to run on sequential processors and are not easily parallelized. Emerging high-performance computing architectures employ massively parallel multicore architectures (e.g., graphics processing units) to increase throughput. Parallelization of legacy simulation codes is a high priority, to achieve compatibility, efficiency, accuracy, and extensibility. General Statement of Solution - A legacy simulation application designed for implementation on mainly-sequential processors has been represented as a graph G. Mathematical transformations, applied to G, produce a graph representation G′ for a high-performance architecture. Key computational and data-movement kernels of the application were analyzed/optimized for parallel execution using the mapping G → G′, which can be performed semi-automatically. This approach is widely applicable to many types of high-performance computing systems, such as graphics processing units or clusters comprised of nodes that contain one or more such units. Phase I Accomplishments - Phase I research decomposed/profiled computational particle dynamics simulation code for rocket fuel combustion into low and high computational cost regions (respectively, mainly sequential and mainly parallel kernels), with analysis of space and time complexity. Using the research team's expertise in algorithm-to-architecture mappings, the high-cost kernels were transformed, parallelized, and implemented on Nvidia Fermi GPUs. Measured speedups (GPU with respect to single-core CPU) were approximately 20-32X for realistic model parameters, without final optimization. Error analysis showed no loss of computational accuracy. Commercial Applications and Other Benefits - The proposed research will constitute a breakthrough in solution of problems related to efficient

  9. The ATLAS fast tracker processor design

    CERN Document Server

    Volpi, Guido; Albicocco, Pietro; Alison, John; Ancu, Lucian Stefan; Anderson, James; Andari, Nansi; Andreani, Alessandro; Andreazza, Attilio; Annovi, Alberto; Antonelli, Mario; Asbah, Needa; Atkinson, Markus; Baines, J; Barberio, Elisabetta; Beccherle, Roberto; Beretta, Matteo; Biesuz, Nicolo Vladi; Blair, R E; Bogdan, Mircea; Boveia, Antonio; Britzger, Daniel; Bryant, Partick; Burghgrave, Blake; Calderini, Giovanni; Camplani, Alessandra; Cavaliere, Viviana; Cavasinni, Vincenzo; Chakraborty, Dhiman; Chang, Philip; Cheng, Yangyang; Citraro, Saverio; Citterio, Mauro; Crescioli, Francesco; Dawe, Noel; Dell'Orso, Mauro; Donati, Simone; Dondero, Paolo; Drake, G; Gadomski, Szymon; Gatta, Mauro; Gentsos, Christos; Giannetti, Paola; Gkaitatzis, Stamatios; Gramling, Johanna; Howarth, James William; Iizawa, Tomoya; Ilic, Nikolina; Jiang, Zihao; Kaji, Toshiaki; Kasten, Michael; Kawaguchi, Yoshimasa; Kim, Young Kee; Kimura, Naoki; Klimkovich, Tatsiana; Kolb, Mathis; Kordas, K; Krizka, Karol; Kubota, T; Lanza, Agostino; Li, Ho Ling; Liberali, Valentino; Lisovyi, Mykhailo; Liu, Lulu; Love, Jeremy; Luciano, Pierluigi; Luongo, Carmela; Magalotti, Daniel; Maznas, Ioannis; Meroni, Chiara; Mitani, Takashi; Nasimi, Hikmat; Negri, Andrea; Neroutsos, Panos; Neubauer, Mark; Nikolaidis, Spiridon; Okumura, Y; Pandini, Carlo; Petridou, Chariclia; Piendibene, Marco; Proudfoot, James; Rados, Petar Kevin; Roda, Chiara; Rossi, Enrico; Sakurai, Yuki; Sampsonidis, Dimitrios; Saxon, James; Schmitt, Stefan; Schoening, Andre; Shochet, Mel; Shoijaii, Jafar; Soltveit, Hans Kristian; Sotiropoulou, Calliope-Louisa; Stabile, Alberto; Swiatlowski, Maximilian J; Tang, Fukun; Taylor, Pierre Thor Elliot; Testa, Marianna; Tompkins, Lauren; Vercesi, V; Wang, Rui; Watari, Ryutaro; Zhang, Jianhong; Zeng, Jian Cong; Zou, Rui; Bertolucci, Federico

    2015-01-01

    The extended use of tracking information at the trigger level in the LHC is crucial for the trigger and data acquisition (TDAQ) system to fulfill its task. Precise and fast tracking is important to identify specific decay products of the Higgs boson or new phenomena, as well as to distinguish the contributions coming from the many collisions that occur at every bunch crossing. However, track reconstruction is among the most demanding tasks performed by the TDAQ computing farm; in fact, complete reconstruction at full Level-1 trigger accept rate (100 kHz) is not possible. In order to overcome this limitation, the ATLAS experiment is planning the installation of a dedicated processor, the Fast Tracker (FTK), which is aimed at achieving this goal. The FTK is a pipeline of high performance electronics, based on custom and commercial devices, which is expected to reconstruct, with high resolution, the trajectories of charged-particle tracks with a transverse momentum above 1 GeV, using the ATLAS inner tracker info...

  10. Investigation of hadron matter using lattice QCD and implementation of lattice QCD applications on heterogeneous multicore acceleration processors

    Energy Technology Data Exchange (ETDEWEB)

    Winter, Frank

    2011-07-01

    Observables relevant for the understanding of the structure of baryons were determined by means of Monte Carlo simulations of lattice Quantum Chromodynamics (QCD) using 2+1 dynamical quark flavours. Special emphasis was placed on how these observables change when flavour symmetry is broken, in comparison to choosing equal masses for the two light quarks and the strange quark. The first two moments of unpolarised, longitudinally polarised, and transversely polarised parton distribution functions were calculated for the nucleon and hyperons. Modern lattice QCD simulations require petaflop computing and beyond, a regime of computing power we are only just reaching today. Heterogeneous multicore computing is becoming increasingly important in high-performance computing and allows for deploying multiple types of processing elements within a single workflow. In this work, new design concepts were developed for an active library (QDP++) exploiting the compute power of a heterogeneous multicore processor (the IBM PowerXCell 8i). It was possible to run a QDP++ based physics application (Chroma) on an IBM BladeCenter QS22. (orig.)

  11. 3D Seismic Imaging through Reverse-Time Migration on Homogeneous and Heterogeneous Multi-Core Processors

    Directory of Open Access Journals (Sweden)

    Mauricio Araya-Polo

    2009-01-01

    Reverse-Time Migration (RTM) is a state-of-the-art technique in seismic acoustic imaging because of the quality and integrity of the images it provides. Oil and gas companies trust RTM with crucial decisions on multi-million-dollar drilling investments. But RTM requires vastly more computational power than its predecessor techniques, which has somewhat hindered its practical success. On the other hand, despite the promise of multi-core architectures to deliver unprecedented computational power, little attention has been devoted to mapping RTM efficiently onto multi-cores. In this paper, we present a mapping of the RTM computational kernel to the IBM Cell/B.E. processor that reaches close-to-optimal performance. The kernel proves to be memory-bound and achieves 98% utilization of the peak memory bandwidth. Our Cell/B.E. implementation outperforms a traditional processor (PowerPC 970MP) in terms of performance (a 15.0× speedup) and energy efficiency (a 10.0× increase in the GFlops/W delivered). It is also the fastest RTM implementation available, to the best of our knowledge. These results increase the practical usability of RTM, and the RTM-Cell/B.E. combination proves to be a strong competitor in the seismic arena.

  12. Analyzing the trade-off between multiple memory controllers and memory channels on multi-core processor performance

    Energy Technology Data Exchange (ETDEWEB)

    Sancho Pitarch, Jose Carlos [Los Alamos National Laboratory; Kerbyson, Darren [Los Alamos National Laboratory; Lang, Mike [Los Alamos National Laboratory

    2010-01-01

    Increasing the core count on current and future processors is posing critical challenges to the memory subsystem to efficiently handle concurrent memory requests. The current trend to cope with this challenge is to increase the number of memory channels available to the processor's memory controller. In this paper we investigate the effectiveness of this approach on the performance of parallel scientific applications. Specifically, we explore the trade-off between employing multiple memory channels per memory controller and using multiple memory controllers. Experiments conducted on two current state-of-the-art multicore processors, a 6-core AMD Istanbul and a 4-core Intel Nehalem-EP, for a wide range of production applications show that there is a diminishing return when increasing the number of memory channels per memory controller. In addition, we show that this performance degradation can be efficiently addressed by increasing the ratio of memory controllers to channels while keeping the number of memory channels constant. Significant performance improvements, up to 28%, can be achieved in this scheme in the case of two memory controllers, each with one channel, compared with one controller with two memory channels.

  13. Leveraging shared caches for parallel temporal blocking of stencil codes on multicore processors and clusters

    CERN Document Server

    Wittmann, Markus; Treibig, Jan; Wellein, Gerhard

    2010-01-01

    Bandwidth-starved multicore chips have become ubiquitous. It is well known that the performance of stencil codes can be improved by temporal blocking, lessening the pressure on the memory interface. We introduce a new pipelined approach that makes explicit use of shared caches in multicore environments and minimizes synchronization and boundary overhead. Benchmark results are presented for three current x86-based microprocessors, showing clearly that our optimization works best on designs with high-speed shared caches and low memory bandwidth per core. We furthermore demonstrate that simple bandwidth-based performance models are inaccurate for this kind of algorithm and employ a more elaborate, synthetic modeling procedure. Finally we show that temporal blocking can be employed successfully in a hybrid shared/distributed-memory environment, albeit with limited benefit at strong scaling.
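The idea behind temporal blocking is that several time steps are applied to a cache-resident tile before moving on, instead of streaming the whole array through memory once per step. The serial 1-D sketch below shows only the correctness structure (a halo of depth k on each side of a tile lets all k steps be applied locally); the paper's contribution, pipelining tiles across cores that share a cache, is not reproduced, and the block and halo sizes are illustrative.

```python
def sweep(a):
    """One Jacobi-style 3-point averaging sweep with fixed boundaries."""
    return ([a[0]]
            + [(a[i - 1] + a[i] + a[i + 1]) / 3 for i in range(1, len(a) - 1)]
            + [a[-1]])

def naive(a, k):
    """Reference: k full sweeps, each streaming the whole array."""
    for _ in range(k):
        a = sweep(a)
    return a

def temporally_blocked(a, k, block=4):
    """Apply k sweeps tile by tile. Each tile carries a halo of k cells
    on either side, so all k time steps can be computed from the tile
    alone; only the central `block` cells are then committed."""
    n = len(a)
    out = list(a)
    for start in range(0, n, block):
        lo, hi = max(0, start - k), min(n, start + block + k)
        tile = a[lo:hi]
        for _ in range(k):
            tile = sweep(tile)
        out[start:start + block] = tile[start - lo:start - lo + block]
    return out
```

The blocked version touches main memory roughly once instead of k times, which is exactly the pressure relief the abstract describes for bandwidth-starved chips; the halo cells are the redundant work traded for that locality.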

  14. A Parallel and Concurrent Implementation of Lin-Kernighan Heuristic (LKH-2 for Solving Traveling Salesman Problem for Multi-Core Processors using SPC3 Programming Model

    Directory of Open Access Journals (Sweden)

    Muhammad Ali Ismail

    2011-08-01

    With the arrival of multi-cores, every processor now has built-in parallel computational power, which can be fully utilized only if the program in execution is written accordingly. This study is part of on-going research on the design of a new parallel programming model for multi-core processors. In this paper we present a combined parallel and concurrent implementation of the Lin-Kernighan heuristic (LKH-2) for solving the Travelling Salesman Problem (TSP) using a newly developed parallel programming model, SPC3 PM, for general-purpose multi-core processors. This implementation is found to be simple, highly efficient, scalable and less time-consuming compared to existing serial LKH-2 implementations in a multi-core processing environment. We have tested our parallel implementation of LKH-2 with medium and large TSP instances from TSPLIB, and for all these tests our proposed approach shows much improved performance and scalability.

  15. Design of Clustered NoC for Multi-core Processor

    Institute of Scientific and Technical Information of China (English)

    尤凯迪; 肖瑞瑾; 权衡; 虞志益

    2011-01-01

    This paper presents a novel clustered Network-on-Chip (NoC) architecture for multi-core processors. The proposed architecture is a cluster array organized as a two-dimensional mesh. Each cluster includes three processors, one Direct Memory Access (DMA) unit and one cluster-shared memory. A multi-core processor using this NoC architecture can obtain high communication efficiency and memory utilization. We design a four-cluster prototype system and implement a 3 780-point Fast Fourier Transform (FFT) on it. In the FFT application, memory utilization increases to 79.5%.

  16. Fast 2D-DCT implementations for VLIW processors

    OpenAIRE

    Sohm, OP; Canagarajah, CN; Bull, DR

    1999-01-01

    This paper analyzes various fast 2D-DCT algorithms regarding their suitability for VLIW processors. Operations for truncation or rounding, which are usually neglected in proposals for fast algorithms, have also been taken into consideration. Loeffler's algorithm with parallel multiplications was found to be most suitable due to its parallel structure.
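
The fast 2D-DCT schemes compared in such studies all build on the separability of the transform: a 2D DCT is a 1D DCT applied to every row followed by a 1D DCT applied to every column (Loeffler's factorization then reduces the cost of each 1D pass). A minimal sketch, using the plain DCT-II double sum rather than any fast factorization and ignoring fixed-point rounding:

```python
import math

def dct1d(x):
    """Direct (unnormalized) DCT-II of a 1D sequence."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi / N * (n + 0.5) * k) for n in range(N))
            for k in range(N)]

def dct2d_separable(a):
    """Row-column 2D DCT: 1D DCT on every row, then on every column."""
    rows = [dct1d(r) for r in a]
    cols = [dct1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]     # transpose back to row-major

def dct2d_direct(a):
    """Direct double-sum 2D DCT-II, kept only as a reference check."""
    M, N = len(a), len(a[0])
    return [[sum(a[m][n]
                 * math.cos(math.pi / M * (m + 0.5) * p)
                 * math.cos(math.pi / N * (n + 0.5) * q)
                 for m in range(M) for n in range(N))
             for q in range(N)] for p in range(M)]
```

The separable form turns an O(M²N²) computation into 2MN one-dimensional transforms, which is the structure a VLIW core can then schedule in parallel.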

  17. Study of Various Factors Affecting Performance of Multi-Core Processors

    Directory of Open Access Journals (Sweden)

    Nitin Chaturvedi

    2013-07-01

    Full Text Available Advances in integrated circuit processing allow for more microprocessor design options. As Chip Multiprocessor (CMP) systems become the predominant topology for leading microprocessors, critical components of the system are now integrated on a single chip. This enables sharing of computation resources that was not previously possible. In addition, the virtualization of these computation resources exposes the system to a mix of diverse and competing workloads. On-chip cache memory is a resource of primary concern, as it can be dominant in controlling overall throughput. This paper presents an analysis of various parameters affecting the performance of multi-core architectures, such as varying the number of cores and the L2 cache size; further, we have varied the directory size from 64 to 2048 entries on 4-node, 8-node, 16-node and 64-node chip multiprocessors. This presents an open area of research on multi-core processors with private/shared last-level caches, as the future trend seems to be towards tiled architectures executing multiple parallel applications with optimized silicon area utilization and excellent performance.

  18. Design and Implementation of BFD Protocol Based on Multi-core Processor

    Institute of Scientific and Technical Information of China (English)

    邓嘉; 吉萌; 雷升平

    2016-01-01

    BFD is a bidirectional forwarding fast-detection mechanism. To solve the problem of slow link-failure detection in software BFD implementations, this paper presents and implements the BFD protocol on a multi-core processor platform, with all packet sending and processing handled by the bottom-level driver. The upper layer only issues configuration commands to the driver and receives its notifications. A session is found in the session table by hash value and then matched on the relevant fields. Experiments show that the link-detection response time reaches about 20 ms, which satisfies the reliability requirements of high-performance network equipment.
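
The session-demultiplexing step described above amounts to a hash lookup keyed on a discriminator carried in the packet (in BFD per RFC 5880, the receiver demultiplexes on the Your Discriminator field), followed by a match on the remaining session fields. A minimal sketch; the class and field names are illustrative, not the paper's driver implementation:

```python
class BfdSession:
    """One BFD session; `local_disc` is our locally assigned discriminator,
    which the peer echoes back as Your Discriminator."""
    def __init__(self, local_disc, peer_ip, local_ip):
        self.local_disc = local_disc
        self.peer_ip = peer_ip
        self.local_ip = local_ip

class SessionTable:
    """Hash-based session table: O(1) average lookup on the discriminator."""
    def __init__(self):
        self._by_disc = {}

    def add(self, sess):
        self._by_disc[sess.local_disc] = sess

    def dispatch(self, your_disc, src_ip, dst_ip):
        """Find the session for a received packet, then verify its fields."""
        sess = self._by_disc.get(your_disc)
        if sess is None:
            return None
        if sess.peer_ip != src_ip or sess.local_ip != dst_ip:
            return None    # discriminator matched but the session fields do not
        return sess
```

Keeping this lookup in the driver's fast path is what lets detection run at millisecond timescales without involving the upper protocol layer.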

  19. The fast tracker processor for hadron collider triggers

    CERN Document Server

    Annovi, A; Bardi, A; Carosi, R; Dell'Orso, Mauro; D'Onofrio, M; Giannetti, P; Iannaccone, G; Morsani, E; Pietri, M; Varotto, G

    2001-01-01

    Perspectives for precise and fast track reconstruction in future hadron collider experiments are addressed. We discuss the feasibility of a pipelined highly parallel processor dedicated to the implementation of a very fast tracking algorithm. The algorithm is based on the use of a large bank of pre-stored combinations of trajectory points, called patterns, for extremely complex tracking systems. The CMS experiment at LHC is used as a benchmark. Tracking data from the events selected by the level-1 trigger are sorted and filtered by the Fast Tracker processor at an input rate of 100 kHz. This data organization allows the level-2 trigger logic to reconstruct full resolution tracks with transverse momentum above a few GeV and search for secondary vertices within typical level-2 times. (15 refs).

  20. FastFlow: Efficient Parallel Streaming Applications on Multi-core

    CERN Document Server

    Aldinucci, Marco; Meneghin, Massimiliano

    2009-01-01

    Shared memory multiprocessors have come back to popularity thanks to the rapid spread of commodity multi-core architectures. As ever, shared memory programs are fairly easy to write and quite hard to optimise; providing multi-core programmers with optimising tools and programming frameworks is nowadays a challenge. Few efforts have been made to support effective streaming applications on these architectures. In this paper we introduce FastFlow, a low-level programming framework based on lock-free queues explicitly designed to support high-level languages for streaming applications. We compare FastFlow with state-of-the-art programming frameworks such as Cilk, OpenMP, and Intel TBB. We experimentally demonstrate that FastFlow is always more efficient than all of them in a set of micro-benchmarks and on a real-world application; the speedup edge of FastFlow over other solutions can be substantial for fine-grain tasks, for example +35% over OpenMP, +226% over Cilk, +96% over TBB for the alignment of protein P01111 against UniP...
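
FastFlow's basic building block is a lock-free single-producer/single-consumer (SPSC) queue: a bounded ring buffer in which the producer writes only the tail index and the consumer writes only the head index, so neither needs a lock. The sketch below illustrates that index discipline in sequential Python (actual lock-freedom requires the memory-ordering guarantees of the C++ original, which Python cannot express):

```python
class SpscRing:
    """Bounded SPSC ring buffer. The producer touches only `tail`, the
    consumer only `head`; one slot is kept empty so that `head == tail`
    unambiguously means 'empty' and `tail + 1 == head` means 'full'."""
    def __init__(self, capacity):
        self.buf = [None] * (capacity + 1)
        self.head = 0    # next slot to pop  (consumer-owned)
        self.tail = 0    # next slot to push (producer-owned)

    def push(self, item):
        nxt = (self.tail + 1) % len(self.buf)
        if nxt == self.head:          # queue full: fail rather than block
            return False
        self.buf[self.tail] = item
        self.tail = nxt               # publish only after the slot is written
        return True

    def pop(self):
        if self.head == self.tail:    # queue empty
            return None
        item = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        return item
```

Because each index has exactly one writer, a C++ version needs only an ordered store to publish, which is what makes this structure cheap enough for fine-grain streaming tasks.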

  1. Hardware design and implementation of fast DOA estimation method based on multicore DSP

    Science.gov (United States)

    Guo, Rui; Zhao, Yingxiao; Zhang, Yue; Lin, Qianqiang; Chen, Zengping

    2016-10-01

    In this paper, we present a high-speed real-time signal processing hardware platform based on a multicore digital signal processor (DSP). The platform shows several excellent characteristics, including high-performance computing, low power consumption, large-capacity data storage and high-speed data transmission, which enable it to meet the constraints of real-time direction-of-arrival (DOA) estimation. To reduce the high computational complexity of the DOA estimation algorithm, a novel real-valued MUSIC estimator is used. The algorithm is decomposed into several independent steps and the time consumption of each step is counted. Based on these statistics, we present a new parallel processing strategy that distributes the task of DOA estimation across the cores of the platform. Experimental results demonstrate that the processing capability of the platform meets the constraints of real-time DOA estimation.

  2. Overview on Power Optimization and Evaluation Technology of Multi-core Processors

    Institute of Scientific and Technical Information of China (English)

    邢立冬

    2014-01-01

    Multi-core processors have become the mainstream of current processor design, and their parallel processing capability significantly improves processor performance. At the same time, the high integration density of multi-core processors also makes their power consumption rise significantly, which to some extent limits multi-core processor development. This paper describes the basic theory of low-power design, commonly used low-power design techniques and power estimation techniques for multi-core processors, and summarizes the latest developments in low-power multi-core processor research. This article can be a useful reference for the design of multi-core processors.

  3. Program Performance Optimization Methods for Multi-core Processors

    Institute of Scientific and Technical Information of China (English)

    昌杰

    2012-01-01

    A multi-core processor integrates multiple processor cores on a single chip and can execute multiple threads simultaneously. Although the clock frequency of each core no longer increases, the parallel processing provided by multiple cores yields far more computing power than a single core, while also greatly increasing CPU design complexity. Based on modern processor design techniques and the implementation process of a program, this paper investigates several methods to optimize program performance, thereby improving programming quality and execution efficiency.

  4. Recovery Act: Integrated DC-DC Conversion for Energy-Efficient Multicore Processors

    Energy Technology Data Exchange (ETDEWEB)

    Shepard, Kenneth L

    2013-03-31

    devices. These new approaches to scaled voltage regulation for computing devices also promise significant impact on electricity consumption in the United States and abroad by improving the efficiency of all computational platforms. In 2006, servers and datacenters in the United States consumed an estimated 61 billion kWh or about 1.5% of the nation's total energy consumption. Federal Government servers and data centers alone accounted for about 10 billion kWh, for a total annual energy cost of about $450 million. Based upon market growth and efficiency trends, estimates place current server and datacenter power consumption at nearly 85 billion kWh in the US and at almost 280 billion kWh worldwide. Similar estimates place national desktop, mobile and portable computing at 80 billion kWh combined. While national electricity utilization for computation amounts to only 4% of current usage, it is growing at a rate of about 10% a year with volume servers representing one of the largest growth segments due to the increasing utilization of cloud-based services. The percentage of power that is consumed by the processor in a server varies but can be as much as 30% of the total power utilization, with an additional 50% associated with heat removal. The approaches considered here should allow energy efficiency gains as high as 30% in processors for all computing platforms, from high-end servers to smart phones, resulting in a direct annual energy savings of almost 15 billion kWh nationally, and 50 billion kWh globally. The work developed here is being commercialized by the start-up venture, Ferric Semiconductor, which has already secured two Phase I SBIR grants to bring these technologies to the marketplace.

  5. Fundamentals of multicore software development

    CERN Document Server

    Pankratius, Victor; Tichy, Walter F

    2011-01-01

    With multicore processors now in every computer, server, and embedded device, the need for cost-effective, reliable parallel software has never been greater. By explaining key aspects of multicore programming, Fundamentals of Multicore Software Development helps software engineers understand parallel programming and master the multicore challenge. Accessible to newcomers to the field, the book captures the state of the art of multicore programming in computer science. It covers the fundamentals of multicore hardware, parallel design patterns, and parallel programming in C++, .NET, and Java. It

  6. Fast Fourier Transform Co-Processor (FFTC)- Towards Embedded GFLOPs

    Science.gov (United States)

    Kuehl, Christopher; Liebstueckel, Uwe; Tejerina, Isaac; Uemminghaus, Michael; Wite, Felix; Kolb, Michael; Suess, Martin; Weigand, Roland

    2012-08-01

    Many signal processing applications and algorithms perform their operations on the data in the transform domain to gain efficiency. The Fourier Transform Co-Processor has been developed with the aim to offload general-purpose processors from performing these transformations and therefore to boost the overall performance of a processing module. The IP of the commercial PowerFFT processor has been selected and adapted to meet the constraints of the space environment. In the frame of the ESA activity “Fast Fourier Transform DSP Co-processor (FFTC)” (ESTEC/Contract No. 15314/07/NL/LvH/ma) the objectives were the following: production of prototypes of a space-qualified version of the commercial PowerFFT chip, called FFTC, based on the PowerFFT IP; and the development of a stand-alone FFTC Accelerator Board (FTAB) based on the FFTC, including the Controller FPGA and SpaceWire interfaces, to verify the FFTC function and performance. The FFTC chip performs its calculations with floating-point precision. Stand-alone, it is capable of computing FFTs of up to 1K complex samples in length in only 10 μs. This corresponds to an equivalent processing performance of 4.7 GFlops. In this mode the maximum sustained data throughput reaches 6.4 Gbit/s. When connected to up to 4 EDAC-protected SDRAM memory banks, the FFTC can perform long FFTs with up to 1M complex samples in length or multidimensional FFT-based processing tasks. A Controller FPGA on the FTAB takes care of the SDRAM addressing. The instructions commanded via the Controller FPGA are used to set up the data flow and generate the memory addresses. The presentation will give an overview of the project, including the results of the validation of the FFTC ASIC prototypes.

  7. Fast Fourier Transform Co-processor (FFTC), towards embedded GFLOPs

    Science.gov (United States)

    Kuehl, Christopher; Liebstueckel, Uwe; Tejerina, Isaac; Uemminghaus, Michael; Witte, Felix; Kolb, Michael; Suess, Martin; Weigand, Roland; Kopp, Nicholas

    2012-10-01

    Many signal processing applications and algorithms perform their operations on the data in the transform domain to gain efficiency. The Fourier Transform Co-Processor has been developed with the aim to offload general-purpose processors from performing these transformations and therefore to boost the overall performance of a processing module. The IP of the commercial PowerFFT processor has been selected and adapted to meet the constraints of the space environment. In the frame of the ESA activity "Fast Fourier Transform DSP Co-processor (FFTC)" (ESTEC/Contract No. 15314/07/NL/LvH/ma) the objectives were the following: • Production of prototypes of a space-qualified version of the commercial PowerFFT chip called FFTC based on the PowerFFT IP. • The development of a stand-alone FFTC Accelerator Board (FTAB) based on the FFTC including the Controller FPGA and SpaceWire Interfaces to verify the FFTC function and performance. The FFTC chip performs its calculations with floating-point precision. Stand-alone, it is capable of computing FFTs of up to 1K complex samples in length in only 10 μs. This corresponds to an equivalent processing performance of 4.7 GFlops. In this mode the maximum sustained data throughput reaches 6.4 Gbit/s. When connected to up to 4 EDAC-protected SDRAM memory banks the FFTC can perform long FFTs with up to 1M complex samples in length or multidimensional FFT-based processing tasks. A Controller FPGA on the FTAB takes care of the SDRAM addressing. The instructions commanded via the Controller FPGA are used to set up the data flow and generate the memory addresses. The paper will give an overview of the project, including the results of the validation of the FFTC ASIC prototypes.

  8. A Fast CT Reconstruction Scheme for a General Multi-Core PC

    Directory of Open Access Journals (Sweden)

    Kai Zeng

    2007-01-01

    Full Text Available Expensive computational cost is a severe limitation in CT reconstruction for clinical applications that need real-time feedback. A primary example is bolus-chasing computed tomography (CT) angiography (BCA) that we have been developing for the past several years. To accelerate the reconstruction process using the filtered backprojection (FBP) method, specialized hardware or graphics cards can be used. However, specialized hardware is expensive and not flexible. The graphics processing unit (GPU) in a current graphics card can only reconstruct images with reduced precision and is not easy to program. In this paper, an acceleration scheme is proposed based on a multi-core PC. In the proposed scheme, several techniques are integrated, including utilization of geometric symmetry, optimization of data structures, single-instruction multiple-data (SIMD) processing, multithreaded computation, and an Intel C++ compiler. Our scheme maintains the original precision and involves no data exchange between the GPU and CPU. The merits of our scheme are demonstrated in numerical experiments against the traditional implementation. Our scheme achieves a speedup of about 40, which can be further improved by several folds using the latest quad-core processors.
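
The multithreaded part of such a scheme amounts to partitioning the backprojection sum over projection angles: each core accumulates a partial image from its own subset of angles, and the partial images are added at the end, so no synchronization is needed inside the loops. A toy sketch with a nearest-neighbour parallel-beam backprojector; the geometric-symmetry and SIMD optimizations of the paper are omitted, and all names are illustrative:

```python
import math
from concurrent.futures import ThreadPoolExecutor

def backproject(proj, angles, size):
    """Accumulate a size x size image from the given subset of angle indices.
    proj[a][d] is the sampled value at angle index a, detector bin d."""
    half, ndet, na = size // 2, len(proj[0]), len(proj)
    img = [[0.0] * size for _ in range(size)]
    for a in angles:
        c, s = math.cos(a * math.pi / na), math.sin(a * math.pi / na)
        for y in range(size):
            for x in range(size):
                d = int(round((x - half) * c + (y - half) * s)) + ndet // 2
                if 0 <= d < ndet:
                    img[y][x] += proj[a][d]
    return img

def add_images(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def backproject_parallel(proj, size, workers=4):
    """Split the angle range across workers, then sum the partial images."""
    n = len(proj)
    chunks = [range(i, n, workers) for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as ex:
        partials = list(ex.map(lambda ch: backproject(proj, ch, size), chunks))
    img = partials[0]
    for p in partials[1:]:
        img = add_images(img, p)
    return img
```

Because the per-angle contributions are independent, the decomposition is embarrassingly parallel; only the final reduction touches shared data.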

  9. The Serial Link Processor for the Fast TracKer (FTK) processor at ATLAS

    CERN Document Server

    Biesuz, Nicolo Vladi; The ATLAS collaboration; Luciano, Pierluigi; Magalotti, Daniel; Rossi, Enrico

    2015-01-01

    The Associative Memory (AM) system of the Fast Tracker (FTK) processor has been designed to perform pattern matching using the hit information of the ATLAS experiment silicon tracker. The AM is the heart of FTK and is mainly based on the use of ASICs (AM chips) designed to execute pattern matching with a high degree of parallelism. The AM system finds track candidates at low resolution that are seeds for a full resolution track fitting. To solve the very challenging data traffic problems inside FTK, multiple board and chip designs have been performed. The currently proposed solution is named the “Serial Link Processor” and is based on an extremely powerful network of 828 2 Gbit/s serial links for a total in/out bandwidth of 56 Gb/s. This paper reports on the design of the Serial Link Processor consisting of two types of boards, the Local Associative Memory Board (LAMB), a mezzanine where the AM chips are mounted, and the Associative Memory Board (AMB), a 9U VME board which holds and exercises four LAMBs. ...

  10. The Serial Link Processor for the Fast TracKer (FTK) processor at ATLAS

    CERN Document Server

    Biesuz, Nicolo Vladi; The ATLAS collaboration; Luciano, Pierluigi; Magalotti, Daniel; Rossi, Enrico

    2015-01-01

    The Associative Memory (AM) system of the Fast Tracker (FTK) processor has been designed to perform pattern matching using the hit information of the ATLAS experiment silicon tracker. The AM is the heart of FTK and is mainly based on the use of ASICs (AM chips) designed on purpose to execute pattern matching with a high degree of parallelism. It finds track candidates at low resolution that are seeds for a full resolution track fitting. To solve the very challenging data traffic problems inside FTK, multiple board and chip designs have been performed. The currently proposed solution is named the “Serial Link Processor” and is based on an extremely powerful network of 2 Gb/s serial links. This paper reports on the design of the Serial Link Processor consisting of two types of boards, the Local Associative Memory Board (LAMB), a mezzanine where the AM chips are mounted, and the Associative Memory Board (AMB), a 9U VME board which holds and exercises four LAMBs. We report on the performance of the intermedia...

  11. Micro mirrors based coupling of light to multi-core fiber realizing in-fiber photonic neural network processor

    Science.gov (United States)

    Cohen, Eyal; Malka, Dror; Shemer, Amir; Shahmoon, Asaf; London, Michael; Zalevsky, Zeev

    2017-02-01

    Hardware implementation of artificial neural networks facilitates real-time parallel processing of massive data sets. Optical neural networks offer low-volume 3D connectivity together with large bandwidth and minimal heat production, in contrast to electronic implementations. Here, we present a DMD-based approach to realize energetically efficient light coupling into a multi-core fiber, yielding a unique design for in-fiber optical neural networks. Neurons and synapses are realized as individual cores in a multi-core fiber. Optical signals are transferred transversely between cores by means of optical coupling. Pump-driven amplification in Erbium-doped cores mimics synaptic interactions. In order to dynamically and efficiently couple light into the multi-core fiber, a DMD-based micro mirror device is used to perform the proper beam shaping operation. The beam shaping reshapes the light into a large set of points in space matching the positions of the required cores in the entrance plane of the multi-core fiber.

  12. Fast multi-core based multimodal registration of 2D cross-sections and 3D datasets

    Directory of Open Access Journals (Sweden)

    Pielot Rainer

    2010-01-01

    Full Text Available Abstract Background Solving bioinformatics tasks often requires extensive computational power. Recent trends in processor architecture combine multiple cores into a single chip to improve overall performance. The Cell Broadband Engine (CBE), a heterogeneous multi-core processor, provides power-efficient and cost-effective high-performance computing. One application area is image analysis and visualisation, in particular registration of 2D cross-sections into 3D image datasets. Such techniques can be used to put different image modalities into spatial correspondence, for example, 2D images of histological cuts into morphological 3D frameworks. Results We evaluate the CBE-driven PlayStation 3 as a high-performance, cost-effective computing platform by adapting a multimodal alignment procedure to several characteristic hardware properties. The optimisations are based on partitioning, vectorisation, branch reducing and loop unrolling techniques with special attention to 32-bit multiplies and limited local storage on the computing units. We show how a typical image analysis and visualisation problem, the multimodal registration of 2D cross-sections and 3D datasets, benefits from the multi-core based implementation of the alignment algorithm. We discuss several CBE-based optimisation methods and compare our results to standard solutions. More information and the source code are available from http://cbe.ipk-gatersleben.de. Conclusions The results demonstrate that the CBE processor in a PlayStation 3 accelerates computationally intensive multimodal registration, which is of great importance in biological/medical image processing. The PlayStation 3 as a low-cost CBE-based platform offers an efficient option to conventional hardware to solve computational problems in image processing and bioinformatics.

  13. Fast computation of myelin maps from MRI T₂ relaxation data using multicore CPU and graphics card parallelization.

    Science.gov (United States)

    Yoo, Youngjin; Prasloski, Thomas; Vavasour, Irene; MacKay, Alexander; Traboulsee, Anthony L; Li, David K B; Tam, Roger C

    2015-03-01

    To develop a fast algorithm for computing myelin maps from multiecho T2 relaxation data using parallel computation with multicore CPUs and graphics processing units (GPUs). Using an existing MATLAB (MathWorks, Natick, MA) implementation with basic (nonalgorithm-specific) parallelism as a guide, we developed a new version to perform the same computations but using C++ to optimize the hybrid utilization of multicore CPUs and GPUs, based on experimentation to determine which algorithmic components would benefit from CPU versus GPU parallelization. Using 32-echo T2 data of dimensions 256 × 256 × 7 from 17 multiple sclerosis patients and 18 healthy subjects, we compared the two methods in terms of speed, myelin values, and the ability to distinguish between the two patient groups using Student's t-tests. The new method was faster than the MATLAB implementation by 4.13 times for computing a single map and 14.36 times for batch-processing 10 scans. The two methods produced very similar myelin values, with small and explainable differences that did not impact the ability to distinguish the two patient groups. The proposed hybrid multicore approach represents a more efficient alternative to MATLAB, especially for large-scale batch processing. © 2014 Wiley Periodicals, Inc.

  14. Multithreading for Embedded Reconfigurable Multicore Systems

    NARCIS (Netherlands)

    Zaykov, P.G.

    2014-01-01

    In this dissertation, we address the problem of performance efficient multithreading execution on heterogeneous multicore embedded systems. By heterogeneous multicore embedded systems we refer to those, which have real-time requirements and consist of processor tiles with General Purpose Processor

  15. Compiler for Fast, Accurate Mathematical Computing on Integer Processors Project

    Data.gov (United States)

    National Aeronautics and Space Administration — The proposers will develop a computer language compiler to enable inexpensive, low-power, integer-only processors to carry out mathematically-intensive computations...

  16. Fast Parallel Computation of Polynomials Using Few Processors

    DEFF Research Database (Denmark)

    Valiant, Leslie G.; Skyum, Sven; Berkowitz, S.;

    1983-01-01

    It is shown that any multivariate polynomial of degree $d$ that can be computed sequentially in $C$ steps can be computed in parallel in $O((\log d)(\log C + \log d))$ steps using only $(Cd)^{O(1)}$ processors.

  17. Fast parallel computation of polynomials using few processors

    DEFF Research Database (Denmark)

    Valiant, Leslie; Skyum, Sven

    1981-01-01

    It is shown that any multivariate polynomial that can be computed sequentially in C steps and has degree d can be computed in parallel in O((log d)(log C + log d)) steps using only (Cd)^O(1) processors.

  18. Scalable and Flexible heterogeneous multi-core system

    Directory of Open Access Journals (Sweden)

    Rashmi A Jain

    2013-01-01

    Full Text Available Multi-core systems have wide utility in today's applications due to low power consumption and high performance. Many researchers aim at improving the performance of these systems by providing flexible multi-core architectures. Flexibility in multi-core processor systems provides high throughput for uniform parallel applications as well as high performance for more general workloads. This flexibility in the architecture can be achieved by a scalable, changeable-size window microarchitecture, which uses the concept of execution locality to provide large-window capabilities. Exploiting high memory-level parallelism (MLP) reduces the impact of the memory wall. The microarchitecture contains a set of small and fast cache processors which execute high-locality code, while a network of small in-order memory engines executes low-locality code to improve performance through instruction-level parallelism (ILP). A dynamic heterogeneous multi-core architecture is capable of reconfiguring itself to fit application requirements. A study of different scalable and flexible architectures for heterogeneous multi-core systems has been carried out and is presented here.

  19. Low Power Shared Cache Reconfigurable Method for Multicore Processor

    Institute of Scientific and Technical Information of China (English)

    方娟; 雷鼎

    2013-01-01

    To reduce overall processor power consumption, current multi-core cache low-power techniques are analysed and a low-power reconfiguration method for the shared cache of chip multi-processors (CMP) is proposed. First, the necessity of cache reconfiguration is analysed using static reconfiguration on the shared cache; then a reconfiguration strategy is added to the cache access process. Experimental results show that the method saves about 18% of energy on average, while the performance penalty is only 4%.

  20. Deterministic Replay for Parallel Programs in Multi-Core Processors

    Institute of Scientific and Technical Information of China (English)

    高岚; 王锐; 钱德沛

    2013-01-01

    Deterministic replay of parallel programs on multi-core processors is an effective means of debugging parallel programs and is important for parallel programming. However, because of unsynchronized shared-memory accesses in multi-core architectures, research on deterministic replay still faces many challenges; this makes debugging parallel programs very difficult and seriously hinders their adoption and development on multi-core architectures. This paper analyzes the key factors that make deterministic replay hard to achieve on multi-core processors and summarizes the metrics for evaluating deterministic replay schemes. After surveying recent research, it reviews the proposed deterministic replay schemes for parallel programs on multi-core processor systems, analyzes and compares software-only and hardware-assisted schemes according to the summarized metrics, and on this basis discusses future research trends and application prospects of deterministic replay for parallel programs on multi-core architectures.

  1. Fast normal random number generators on vector processors

    OpenAIRE

    Brent, Richard P.

    2010-01-01

    We consider pseudo-random number generators suitable for vector processors. In particular, we describe vectorised implementations of the Box-Muller and Polar methods, and show that they give good performance on the Fujitsu VP2200. We also consider some other popular methods, e.g. the Ratio method of Kinderman and Monahan (1977) (as improved by Leva (1992)), and the method of Von Neumann and Forsythe, and show why they are unlikely to be competitive with the Polar method on vector processors.
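
The Polar (Marsaglia) method mentioned above rejects uniform points outside the unit disc and converts each accepted pair into two independent normal deviates; it avoids the sine and cosine calls of Box-Muller, which is part of why it maps well onto vector hardware. A scalar sketch of the method:

```python
import math
import random

def polar_normals(n, rng=random.random):
    """Generate n standard normal deviates with the Polar (Marsaglia) method."""
    out = []
    while len(out) < n:
        u = 2.0 * rng() - 1.0
        v = 2.0 * rng() - 1.0
        s = u * u + v * v
        if s == 0.0 or s >= 1.0:
            continue                     # reject points outside the unit disc
        f = math.sqrt(-2.0 * math.log(s) / s)
        out += [u * f, v * f]            # two independent N(0,1) deviates
    return out[:n]
```

On a vector processor the rejection step is handled by generating a batch of candidate pairs, compressing out the rejected ones, and applying the square root and logarithm to whole vectors at once.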

  2. SWIFT: Fast algorithms for multi-resolution SPH on multi-core architectures

    CERN Document Server

    Gonnet, Pedro; Theuns, Tom; Chalk, Aidan B G

    2013-01-01

    This paper describes a novel approach to neighbour-finding in Smoothed Particle Hydrodynamics (SPH) simulations with large dynamic range in smoothing length. This approach is based on hierarchical cell decompositions, sorted interactions, and a task-based formulation. It is shown to be faster than traditional tree-based codes, and to scale better than domain decomposition-based approaches on shared-memory parallel architectures such as multi-cores.
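
The cell-decomposition idea underlying this approach can be sketched in its simplest, uniform-grid form: bin particles into cells of side h, then search only the 3x3 block of neighbouring cells around each particle instead of testing all pairs. SWIFT's hierarchical cells, sorted interactions and task-based scheduling build on this base case; the 2D sketch below is illustrative only:

```python
from collections import defaultdict

def neighbours_bruteforce(pts, h):
    """All index pairs (i, j), i < j, within distance h (reference)."""
    return {(i, j) for i in range(len(pts)) for j in range(len(pts))
            if i < j and
            (pts[i][0] - pts[j][0]) ** 2 + (pts[i][1] - pts[j][1]) ** 2 <= h * h}

def neighbours_celllist(pts, h):
    """Bin particles into h-sized cells; a pair within distance h can only
    span cells whose indices differ by at most one in each coordinate."""
    cells = defaultdict(list)
    for i, (x, y) in enumerate(pts):
        cells[(int(x // h), int(y // h))].append(i)
    found = set()
    for (cx, cy), members in cells.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for i in members:
                    for j in cells.get((cx + dx, cy + dy), ()):
                        if i < j and ((pts[i][0] - pts[j][0]) ** 2 +
                                      (pts[i][1] - pts[j][1]) ** 2) <= h * h:
                            found.add((i, j))
    return found
```

For roughly uniform particle distributions this reduces the cost from O(N²) to O(N); handling a large dynamic range in smoothing length is exactly where the flat grid breaks down and hierarchical cells become necessary.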

  3. The Serial Link Processor for the Fast TracKer (FTK) processor at ATLAS

    CERN Document Server

    Andreani, A; The ATLAS collaboration; Beccherle, R; Beretta, M; Cipriani, R; Citraro, S; Citterio, M; Colombo, A; Crescioli, F; Dimas, D; Donati, S; Giannetti, P; Kordas, K; Lanza, A; Liberali, V; Luciano, P; Magalotti, D; Neroutsos, P; Nikolaidis, S; Piendibene, M; Sakellariou, A; Shojaii, S; Sotiropoulou, C-L; Stabile, A

    2014-01-01

    The Associative Memory (AM) system of the FTK processor has been designed to perform pattern matching using the hit information of the ATLAS silicon tracker. The AM is the heart of the FTK and it finds track candidates at low resolution that are seeds for a full resolution track fitting. To solve the very challenging data traffic problems inside the FTK, multiple designs and tests have been performed. The currently proposed solution is named the “Serial Link Processor” and is based on an extremely powerful network of 2 Gb/s serial links. This paper reports on the design of the Serial Link Processor consisting of the AM chip, an ASIC designed and optimized to perform pattern matching, and two types of boards, the Local Associative Memory Board (LAMB), a mezzanine where the AM chips are mounted, and the Associative Memory Board (AMB), a 9U VME board which holds and exercises four LAMBs. Special relevance will be given to the AMchip design that includes two custom cells optimized for low consumption. We repo...

  4. Graphical programming harnesses multicore PC technology

    Institute of Scientific and Technical Information of China (English)

    National Instruments

    2007-01-01

    Multicore processing is generating considerable buzz within the PC industry, largely because both Intel and AMD have released initial versions of their own multicore processors. These first multicore processors contain two cores, or computing engines, located in one physical processor, hence the name dual-core processors. Processors with more than two cores are also on the horizon. Dual-core processors can simultaneously execute two computing tasks. This is advantageous in multitasking environments, such as Windows XP, in which you simultaneously run multiple applications. Two applications, National Instruments LabVIEW and Microsoft Excel for example, can each access a separate processor core at the same time, thus improving overall performance for applications such as data logging.

  5. Multicore Considerations for Legacy Flight Software Migration

    Science.gov (United States)

    Vines, Kenneth; Day, Len

    2013-01-01

    In this paper we will discuss potential benefits and pitfalls when considering a migration from an existing single core code base to a multicore processor implementation. The results of this study present options that should be considered before migrating fault managers, device handlers and tasks with time-constrained requirements to a multicore flight software environment. Possible future multicore test bed demonstrations are also discussed.

  7. XOP: a second generation fast processor for on-line use in high energy physics experiments

    CERN Document Server

    Lingjaerde, Tor

    1981-01-01

    Processors for trigger calculations and data compression in high energy physics are characterized by a high data input capability combined with fast execution of relatively simple routines. In order to achieve the required performance it is advantageous to replace the classical computer instruction-set by microcoded instructions, the various fields of which control the internal subunits in parallel. The fast processor called ESOP is based on such a principle: the different operations are handled step by step by dedicated optimized modules under control of a central instruction unit. Thus, the arithmetic operations, address calculations, conditional checking, loop counts and next instruction evaluation all overlap in time. Based upon the experience from ESOP the architecture of a new processor "XOP" is beginning to take shape which will be faster and easier to use. In this context the most important innovations are: easy handling of operands in the arithmetic unit by means of three data buses and large data fi...

  8. Multi-Core Cache Hierarchies

    CERN Document Server

    Balasubramonian, Rajeev

    2011-01-01

    A key determinant of overall system performance and power dissipation is the cache hierarchy since access to off-chip memory consumes many more cycles and energy than on-chip accesses. In addition, multi-core processors are expected to place ever higher bandwidth demands on the memory system. All these issues make it important to avoid off-chip memory access by improving the efficiency of the on-chip cache. Future multi-core processors will have many large cache banks connected by a network and shared by many cores. Hence, many important problems must be solved: cache resources must be allocat

  9. Associative Memory Design for the FastTrack Processor (FTK) at ATLAS

    CERN Document Server

    Annovi, A; The ATLAS collaboration; Volpi, G; Beccherle, R; Bossini, E; Crescioli, F; Dell'Orso, M; Giannetti, P; Amerio, S; Hoff, J; Liu, T; Sacco, I; Liberali, V; Stabile, A; Schoening, A; Soltveit, H; Tripiccione, R

    2011-01-01

    We propose a new generation of VLSI processor for pattern recognition based on Associative Memory architecture, optimized for on-line track finding in high-energy physics experiments. We describe the architecture, the technology studies and the prototype design of a new Associative Memory project: it maximizes the pattern density on ASICs, minimizes the power consumption and improves the functionality for the fast tracker processor proposed to upgrade the ATLAS trigger at LHC. Finally we focus on possible future applications inside and outside high energy physics.

  10. A Fast DCT Algorithm for Watermarking in Digital Signal Processor

    Directory of Open Access Journals (Sweden)

    S. E. Tsai

    2017-01-01

    Full Text Available Discrete cosine transform (DCT) has been an international standard in the Joint Photographic Experts Group (JPEG) format to reduce the blocking effect in digital image compression. This paper proposes a fast discrete cosine transform (FDCT) algorithm that utilizes the energy compactness and matrix sparseness properties in the frequency domain to achieve higher computation performance. For a JPEG image of 8×8 block size in the spatial domain, the algorithm decomposes the two-dimensional (2D) DCT into one pair of one-dimensional (1D) DCTs with transform computation in only 24 multiplications. The 2D spatial data is a linear combination of the base images obtained by the outer product of the column and row vectors of cosine functions, so that the inverse DCT is equally efficient. Implementation of the FDCT algorithm shows that embedding a watermark image of 32 × 32 block pixel size in a 256 × 256 digital image can be completed in only 0.24 seconds, and the extraction of the watermark by inverse transform is within 0.21 seconds. The proposed FDCT algorithm is shown to be more computationally efficient than many previous works.
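
The 2D-to-1D decomposition the abstract relies on can be sketched as follows: an orthonormal 1D DCT-II applied along the rows and then along the columns of an 8×8 block reproduces the direct 2D DCT. This shows only the separability property, not the paper's 24-multiplication fast variant; all function names here are illustrative.

```python
import math

N = 8  # JPEG block size

def scale(k):
    # orthonormal DCT-II scale factor
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

def dct1d(v):
    return [scale(k) * sum(v[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                           for n in range(N))
            for k in range(N)]

def dct2d_separable(block):
    # 1D DCT along each row, then along each column (row-column decomposition)
    rows = [dct1d(row) for row in block]
    cols = [dct1d(list(col)) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def dct2d_direct(block):
    # direct 2D definition, used only to check the separable version
    return [[scale(u) * scale(v) * sum(
                block[x][y]
                * math.cos(math.pi * (2 * x + 1) * u / (2 * N))
                * math.cos(math.pi * (2 * y + 1) * v / (2 * N))
                for x in range(N) for y in range(N))
             for v in range(N)] for u in range(N)]
```

Separability is what turns the 64-tap 2D transform into 16 short 1D transforms, the starting point for fast variants like the paper's.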

  11. Fast space-filling molecular graphics using dynamic partitioning among parallel processors.

    Science.gov (United States)

    Gertner, B J; Whitnell, R M; Wilson, K R

    1991-09-01

    We present a novel algorithm for the efficient generation of high-quality space-filling molecular graphics that is particularly appropriate for the creation of the large number of images needed in the animation of molecular dynamics. Each atom of the molecule is represented by a sphere of an appropriate radius, and the image of the sphere is constructed pixel-by-pixel using a generalization of the lighting model proposed by Porter (Comp. Graphics 1978, 12, 282). The edges of the spheres are antialiased, and intersections between spheres are handled through a simple blending algorithm that provides very smooth edges. We have implemented this algorithm on a multiprocessor computer using a procedure that dynamically repartitions the effort among the processors based on the CPU time used by each processor to create the previous image. This dynamic reallocation among processors automatically maximizes efficiency in the face of both the changing nature of the image from frame to frame and the shifting demands of the other programs running simultaneously on the same processors. We present data showing the efficiency of this multiprocessing algorithm as the number of processors is increased. The combination of the graphics and multiprocessor algorithms allows the fast generation of many high-quality images.
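
The dynamic repartitioning idea, allocating the next frame's scanlines in proportion to each processor's measured speed on the previous frame, can be sketched as follows. This is a simplified illustration under assumed names; the proportional-share policy here is not the authors' exact scheme.

```python
def repartition(total_rows, prev_rows, prev_times):
    """Assign image rows to workers in proportion to the speed (rows/second)
    each worker achieved on the previous frame."""
    speeds = [rows / t for rows, t in zip(prev_rows, prev_times)]
    total_speed = sum(speeds)
    shares = [int(total_rows * s / total_speed) for s in speeds]
    # rows lost to rounding down go to the fastest worker
    shares[speeds.index(max(speeds))] += total_rows - sum(shares)
    return shares
```

A worker slowed by competing jobs reports a longer time, so it automatically receives fewer rows on the next frame, which is the self-balancing behaviour the abstract describes.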

  12. Multicore technology architecture, reconfiguration, and modeling

    CERN Document Server

    Qadri, Muhammad Yasir

    2013-01-01

    The saturation of design complexity and clock frequencies for single-core processors has resulted in the emergence of multicore architectures as an alternative design paradigm. Nowadays, multicore/multithreaded computing systems are not only a de-facto standard for high-end applications, they are also gaining popularity in the field of embedded computing. The start of the multicore era has altered the concepts relating to almost all of the areas of computer architecture design, including core design, memory management, thread scheduling, application support, inter-processor communication, debu

  13. Inter-processor communication method of TMS320C6678 multicore DSP

    Institute of Scientific and Technical Information of China (English)

    吴灏; 肖吉阳; 范红旗; 付强

    2012-01-01

    Inter-processor communication is the main challenge in chip multi-processor systems for embedded applications. The inter-core communication mechanism of the KeyStone-architecture TMS320C6678 processor is studied, and communication between cores is designed and implemented using inter-processor interrupts and the inter-core communication registers. From a system perspective, two topological structures for multi-core communication are designed and simulated, and their performance is analyzed and compared. The results offer guidance for designing inter-core communication on multi-core DSPs.

  14. SAD PROCESSOR FOR MULTIPLE MACROBLOCK MATCHING IN FAST SEARCH VIDEO MOTION ESTIMATION

    Directory of Open Access Journals (Sweden)

    Nehal N. Shah

    2015-02-01

    Full Text Available Motion estimation is a very important but computationally complex task in video coding. The process of determining motion vectors based on the temporal correlation of consecutive frames is used for video compression. In order to reduce the computational complexity of motion estimation and maintain the quality of encoding during motion compensation, different fast search techniques are available. These block-based motion estimation algorithms use the sum of absolute differences (SAD) between the corresponding macroblock in the current frame and all the candidate macroblocks in the reference frame to identify the best match. Existing implementations can compute the SAD between two blocks using a sequential or pipelined approach, but performing a multi-operand SAD in a single clock cycle with optimized resources is state of the art. In this paper, various parallel architectures for computation of the fixed-block-size SAD are evaluated, and a fast parallel SAD architecture with optimized resources is proposed. Further, a SAD processor with 9 processing elements is described which can be configured for any existing fast-search block matching algorithm. The proposed SAD processor consumes 7% fewer adders per processing element compared to the existing implementation. Using nine processing elements it can process 84 HD frames per second in the worst case, which is a good outcome for real-time implementation; in the average case the architecture processes 325 HD frames per second.
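
The SAD-based block matching such architectures accelerate can be stated compactly in software. This is a scalar reference sketch with illustrative names; the paper's contribution is the parallel multi-operand hardware, not this algorithm.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equal-sized blocks."""
    return sum(abs(a - b) for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

def full_search(cur_block, ref_frame, top, left, radius):
    """Exhaustively test every candidate position within +/-radius of
    (top, left) in the reference frame; return (best_sad, dy, dx)."""
    n = len(cur_block)
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = top + dy, left + dx
            if 0 <= y and 0 <= x and y + n <= len(ref_frame) and x + n <= len(ref_frame[0]):
                cand = [row[x:x + n] for row in ref_frame[y:y + n]]
                cost = sad(cur_block, cand)
                if best is None or cost < best[0]:
                    best = (cost, dy, dx)
    return best
```

Every candidate SAD is independent of the others, which is precisely why the computation maps well onto parallel processing elements.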

  15. Task Allocation and Load Balance on Multi-core Processors

    Institute of Scientific and Technical Information of China (English)

    彭蔓蔓; 黄亮

    2011-01-01

    According to the characteristics of multi-core processor systems, this paper improves the task allocation and scheduling model to balance the load across the processor cores, and proposes a balanced population genetic algorithm (BPGA). Under the height constraints of the task nodes, the algorithm assigns task nodes to processing cores at random while keeping the number of nodes balanced on each core. Experiments with randomly generated DAG graphs show that, compared with other algorithms, BPGA achieves a shorter makespan and less execution time.

  16. Associative Memory Design for the Fast TracKer Processor (FTK) at ATLAS

    CERN Document Server

    Beretta, M; The ATLAS collaboration

    2011-01-01

    We describe a VLSI processor for pattern recognition based on Content Addressable Memory (CAM) architecture, optimized for on-line track finding in high-energy physics experiments. We have developed this device using 65 nm technology combining a full custom CAM cell with standard-cell control logic. The customized design maximizes the pattern density, minimizes the power consumption and implements the functionalities needed for the planned Fast Tracker (FTK) [2], an ATLAS trigger upgrade project at LHC. We introduce a new variable resolution pattern matching technique using “don’t care” bits to set the pattern-matching window independently for each pattern and each layer.

  17. Associative Memory design for the FastTrack processor (FTK) at ATLAS

    CERN Document Server

    Annovi, A; The ATLAS collaboration; Bossini, E; Crescioli, F; Dell'Orso, M; Giannetti, P; Piendibene, M; Sacco, I; Sartori, L; Tripiccione, R

    2010-01-01

    We propose a new generation of VLSI processor for pattern recognition based on Associative Memory architecture, optimized for on-line track finding in high-energy physics experiments. We describe the architecture, the technology studies and the prototype design of a new R&D Associative Memory project: it maximizes the pattern density on ASICs, minimizes the power consumption and improves the functionality for the Fast Tracker (FTK) proposed to upgrade the ATLAS trigger at LHC. Finally we will focus on possible future applications inside and outside High Energy Physics (HEP).

  18. Fast Optimal Replica Placement with Exhaustive Search Using Dynamically Reconfigurable Processor

    Directory of Open Access Journals (Sweden)

    Hidetoshi Takeshita

    2011-01-01

    Full Text Available This paper proposes a new replica placement algorithm that expands the exhaustive search limit with reasonable calculation time. It combines a new type of parallel data-flow processor with an architecture tuned for fast calculation. The replica placement problem is to find a replica-server set satisfying service constraints in a content delivery network (CDN). It is derived from the set cover problem, which is known to be NP-hard. It is impractical to use exhaustive search to obtain optimal replica placement in large-scale networks, because calculation time increases with the number of combinations. To reduce calculation time, heuristic algorithms have been proposed, but it is known that no heuristic algorithm is assured of finding the optimal solution. The proposed algorithm suits parallel processing and pipeline execution and is implemented on DAPDNA-2, a dynamically reconfigurable processor. Experiments show that the proposed algorithm expands the exhaustive search limit by a factor of 18.8 compared to the conventional algorithm running on a von Neumann-type processor.
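
The set-cover formulation, and why exhaustive search explodes, can be sketched with a toy example. This is not the DAPDNA-2 implementation; the data layout and names are assumptions.

```python
from itertools import combinations

def optimal_replicas(servers, clients):
    """Exhaustively search for the smallest set of replica servers that
    together cover all clients.

    servers: dict mapping server name -> set of clients it can serve.
    The search space grows as 2**len(servers), which is why software
    exhaustive search only works for small networks.
    """
    client_set = set(clients)
    for k in range(1, len(servers) + 1):
        for combo in combinations(sorted(servers), k):
            covered = set().union(*(servers[s] for s in combo))
            if client_set <= covered:
                return set(combo)  # first hit at size k is optimal
    return None
```

Iterating sizes k = 1, 2, ... guarantees the first covering set found is minimal, unlike greedy heuristics, which may return a larger set.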

  19. CMS multicore scheduling strategy

    Energy Technology Data Exchange (ETDEWEB)

    Perez-Calero Yzquierdo, Antonio [Madrid, CIEMAT; Hernandez, Jose [Madrid, CIEMAT; Holzman, Burt [Fermilab; Majewski, Krista [Fermilab; McCrea, Alison [UC, San Diego

    2014-01-01

    In the next years, processor architectures based on much larger numbers of cores will most likely be the model to continue 'Moore's Law' style throughput gains. This not only results in many more parallel jobs running the LHC Run 1 era monolithic applications, but the memory requirements of these processes also push worker-node architectures to the limit. One solution is parallelizing the application itself, through forking and memory sharing or through threaded frameworks. CMS is following all of these approaches and has a comprehensive strategy to schedule multicore jobs on the GRID based on the glideinWMS submission infrastructure. The main component of the scheduling strategy, a pilot-based model with dynamic partitioning of resources that allows the transition to multicore or whole-node scheduling without disallowing the use of single-core jobs, is described. This contribution also presents the experience gained with the proposed multicore scheduling schema and gives an outlook of further developments working towards the restart of the LHC in 2015.

  20. Fast structural design and analysis via hybrid domain decomposition on massively parallel processors

    Science.gov (United States)

    Farhat, Charbel

    1993-01-01

    A hybrid domain decomposition framework for static, transient and eigen finite element analyses of structural mechanics problems is presented. Its basic ingredients include physical substructuring and /or automatic mesh partitioning, mapping algorithms, 'gluing' approximations for fast design modifications and evaluations, and fast direct and preconditioned iterative solvers for local and interface subproblems. The overall methodology is illustrated with the structural design of a solar viewing payload that is scheduled to fly in March 1993. This payload has been entirely designed and validated by a group of undergraduate students at the University of Colorado using the proposed hybrid domain decomposition approach on a massively parallel processor. Performance results are reported on the CRAY Y-MP/8 and the iPSC-860/64 Touchstone systems, which represent both extreme parallel architectures. The hybrid domain decomposition methodology is shown to outperform leading solution algorithms and to exhibit an excellent parallel scalability.

  1. DEVELOPMENT OF A FAST MICRON-RESOLUTION BEAM POSITION MONITOR SIGNAL PROCESSOR FOR LINEAR COLLIDER BEAMBASED FEEDBACK SYSTEMS

    CERN Document Server

    Apsimon, R; Clarke, C; Constance, B; Dabiri Khah, H; Hartin, T; Perry, C; Resta Lopez, J; Swinson, C; Christian, G B; Kalinin, A

    2009-01-01

    We present the design of a prototype fast beam position monitor (BPM) signal processor for use in inter-bunch beam-based feedbacks for linear colliders and electron linacs. We describe the FONT4 intra-train beam-based digital position feedback system prototype deployed at the Accelerator test facility (ATF) extraction line at KEK, Japan. The system incorporates a fast analogue beam position monitor front-end signal processor, a digital feedback board, and a fast kicker-driver amplifier. The total feedback system latency is less than 150ns, of which less than 10ns is used for the BPM processor. We report preliminary results of beam tests using electron bunches separated by c. 150ns. Position resolution of order 1 micron was obtained.

  2. Scalable Algorithms for Parallel Discrete Event Simulation Systems in Multicore Environments

    Science.gov (United States)

    2013-05-01

    Traditionally, studies with PDES algorithms on Beowulf clusters are almost singularly focused on tolerating the impact of communication latency. This work instead targets multicore and manycore processors and clusters of multicores: a multithreaded version of the ROSS simulator (ROSS-MT) was designed and optimized, and evaluated on clusters of multicores.

  3. Associative Memory Design for the Fast TracKer Processor (FTK)at ATLAS

    CERN Document Server

    Annovi, A; The ATLAS collaboration; Beretta, M; Bossini, E; Crescioli, F; Dell'Orso, M; Giannetti, P; Hoff, J; Liu, T; Liberali, V; Sacco, I; Schoening, A; Soltveit, H K; Stabile, A; Tripiccione, R

    2011-01-01

    We describe a VLSI processor for pattern recognition based on Content Addressable Memory (CAM) architecture, optimized for on-line track finding in high-energy physics experiments. A large CAM bank stores all trajectories of interest and extracts the ones compatible with a given event. This task is naturally parallelized by a CAM architecture able to output identified trajectories, recognized among a huge amount of possible combinations, in just a few 100 MHz clock cycles. We have developed this device (called the AMchip03 processor), using 180 nm technology, for the Silicon Vertex Trigger (SVT) upgrade at CDF [1] using a standard-cell VLSI design methodology. We propose a new design that introduces a full custom CAM cell and takes advantage of 65 nm technology. The customized design maximizes the pattern density, minimizes the power consumption and implements the functionalities needed for the planned Fast Tracker (FTK) [2], an ATLAS trigger upgrade project at LHC. We introduce a new variable resolution patt...

  4. Low pain vs no pain multi-core Haskells

    DEFF Research Database (Denmark)

    Aswad, Mustafa; Trinder, Phil; Al Zain, Abyd

    2011-01-01

    Multicore and NUMA architectures are becoming the dominant processor technology and functional languages are theoretically well suited to exploit them. In practice, however, implementing effective high level parallel functional languages is extremely challenging. This paper is a systematic progra...

  5. CMS Multicore Scheduling Strategy

    CERN Document Server

    Perez-Calero Yzquierdo, Antonio

    2014-01-01

    In the next years, processor architectures based on much larger numbers of cores will most likely be the model to continue Moore's Law style throughput gains. This not only results in many more parallel jobs running the LHC Run 1 era monolithic applications; the memory requirements of these processes also push the worker-node architectures to the limit. One solution is parallelizing the application itself, through forking and memory sharing or through threaded frameworks. CMS is following all of these approaches and has a comprehensive strategy to schedule multi-core jobs on the GRID based on the glideIn WMS submission infrastructure. We will present the individual components of the strategy, from special site-specific queues used during provisioning of resources and their implications for scheduling, to dynamic partitioning within a single pilot that allows a transition to multi-core or whole-node scheduling at site level without disallowing single-core jobs. In this presentation, we will present the experiences mad...

  6. Frequent Graph Mining on Multi-Core Processor

    Institute of Scientific and Technical Information of China (English)

    栾华; 周明全; 付艳

    2015-01-01

    Multi-core processors have become the mainstream of modern processor architecture. Frequent graph mining is a popular problem with practical applications in many domains, so accelerating the mining process by taking full advantage of multi-core processors has both research significance and practical value. A parallel mining strategy based on depth-first search (DFS) is proposed, with a task pool used to maintain the workload. Compared with a method based on breadth-first search, temporal data locality is improved and a large amount of memory is saved. Cache-conscious node-edge arrays, in which the record data of a thread are arranged contiguously, are designed to reduce both the data size needed to represent the original graphs and the cache miss ratio; false sharing, which severely degrades performance, is mostly eliminated. To reduce lock contention, a flexible method is explored for finding work tasks, and memory-management queues are used to reduce the overhead of frequent memory allocation and free operations. A detailed performance study on both synthetic and real data sets shows that the proposed techniques efficiently lower memory usage and cache misses and achieve a 10-fold speedup on a 12-core machine.
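
The task-pool pattern the abstract describes, where worker threads repeatedly pull tasks and push newly generated subtasks back, can be sketched generically. This is a minimal illustration assuming a single shared queue; it does not reproduce the paper's lock-reduction or cache-layout techniques.

```python
import queue
import threading

def run_task_pool(initial_tasks, expand, n_threads=4):
    """Generic task-pool loop: each worker thread repeatedly pulls a task,
    processes it with expand(task) -> (result, subtasks), and pushes the
    subtasks back into the shared pool."""
    pool = queue.Queue()
    for task in initial_tasks:
        pool.put(task)
    results, lock = [], threading.Lock()

    def worker():
        while True:
            task = pool.get()              # blocks until a task is available
            result, subtasks = expand(task)
            with lock:
                results.append(result)
            for sub in subtasks:           # enqueue before task_done so the
                pool.put(sub)              # pending count never hits zero early
            pool.task_done()

    for _ in range(n_threads):
        threading.Thread(target=worker, daemon=True).start()
    pool.join()                            # returns once every task is done
    return results
```

In a DFS-style miner, each task would be a candidate subgraph and `expand` would return its frequent extensions; the pool keeps all cores busy even when the search tree is irregular.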

  7. A Coprocessor Design for the Computation of Transcendental Functions in Multi-Core Processor

    Institute of Scientific and Technical Information of China (English)

    黄小康; 杜慧敏; 李涛; 周佳佳

    2016-01-01

    SMT-PAAG is a multi-core processor for graphics, image, and digital signal processing. This article describes the design principle, features, implementation, and verification of a coprocessor for transcendental functions based on SMT-PAAG. The coprocessor implements a four-way arithmetic channel with a fully pipelined architecture based on a piecewise linear approximation algorithm, unifying various arithmetic operations including vector multiplication, division, square root, dot product, trigonometric, exponential, and arbitrary-base logarithmic computation. Finally, the design was simulated successfully on a SystemVerilog simulation platform and the statistical error of each operation was obtained.
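
The piecewise linear approximation underlying such function units can be sketched in software: precompute a slope/intercept pair per segment, then evaluate any point with one multiply and one add. This is a simplified single-function illustration; the segment count and names are assumptions, not the coprocessor's actual parameters.

```python
import math

def build_pwl_table(f, lo, hi, n):
    """Precompute (slope, intercept) for n equal-width linear segments of f."""
    step = (hi - lo) / n
    table = []
    for i in range(n):
        x0, x1 = lo + i * step, lo + (i + 1) * step
        slope = (f(x1) - f(x0)) / step
        table.append((slope, f(x0) - slope * x0))
    return table

def eval_pwl(table, lo, hi, x):
    """Evaluate the approximation: one table lookup, one multiply, one add."""
    n = len(table)
    i = min(int((x - lo) / (hi - lo) * n), n - 1)
    slope, intercept = table[i]
    return slope * x + intercept
```

In hardware, the segment index comes straight from the high bits of the operand, so every transcendental function reduces to the same lookup-multiply-add datapath, which is what lets one unit serve many operations.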

  8. Challenges of Algebraic Multigrid across Multicore Architectures

    Energy Technology Data Exchange (ETDEWEB)

    Baker, A H; Gamblin, T; Schulz, M; Yang, U M

    2010-04-12

    Algebraic multigrid (AMG) is a popular solver for large-scale scientific computing and an essential component of many simulation codes. AMG has been shown to be extremely efficient on distributed-memory architectures. However, when executed on modern multicore architectures, we face new challenges that can significantly deteriorate AMG's performance. We examine its performance and scalability on three disparate multicore architectures: a cluster with four AMD Opteron Quad-core processors per node (Hera), a Cray XT5 with two AMD Opteron Hex-core processors per node (Jaguar), and an IBM BlueGene/P system with a single Quad-core processor (Intrepid). We discuss our experiences on these platforms and present results using both an MPI-only and a hybrid MPI/OpenMP model. We also discuss a set of techniques that helped to overcome the associated problems, including thread and process pinning and correct memory associations.

  9. Analysis and Implementation of Particle-to-Particle (P2P) Graphics Processor Unit (GPU) Kernel for Black-Box Adaptive Fast Multipole Method

    Science.gov (United States)

    2015-06-01

    Analysis and Implementation of Particle-to-Particle (P2P) Graphics Processor Unit (GPU) Kernel for Black-Box Adaptive Fast Multipole Method, by Richard H Haney and Dale Shires. ARL-TR-7315, June 2015, US Army Research Laboratory.

  10. Neural simulations on multi-core architectures

    Directory of Open Access Journals (Sweden)

    Hubert Eichner

    2009-07-01

    Full Text Available Neuroscience is witnessing increasing knowledge about the anatomy and electrophysiological properties of neurons and their connectivity, leading to an ever increasing computational complexity of neural simulations. At the same time, a rather radical change in personal computer technology emerges with the establishment of multi-cores: high-density, explicitly parallel processor architectures for both high performance as well as standard desktop computers. This work introduces strategies for the parallelization of biophysically realistic neural simulations based on the compartmental modeling technique and results of such an implementation, with a strong focus on multi-core architectures and automation, i. e. user-transparent load balancing.

  11. Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures.

    Science.gov (United States)

    Sharma, Anuj; Manolakos, Elias S

    2015-01-01

    Fast increasing computational demand for all-to-all protein structures comparison (PSC) is a result of three confounding factors: rapidly expanding structural proteomics databases, high computational complexity of pairwise protein comparison algorithms, and the trend in the domain towards using multiple criteria for protein structures comparison (MCPSC) and combining results. We have developed a software framework that exploits many-core and multicore CPUs to implement efficient parallel MCPSC in modern processors based on three popular PSC methods, namely, TMalign, CE, and USM. We evaluate and compare the performance and efficiency of the two parallel MCPSC implementations using Intel's experimental many-core Single-Chip Cloud Computer (SCC) as well as Intel's Core i7 multicore processor. We show that the 48-core SCC is more efficient than the latest generation Core i7, achieving a speedup factor of 42 (efficiency of 0.9), making many-core processors an exciting emerging technology for large-scale structural proteomics. We compare and contrast the performance of the two processors on several datasets and also show that MCPSC outperforms its component methods in grouping related domains, achieving a high F-measure of 0.91 on the benchmark CK34 dataset. The software implementation for protein structure comparison using the three methods and combined MCPSC, along with the developed underlying rckskel algorithmic skeletons library, is available via GitHub.

  12. The EDRO board connected to the Associative Memory: a "Baby" FastTracKer processor for the ATLAS experiment

    CERN Document Server

    Annovi, A; Bevacqua, V; Cervigni, F; Crescioli, F; Fabbri, L; Giannetti, P; Giorgi, F; Magalotti, D; Negri, A; Piendibene, M; Roda, C; Sbarra, C; Villa, M; Vitillo, RA; Volpi, G

    2012-01-01

    The FastTracKer (FTK), a dedicated hardware processor, performs fast and precise online full track reconstruction at the ATLAS experiment, within an average latency of a few dozen microseconds. It is made of two pipelined processors, the Associative Memory finding low precision tracks, and the Track Fitter refining the track quality with high precision fits. FTK has to face the Large Hadron Collider (LHC) Phase I luminosity. So, while the new processor requires the best of the available technology for tracking in high occupancy conditions, we want to use already existing prototypes to exercise the FTK functions early in the new ATLAS environment. A few boards connected together constitute a "baby FTK" that will soon grow into the "vertical slice". The vertical slice will cover a small projective wedge in the detector, but it will be functionally complete. It will provide a full test of the entire FTK data chain, in the laboratory first and in beam-on conditions after. It will require early development a...

  13. Research and Design of Linked IPS Based on Octeon Multi-core Network Processor for IPv6

    Institute of Scientific and Technical Information of China (English)

    杨吉喆; 王玲玲; 陆建德

    2011-01-01

    This paper studies a new-generation linked intrusion prevention system (IPS) for high-speed IPv6 networks based on the Octeon multi-core network processor, and designs a new linked intrusion prevention prototype. The system builds on high-speed processing across the Octeon cores and accounts for the new characteristics of intrusions in IPv6 networks. On top of rule matching against an intrusion-detection rule library, it applies protocol analysis and flow-based detection techniques, distributes control-plane and data-plane execution across the Octeon cores, and uses a named-block mechanism for inter-core communication. Through feedback from the data-plane cores to the control-plane core, the flow-processing and protocol-analysis modules are linked with the control module at high speed, making the system competent for Gbps-level intrusion detection and linked prevention.

  14. Summary of multi-core hardware and programming model investigations

    Energy Technology Data Exchange (ETDEWEB)

    Kelly, Suzanne Marie; Pedretti, Kevin Thomas Tauke; Levenhagen, Michael J.

    2008-05-01

    This report summarizes our investigations into multi-core processors and programming models for parallel scientific applications. The motivation for this study was to better understand the landscape of multi-core hardware, future trends, and the implications on system software for capability supercomputers. The results of this study are being used as input into the design of a new open-source light-weight kernel operating system being targeted at future capability supercomputers made up of multi-core processors. A goal of this effort is to create an agile system that is able to adapt to and efficiently support whatever multi-core hardware and programming models gain acceptance by the community.

  15. Structure_threader: An improved method for automation and parallelization of programs structure, fastStructure and MavericK on multicore CPU systems.

    Science.gov (United States)

    Pina-Martins, Francisco; Silva, Diogo N; Fino, Joana; Paulo, Octávio S

    2017-08-04

    Structure_threader is a program to parallelize multiple runs of genetic clustering software that does not make use of multithreading technology (structure, fastStructure and MavericK) on multicore computers. Our approach was benchmarked across multiple systems and displayed great speed improvements relative to the single-threaded implementation, scaling very close to linearly with the number of physical cores used. Structure_threader was compared to previous software written for the same task (ParallelStructure and StrAuto) and proved to be the fastest wrapper (up to 25% faster) under all tested scenarios. Furthermore, Structure_threader can perform several automatic and convenient operations, assisting the user in assessing the most biologically likely value of 'K' via implementations such as the "Evanno" or "Thermodynamic Integration" tests, and automatically draws the "meanQ" plots (static or interactive) for each value of K (or even combined plots). Structure_threader is written in Python 3 and licensed under the GPLv3. It can be downloaded free of charge at https://github.com/StuntsPT/Structure_threader. © 2017 John Wiley & Sons Ltd.
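The wrapper pattern described above, farming independent single-threaded runs across the physical cores, can be sketched in Python. `run_structure` is a hypothetical stand-in for one clustering invocation (a real wrapper would launch the external binary via `subprocess`); only the parallelization structure is illustrated.

```python
from concurrent.futures import ProcessPoolExecutor
import os

def run_structure(k, replicate):
    """Hypothetical stand-in for one single-threaded clustering run
    (e.g. one STRUCTURE invocation for a given K); it returns a fake
    log-likelihood so the example is self-contained."""
    return {"K": k, "rep": replicate, "lnL": -1000.0 * k - replicate}

def parallel_runs(k_values, reps, workers=None):
    """Launch every (K, replicate) run concurrently, one per process."""
    workers = workers or os.cpu_count()
    jobs = [(k, r) for k in k_values for r in reps]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_structure, *zip(*jobs)))

if __name__ == "__main__":
    results = parallel_runs(k_values=[1, 2, 3], reps=range(2), workers=2)
    print(len(results))  # 6 independent runs completed
```

Because the runs share no state, near-linear scaling with physical cores (as reported above) is the expected behavior.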

  16. Examination of Multi-Core Architectures

    Science.gov (United States)

    2010-11-01

    Report excerpt (table of contents and references only): the examined architectures include the Cell Broadband Engine, the Tera-Op Reliable Intelligently Adaptive Processing System (TRIPS), Intel (including the Core i7 architecture), and Tilera's multicore platform.

  17. T-CREST: Time-predictable multi-core architecture for embedded systems

    DEFF Research Database (Denmark)

    Schoeberl, Martin; Abbaspourseyedi, Sahar; Jordan, Alexander

    2015-01-01

    Real-time systems need time-predictable platforms to allow static analysis of the worst-case execution time (WCET). Standard multi-core processors are optimized for the average case and are hardly analyzable. Within the T-CREST project we propose novel solutions for time-predictable multi-core ar...

  18. DIALIGN P: Fast pair-wise and multiple sequence alignment using parallel processors

    OpenAIRE

    Kaufmann Michael; Nieselt Kay; Schmollinger Martin; Morgenstern Burkhard

    2004-01-01

    Abstract Background Parallel computing is frequently used to speed up computationally expensive tasks in Bioinformatics. Results Herein, a parallel version of the multi-alignment program DIALIGN is introduced. We propose two ways of dividing the program into independent sub-routines that can be run on different processors: (a) pair-wise sequence alignments that are used as a first step to multiple alignment account for most of the CPU time in DIALIGN. Since alignments of different sequence pa...

  19. Fast Track Pattern Recognition in High Energy Physics Experiments with the Automata Processor

    CERN Document Server

    Wang, Michael H L S; Green, Christopher; Guo, Deyuan; Wang, Ke; Zmuda, Ted

    2016-01-01

    We explore the Micron Automata Processor (AP) as a suitable commodity technology that can address the growing computational needs of track pattern recognition in High Energy Physics experiments. A toy detector model is developed for which a track trigger based on the Micron AP is used to demonstrate a proof-of-principle. Although primarily meant for high speed text-based searches, we demonstrate that the Micron AP is ideally suited to track finding applications.

  20. The EDRO board connected to the Associative Memory: a "Baby" FastTracKer processor for the ATLAS experiment

    CERN Document Server

    Annovi, A; The ATLAS collaboration; Villa, M; Bevacqua, V; Vitillo, R A; Giorgi, F; Magalotti, D; Roda, C; Cervigni, F; Giannetti, P; Negri, A; Piendibene, M; Volpi, G; Fabbri, L; Sbarra, C

    2011-01-01

    The FastTracKer (FTK) is a dedicated hardware system able to perform online fast and precise track reconstruction of the full events at the Atlas experiment, within an average latency of a few dozen microseconds. It is made of two pipelined processors: the Associative Memory (AM), finding low-precision tracks called "roads", and the Track Fitter (TF), refining the track quality with high-precision fits. The FTK design [1] that works well at the Large Hadron Collider (LHC) Phase I luminosity requires the best of the available technology for tracking in high-occupancy conditions. While the new processor is designed for the most demanding LHC conditions, we want to use already existing prototypes, some of them developed for the SLIM5 collaboration [2], to exercise the FTK functions in the new Atlas environment. During laboratory tests, the EDRO board (Event Dispatch and Read-Out) receives on a clustering mezzanine (able to calculate the pixel and SCT cluster centroids) "fake" detector raw data on S-links from ...

  1. DIALIGN P: Fast pair-wise and multiple sequence alignment using parallel processors

    Directory of Open Access Journals (Sweden)

    Kaufmann Michael

    2004-09-01

    Full Text Available Abstract Background Parallel computing is frequently used to speed up computationally expensive tasks in Bioinformatics. Results Herein, a parallel version of the multi-alignment program DIALIGN is introduced. We propose two ways of dividing the program into independent sub-routines that can be run on different processors: (a) pair-wise sequence alignments that are used as a first step to multiple alignment account for most of the CPU time in DIALIGN. Since alignments of different sequence pairs are completely independent of each other, they can be distributed to multiple processors without any effect on the resulting output alignments. (b) For alignments of large genomic sequences, we use a heuristic by splitting up sequences into sub-sequences based on a previously introduced anchored alignment procedure. For our test sequences, this combined approach reduces the program running time of DIALIGN by up to 97%. Conclusions By distributing sub-routines to multiple processors, the running time of DIALIGN can be crucially improved. With these improvements, it is possible to apply the program in large-scale genomics and proteomics projects that were previously beyond its scope.
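Scheme (a) above works because the pairwise alignments share no state. The independence structure can be sketched as follows; `pairwise_score` is a toy identity count, not DIALIGN's actual segment-based alignment.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import combinations

def pairwise_score(pair):
    """Toy stand-in for one pairwise alignment: counts matching
    positions. The real DIALIGN step would compute a segment-based
    alignment here; only the independence structure matters."""
    a, b = pair
    return (a, b), sum(x == y for x, y in zip(a, b))

def all_pairs_parallel(seqs, workers=2):
    """Scheme (a): distribute the n*(n-1)/2 independent pairwise
    alignments over multiple processes; the results are unaffected
    by the distribution because the tasks share no state."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(pairwise_score, combinations(seqs, 2)))

if __name__ == "__main__":
    print(all_pairs_parallel(["ACGT", "ACGA", "TCGT"]))
```

Since the pairwise step dominates DIALIGN's CPU time, distributing exactly this loop is what yields the large running-time reductions reported above.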

  2. Associative Memory design for the Fast TracK processor (FTK) at Atlas

    CERN Document Server

    Annovi, A; The ATLAS collaboration; Bossini, E; Crescioli, F; Dell'Orso, M; Giannetti, P; Piendibene, M; Sacco, I; Sartori, L; Tripiccione, R

    2010-01-01

    We describe a VLSI processor for pattern recognition based on Content Addressable Memory (CAM) architecture, optimized for on-line track finding in high-energy physics experiments. A large CAM bank stores all trajectories of interest and extracts the ones compatible with a given event. This task is naturally parallelized by a CAM architecture able to output identified trajectories, recognized among a huge number of possible combinations, in just a few 100 MHz clock cycles. We have developed this device (called the AMchip03 processor), using 180 nm technology, for the Silicon Vertex Trigger (SVT) upgrade at CDF using a standard-cell VLSI design methodology. We now propose a new design (90 nm technology) in which we introduce a full-custom standard cell, a customized design that allows us to maximize the pattern density and minimize the power consumption. We also discuss possible future extensions based on 3-D technology. This processor has a flexible and easily configurable structure that makes it suitabl...

  3. A highly efficient multi-core algorithm for clustering extremely large datasets

    Directory of Open Access Journals (Sweden)

    Kraus Johann M

    2010-04-01

    Full Text Available Abstract Background In recent years, the demand for computational power in computational biology has increased due to rapidly growing data sets from microarray and other high-throughput technologies. This demand is likely to increase. Standard algorithms for analyzing data, such as cluster algorithms, need to be parallelized for fast processing. Unfortunately, most approaches for parallelizing algorithms largely rely on network communication protocols connecting and requiring multiple computers. One answer to this problem is to utilize the intrinsic capabilities in current multi-core hardware to distribute the tasks among the different cores of one computer. Results We introduce a multi-core parallelization of the k-means and k-modes cluster algorithms based on the design principles of transactional memory for clustering gene expression microarray type data and categorical SNP data. Our new shared-memory parallel algorithms prove to be highly efficient. We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis employing repeated runs with slightly changed parameters. Computation speed of our Java-based algorithm was increased by a factor of 10 for large data sets while preserving computational accuracy compared to single-core implementations and a recently published network-based parallelization. Conclusions Most desktop computers and even notebooks provide at least dual-core processors. Our multi-core algorithms show that using modern algorithmic concepts, parallelization makes it possible to perform even such laborious tasks as cluster sensitivity and cluster number estimation on the laboratory computer.
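A shared-memory parallel k-means iteration can be sketched as a plain fork-join over data chunks: the assignment step is parallelized, then the partial sums are reduced into new centroids. This is a generic sketch; the record's transactional-memory design is not reproduced.

```python
from concurrent.futures import ProcessPoolExecutor

def assign_chunk(args):
    """Assignment step for one chunk: label each point with its nearest
    centroid and return partial sums/counts for the update step."""
    points, centroids = args
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in points:
        j = min(range(k),
                key=lambda c: sum((p[d] - centroids[c][d]) ** 2
                                  for d in range(dim)))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts

def kmeans_step(points, centroids, workers=2):
    """One parallel k-means iteration: map chunks, reduce the partials."""
    chunk = max(1, len(points) // workers)
    chunks = [points[i:i + chunk] for i in range(0, len(points), chunk)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(assign_chunk,
                                 [(c, centroids) for c in chunks]))
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for s, c in partials:
        for j in range(k):
            counts[j] += c[j]
            for d in range(dim):
                sums[j][d] += s[j][d]
    return [[sums[j][d] / counts[j] if counts[j] else centroids[j][d]
             for d in range(dim)] for j in range(k)]
```

Repeated runs with perturbed parameters, as used above for cluster stability analysis, are then just repeated calls to this step with different seeds.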

  4. QuickProbs--a fast multiple sequence alignment algorithm designed for graphics processors.

    Science.gov (United States)

    Gudyś, Adam; Deorowicz, Sebastian

    2014-01-01

    Multiple sequence alignment is a crucial task in a number of biological analyses like secondary structure prediction, domain searching, phylogeny, etc. MSAProbs is currently the most accurate alignment algorithm, but its effectiveness is obtained at the expense of computational time. In the paper we present QuickProbs, a variant of MSAProbs customised for graphics processors. We selected the two most time-consuming stages of MSAProbs to be redesigned for GPU execution: the posterior matrices calculation and the consistency transformation. Experiments on three popular benchmarks (BAliBASE, PREFAB, OXBench-X) on a quad-core PC equipped with a high-end graphics card show QuickProbs to be 5.7 to 9.7 times faster than the original CPU-parallel MSAProbs. Additional tests performed on several protein families from the Pfam database give an overall speed-up of 6.7. Compared to other algorithms like MAFFT, MUSCLE, or ClustalW, QuickProbs proved to be much more accurate at similar speed. Additionally we introduce a tuned variant of QuickProbs which is significantly more accurate on sets of distantly related sequences than MSAProbs without exceeding its computation time. The GPU part of QuickProbs was implemented in OpenCL, thus the package is suitable for graphics processors produced by all major vendors.

  5. QuickProbs—A Fast Multiple Sequence Alignment Algorithm Designed for Graphics Processors

    Science.gov (United States)

    Gudyś, Adam; Deorowicz, Sebastian

    2014-01-01

    Multiple sequence alignment is a crucial task in a number of biological analyses like secondary structure prediction, domain searching, phylogeny, etc. MSAProbs is currently the most accurate alignment algorithm, but its effectiveness is obtained at the expense of computational time. In the paper we present QuickProbs, a variant of MSAProbs customised for graphics processors. We selected the two most time-consuming stages of MSAProbs to be redesigned for GPU execution: the posterior matrices calculation and the consistency transformation. Experiments on three popular benchmarks (BAliBASE, PREFAB, OXBench-X) on a quad-core PC equipped with a high-end graphics card show QuickProbs to be 5.7 to 9.7 times faster than the original CPU-parallel MSAProbs. Additional tests performed on several protein families from the Pfam database give an overall speed-up of 6.7. Compared to other algorithms like MAFFT, MUSCLE, or ClustalW, QuickProbs proved to be much more accurate at similar speed. Additionally we introduce a tuned variant of QuickProbs which is significantly more accurate on sets of distantly related sequences than MSAProbs without exceeding its computation time. The GPU part of QuickProbs was implemented in OpenCL, thus the package is suitable for graphics processors produced by all major vendors. PMID:24586435

  6. QuickProbs--a fast multiple sequence alignment algorithm designed for graphics processors.

    Directory of Open Access Journals (Sweden)

    Adam Gudyś

    Full Text Available Multiple sequence alignment is a crucial task in a number of biological analyses like secondary structure prediction, domain searching, phylogeny, etc. MSAProbs is currently the most accurate alignment algorithm, but its effectiveness is obtained at the expense of computational time. In the paper we present QuickProbs, a variant of MSAProbs customised for graphics processors. We selected the two most time-consuming stages of MSAProbs to be redesigned for GPU execution: the posterior matrices calculation and the consistency transformation. Experiments on three popular benchmarks (BAliBASE, PREFAB, OXBench-X) on a quad-core PC equipped with a high-end graphics card show QuickProbs to be 5.7 to 9.7 times faster than the original CPU-parallel MSAProbs. Additional tests performed on several protein families from the Pfam database give an overall speed-up of 6.7. Compared to other algorithms like MAFFT, MUSCLE, or ClustalW, QuickProbs proved to be much more accurate at similar speed. Additionally we introduce a tuned variant of QuickProbs which is significantly more accurate on sets of distantly related sequences than MSAProbs without exceeding its computation time. The GPU part of QuickProbs was implemented in OpenCL, thus the package is suitable for graphics processors produced by all major vendors.

  7. The Serial Link Processor for the Fast TracKer (FTK) at ATLAS

    CERN Document Server

    Annovi, A; The ATLAS collaboration; Beccherle, R; Beretta, M; Biesuz, N; Billereau, W; Combe, J M; Citterio, M; Citraro, S; Crescioli, F; Dimas, D; Donati, S; Gentsos, C; Giannetti, P; Kordas, K; Lanza, A; Liberali, V; Luciano, P; Magalotti, D; Neroutsos, P; Nikolaidis, S; Piendibene, M; Rossi, E; Sakellariou, A; Shojaii, S; Sotiropoulou, C-L; Stabile, A; Vulliez, P

    2014-01-01

    The Associative Memory (AM) system of the FTK processor has been designed to perform pattern matching using the hit information of the ATLAS silicon tracker. The AM is the heart of the FTK: it finds track candidates at low resolution that serve as seeds for full-resolution track fitting. To solve the very challenging data traffic problem inside the FTK, multiple designs and tests have been performed. The currently proposed solution is named the "Serial Link Processor" and is based on an extremely powerful network of 2 Gb/s serial links. This paper reports on the design of the Serial Link Processor, consisting of the AM chip, an ASIC designed and optimized to perform pattern matching, and two types of boards: the Local Associative Memory Board (LAMB), a mezzanine where the AM chips are mounted, and the Associative Memory Board (AMB), a 9U VME board which holds and exercises four LAMBs. We also report on the performance of a first prototype based on the use of a min@sic AM chip, a small but complete version ...

  8. A Parallel FPGA Implementation for Real-Time 2D Pixel Clustering for the ATLAS Fast TracKer Processor

    CERN Document Server

    Sotiropoulou, C-L; The ATLAS collaboration; Annovi, A; Beretta, M; Kordas, K; Nikolaidis, S; Petridou, C; Volpi, G

    2014-01-01

    The parallel 2D pixel clustering FPGA implementation used for the input system of the ATLAS Fast TracKer (FTK) processor is presented. The input system for the FTK processor will receive data from the Pixel and micro-strip detectors from inner ATLAS read-out drivers (RODs) at full rate, for a total of 760 Gb/s, as sent by the RODs after level-1 triggers. Clustering serves two purposes: the first is to reduce the high rate of the received data before further processing; the second is to determine the cluster centroid to obtain the best spatial measurement. For the pixel detectors the clustering is implemented by using a 2D-clustering algorithm that takes advantage of a moving window technique to minimize the logic required for cluster identification. The cluster detection window size can be adjusted for optimizing the cluster identification process. Additionally, the implementation can be parallelized by instantiating multiple cores to identify different clusters independently, thus exploiting more FPGA resources. ...
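The two purposes named above, grouping adjacent hit pixels and computing each cluster's centroid, can be mirrored in a small software analogue. This connected-component sketch is an illustration only; the FPGA design uses a sliding detection window rather than breadth-first search.

```python
from collections import deque

def cluster_hits(hits):
    """Software analogue of the clustering step: group adjacent hit
    pixels (8-connectivity) and return one centroid per cluster,
    reducing many raw hits to a few spatial measurements."""
    remaining = set(hits)
    centroids = []
    while remaining:
        seed = remaining.pop()
        cluster, queue = [seed], deque([seed])
        while queue:                      # flood-fill one cluster
            x, y = queue.popleft()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    n = (x + dx, y + dy)
                    if n in remaining:
                        remaining.remove(n)
                        cluster.append(n)
                        queue.append(n)
        cx = sum(p[0] for p in cluster) / len(cluster)
        cy = sum(p[1] for p in cluster) / len(cluster)
        centroids.append((cx, cy))
    return sorted(centroids)

# Two separated groups of pixel hits reduce to two centroids:
print(cluster_hits([(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]))
```

Parallelizing this by region, one worker per detector window, corresponds loosely to the multiple independent clustering cores instantiated on the FPGA.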

  9. A Parallel FPGA Implementation for Real-Time 2D Pixel Clustering for the ATLAS Fast TracKer Processor

    CERN Document Server

    Sotiropoulou, C-L; The ATLAS collaboration; Annovi, A; Beretta, M; Kordas, K; Nikolaidis, S; Petridou, C; Volpi, G

    2014-01-01

    The parallel 2D pixel clustering FPGA implementation used for the input system of the ATLAS Fast TracKer (FTK) processor is presented. The input system for the FTK processor will receive data from the Pixel and micro-strip detectors from inner ATLAS read-out drivers (RODs) at full rate, for a total of 760 Gb/s, as sent by the RODs after level-1 triggers. Clustering serves two purposes: the first is to reduce the high rate of the received data before further processing; the second is to determine the cluster centroid to obtain the best spatial measurement. For the pixel detectors the clustering is implemented by using a 2D-clustering algorithm that takes advantage of a moving window technique to minimize the logic required for cluster identification. The cluster detection window size can be adjusted for optimizing the cluster identification process. Additionally, the implementation can be parallelized by instantiating multiple cores to identify different clusters independently, thus exploiting more FPGA resources. ...

  10. Fast start-up of microchannel fuel processor integrated with an igniter for hydrogen combustion

    Science.gov (United States)

    Ryi, Shin Kun; Park, Jong Soo; Cho, Song Ho; Kim, Sung Hyun

    A Pt-Zr catalyst-coated FeCrAlY mesh is introduced into the combustion outlet conduit of a newly designed microchannel reactor (MCR) as an igniter of hydrogen combustion to decrease the start-up time. The catalyst is coated using a wash-coating method. After installing the Pt-Zr/FeCrAlY mesh, the reactor is heated to its running temperature within 1 min by hydrogen combustion. Two plate-type heat-exchangers are introduced at the combustion outlet and reforming outlet conduits of the microchannel reactor in order to recover the heat of the combustion gas and reformed gas, respectively. Using these heat-exchangers, methane steam reforming is carried out with hydrogen combustion, and the reforming capacity and energy efficiency are enhanced by up to 3.4 and 1.7 times, respectively. A compact fuel processor and fuel-cell system using this reactor concept is expected to show considerable advancement.

  11. A fast adaptive convex hull algorithm on two-dimensional processor arrays with a reconfigurable BUS system

    Science.gov (United States)

    Olariu, S.; Schwing, J.; Zhang, J.

    1991-01-01

    A bus system that can change dynamically to suit computational needs is referred to as reconfigurable. We present a fast adaptive convex hull algorithm on a two-dimensional processor array with a reconfigurable bus system (2-D PARBS, for short). Specifically, we show that computing the convex hull of a planar set of n points takes O(log n/log m) time on a 2-D PARBS of size mn × n with 3 ≤ m ≤ n. Our result implies that the convex hull of n points in the plane can be computed in O(1) time on a 2-D PARBS of size n^1.5 × n.
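The sub-logarithmic bound above relies on broadcasting over dynamically formed buses, which cannot be reproduced in software. As a point of reference for the task being solved, here is the standard sequential O(n log n) baseline (Andrew's monotone chain), not the PARBS algorithm itself:

```python
def convex_hull(points):
    """Andrew's monotone chain: a sequential O(n log n) baseline for
    the planar convex hull task that the 2-D PARBS algorithm solves
    in O(log n / log m) time using reconfigurable buses."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):  # z-component of (a-o) x (b-o)
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])

    lower, upper = [], []
    for p in pts:                       # build lower hull
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):             # build upper hull
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]      # counter-clockwise hull

print(convex_hull([(0, 0), (2, 0), (1, 1), (2, 2), (0, 2)]))
# → [(0, 0), (2, 0), (2, 2), (0, 2)]  (interior point dropped)
```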

  12. DYNAMIC TASK SCHEDULING ON MULTICORE AUTOMOTIVE ECUS

    Directory of Open Access Journals (Sweden)

    Geetishree Mishra

    2014-12-01

    Full Text Available Automobile manufacturers are controlled by stringent government regulations for safety and fuel emissions and motivated towards adding more advanced features and sophisticated applications to the existing electronic system. Ever increasing customers' demands for a high level of comfort also necessitate providing even more sophistication in the vehicle electronics system. All these directly make the vehicle software system more complex and computationally more intensive. In turn, this demands very high computational capability of the microprocessor used in the electronic control unit (ECU). In this regard, multicore processors have already been implemented in some of the computationally intensive ECUs like power train, image processing and infotainment. To achieve greater performance from these multicore processors, parallelized ECU software needs to be efficiently scheduled by the underlying operating system for execution to utilize all the computational cores to the maximum extent possible and meet the real-time constraints. In this paper, we propose a dynamic task scheduler for a multicore engine control ECU that provides maximum CPU utilization, minimized preemption overhead, minimum average waiting time, and all tasks meeting their real-time deadlines when compared to the static priority scheduling suggested by the Automotive Open Systems Architecture (AUTOSAR).
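Dynamic-priority scheduling of the kind contrasted here with AUTOSAR's static priorities can be illustrated with earliest-deadline-first (EDF). This is a minimal single-core, non-preemptive sketch, not the paper's actual multicore ECU scheduler:

```python
import heapq

def edf_schedule(tasks):
    """Minimal earliest-deadline-first sketch (single core,
    non-preemptive). Each task is (release, wcet, deadline); returns
    the execution order and whether every deadline was met."""
    tasks = sorted(tasks)                     # by release time
    ready, order, t, i, ok = [], [], 0, 0, True
    while i < len(tasks) or ready:
        if not ready and t < tasks[i][0]:
            t = tasks[i][0]                   # idle until next release
        while i < len(tasks) and tasks[i][0] <= t:
            r, c, d = tasks[i]
            heapq.heappush(ready, (d, r, c))  # priority = deadline
            i += 1
        d, r, c = heapq.heappop(ready)        # earliest deadline first
        t += c
        order.append((r, c, d))
        ok = ok and t <= d
    return order, ok

jobs = [(0, 2, 10), (0, 3, 5), (1, 1, 4)]   # (release, wcet, deadline)
print(edf_schedule(jobs))
```

Unlike a fixed static priority, the priority here is re-evaluated at every dispatch from the current deadlines, which is what allows such schedulers to improve utilization and deadline compliance.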

  13. Dynamic Task Scheduling on Multicore Automotive ECUs

    Directory of Open Access Journals (Sweden)

    Geetishree Mishra

    2014-12-01

    Full Text Available Automobile manufacturers are controlled by stringent government regulations for safety and fuel emissions and motivated towards adding more advanced features and sophisticated applications to the existing electronic system. Ever increasing customers' demands for a high level of comfort also necessitate providing even more sophistication in the vehicle electronics system. All these directly make the vehicle software system more complex and computationally more intensive. In turn, this demands very high computational capability of the microprocessor used in the electronic control unit (ECU). In this regard, multicore processors have already been implemented in some of the task-rigorous ECUs like power train, image processing and infotainment. To achieve greater performance from these multicore processors, parallelized ECU software needs to be efficiently scheduled by the underlying operating system for execution to utilize all the computational cores to the maximum extent possible and meet the real-time constraints. In this paper, we propose a dynamic task scheduler for a multicore engine control ECU that provides maximum CPU utilization, minimized preemption overhead, minimum average waiting time, and all tasks meeting their real-time deadlines when compared to the static priority scheduling suggested by the Automotive Open Systems Architecture (AUTOSAR).

  14. Algorithms for Optimally Arranging Multicore Memory Structures

    Directory of Open Access Journals (Sweden)

    Wei-Che Tseng

    2010-01-01

    Full Text Available As more processing cores are added to embedded systems processors, the relationships between cores and memories have more influence on the energy consumption of the processor. In this paper, we conduct fundamental research to explore the effects of memory sharing on energy in a multicore processor. We study the Memory Arrangement (MA) problem. We prove that the general case of MA is NP-complete. We present an optimal algorithm for solving linear MA and optimal and heuristic algorithms for solving rectangular MA. On average, we can produce arrangements that consume 49% less energy than an all-shared memory arrangement and 14% less energy than an all-private memory arrangement for randomly generated instances. For DSP benchmarks, we can produce arrangements that, on average, consume 20% less energy than an all-shared memory arrangement and 27% less energy than an all-private memory arrangement.

  15. MonetDB/XQuery: A fast XQuery processor powered by a relational engine

    NARCIS (Netherlands)

    Boncz, P.A.; Grust, T.; Keulen, M. van; Manegold, S.; Rittinger, J.; Teubner, J.

    2006-01-01

    Relational XQuery systems try to re-use mature relational data management infrastructures to create fast and scalable XML database technology. This paper describes the main features, key contributions, and lessons learned while implementing such a system. Its architecture consists of (i) a range-bas

  16. MonetDB/XQuery: a fast XQuery processor powered by a relational engine

    NARCIS (Netherlands)

    Boncz, P.; Grust, T.; Keulen, van M.; Manegold, S.; Rittinger, J.; Teubner, J.

    2006-01-01

    Relational XQuery systems try to re-use mature relational data management infrastructures to create fast and scalable XML database technology. This paper describes the main features, key contributions, and lessons learned while implementing such a system. Its architecture consists of (i) a range-bas

  17. An image sensor with fast extraction of objects' positions - Rough vision processor

    OpenAIRE

    Watanabe, Akira; Miyama, Masayuki; Tooyama, Osamu; Yoshimoto, Masahiko; Akita, Junichi

    2001-01-01

    An integration of the signal processing circuits with the image-acquiring device, called the vision chip, which can process information in parallel, is proposed for fast image processing. In applications for robot vision, not only the detailed information, such as shape or texture, but also the rough information, such as 'something is around here', are important and useful. We consider the detection of centroids of objects in the focal plane as the rough vision processing, which is usef...

  18. The Associative Memory Serial Link Processor for the Fast TracKer (FTK) at ATLAS

    CERN Document Server

    Andreani, A; The ATLAS collaboration; Beccherle, R; Beretta, M; Biesuz, N; Billereau, W; Cipriani, R; Citraro, S; Citterio, M; Colombo, A; Combe, J M; Crescioli, F; Donati, S; Gentsos, C; Giannetti, P; Kordas, K; Lanza, A; Liberali, V; Luciano, P; Magalotti, D; Neroutsos, P; Nikolaidis, S; Piendibene, M; Rossi, E; Shojaii, S; Sotiropoulou, C-L; Stabile, A; Vulliez, P

    2014-01-01

    The Fast TracKer (FTK) is an extremely powerful and very compact processing unit, essential for efficient Level 2 trigger selection in future high-energy physics experiments at LHC. FTK employs Associative Memories (AM) to perform pattern recognition; input and output data are transmitted over serial links at 2 Gbit/s, to reduce routing congestion at board level. Prototypes of the AM chip and of the AM board have been manufactured and tested, in view of the imminent design of the final version.

  19. Energy consumption model over parallel programs implemented on multicore architectures

    Directory of Open Access Journals (Sweden)

    Ricardo Isidro-Ramirez

    2015-06-01

    Full Text Available In High Performance Computing, energy consumption is becoming an important aspect to consider. Because of the high cost of energy production in all countries, saving energy plays an important role, reflected in efforts to reduce the energy requirements of hardware components and applications. Several options have appeared for scaling down energy use and, consequently, scaling up energy efficiency. One of these strategies is the multithread programming paradigm, whose purpose is to produce parallel programs able to use the full amount of computing resources available in a microprocessor. That energy-saving strategy focuses on efficient use of the multicore processors found in various computing devices, such as mobile devices. In fact, as a growing trend, multicore processors have been part of various special-purpose computers since 2003, from High Performance Computing servers to mobile devices. However, it is not clear how multiprogramming affects energy efficiency. This paper presents an analysis of different types of multicore-based architectures used in computing, and then a valid model is presented. Based on Amdahl's Law, a model that considers different scenarios of energy use in multicore architectures is proposed. Some interesting results were found from experiments with the developed algorithm, which was executed in both parallel and sequential ways. A lower limit of energy consumption was found for one type of multicore architecture, and this behavior was observed experimentally.
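An Amdahl-style energy estimate of this kind can be sketched as follows. The record does not give the model's details, so the active/idle power split below is an assumption, not the paper's formulation:

```python
def speedup(p, n):
    """Amdahl's law: p = parallel fraction of the work, n = cores."""
    return 1.0 / ((1.0 - p) + p / n)

def energy(p, n, t1=1.0, p_active=1.0, p_idle=0.3):
    """Assumed Amdahl-style energy sketch (not the paper's model):
    during the sequential phase one core is active and n-1 idle;
    during the parallel phase all n cores are active."""
    t_seq = (1.0 - p) * t1          # sequential phase duration
    t_par = p * t1 / n              # parallel phase duration
    return (t_seq * (p_active + (n - 1) * p_idle)
            + t_par * n * p_active)

if __name__ == "__main__":
    p = 0.9
    for n in (1, 2, 4, 8, 16):
        print(n, round(speedup(p, n), 2), round(energy(p, n), 3))
```

Under these assumed parameters the idle power of waiting cores makes total energy grow with the core count even as runtime shrinks, which is the kind of speed/energy trade-off such models are built to expose.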

  20. Evaluating Multicore Algorithms on the Unified Memory Model

    Directory of Open Access Journals (Sweden)

    John E. Savage

    2009-01-01

    Full Text Available One of the challenges to achieving good performance on multicore architectures is the effective utilization of the underlying memory hierarchy. While this is an issue for single-core architectures, it is a critical problem for multicore chips. In this paper, we formulate the unified multicore model (UMM) to help understand the fundamental limits on cache performance on these architectures. The UMM seamlessly handles different types of multiple-core processors with varying degrees of cache sharing at different levels. We demonstrate that our model can be used to study a variety of multicore architectures on a variety of applications. In particular, we use it to analyze an option pricing problem using the trinomial model and develop an algorithm for it that has near-optimal memory traffic between cache levels. We have implemented the algorithm on two quad-core Intel Xeon 5310 1.6 GHz processors (8 cores). It achieves a peak performance of 19.5 GFLOPs, which is 38% of the theoretical peak of the multicore system. We demonstrate that our algorithm outperforms compiler-optimized and auto-parallelized code by a factor of up to 7.5.
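The trinomial option-pricing kernel analysed above can be sketched in plain form. This uses standard Boyle-style tree parameters and none of the paper's cache-aware blocking; it only shows the backward-induction traffic pattern that such blocking optimizes:

```python
import math

def trinomial_call(s0, k, r, sigma, t, steps):
    """European call on a recombining trinomial tree (standard
    Boyle-style parameters; sketch of the kernel only, without the
    paper's cache-optimal blocking between memory levels)."""
    dt = t / steps
    u = math.exp(sigma * math.sqrt(2.0 * dt))     # up factor per step
    su = math.exp(sigma * math.sqrt(dt / 2.0))
    sd = 1.0 / su
    er = math.exp(r * dt / 2.0)
    pu = ((er - sd) / (su - sd)) ** 2             # up probability
    pd = ((su - er) / (su - sd)) ** 2             # down probability
    pm = 1.0 - pu - pd                            # middle probability
    disc = math.exp(-r * dt)
    # payoffs on the 2*steps + 1 terminal nodes
    vals = [max(s0 * u ** (j - steps) - k, 0.0)
            for j in range(2 * steps + 1)]
    # backward induction, one tree level at a time
    for level in range(steps - 1, -1, -1):
        vals = [disc * (pu * vals[j + 2] + pm * vals[j + 1] + pd * vals[j])
                for j in range(2 * level + 1)]
    return vals[0]

price = trinomial_call(s0=100, k=100, r=0.05, sigma=0.2, t=1.0, steps=200)
print(round(price, 2))  # converges toward the Black-Scholes value ≈ 10.45
```

Each level reads three adjacent values per node, so the sweep is memory-bandwidth bound; tiling several levels into cache is what yields the near-optimal memory traffic reported above.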

  1. Improving the scalability of multicore systems with a focus on H.264 video decoding

    NARCIS (Netherlands)

    Meenderinck, C.H.

    2010-01-01

    In pursuit of ever increasing performance, more and more processor architectures have become multicore processors. As clock frequency was no longer increasing rapidly and ILP techniques showed diminishing results, increasing the number of cores per chip was the natural choice. The transistor budget

  2. Evolution of CMS Workload Management Towards Multicore Job Support

    Energy Technology Data Exchange (ETDEWEB)

    Perez-Calero Yzquierdo, A. [Madrid, CIEMAT; Hernández, J. M. [Madrid, CIEMAT; Khan, F. A. [Quaid-i-Azam U.; Letts, J. [UC, San Diego; Majewski, K. [Fermilab; Rodrigues, A. M. [Fermilab; McCrea, A. [UC, San Diego; Vaandering, E. [Fermilab

    2015-12-23

    The successful exploitation of multicore processor architectures is a key element of the LHC distributed computing system in the coming era of the LHC Run 2. High-pileup complex-collision events represent a challenge for the traditional sequential programming in terms of memory and processing time budget. The CMS data production and processing framework is introducing the parallel execution of the reconstruction and simulation algorithms to overcome these limitations. CMS plans to execute multicore jobs while still supporting single-core processing for other tasks difficult to parallelize, such as user analysis. The CMS strategy for job management thus aims at integrating single and multicore job scheduling across the Grid. This is accomplished by employing multicore pilots with internal dynamic partitioning of the allocated resources, capable of running payloads of various core counts simultaneously. An extensive test programme has been conducted to enable multicore scheduling with the various local batch systems available at CMS sites, with the focus on the Tier-0 and Tier-1s, responsible during 2015 for the prompt data reconstruction. Scale tests have been run to analyse the performance of this scheduling strategy and ensure an efficient use of the distributed resources. This paper presents the evolution of the CMS job management and resource provisioning systems in order to support this hybrid scheduling model, as well as its deployment and performance tests, which will enable CMS to transition to a multicore production model for the second LHC run.

  3. Evolution of CMS workload management towards multicore job support

    Science.gov (United States)

    Pérez-Calero Yzquierdo, A.; Hernández, J. M.; Khan, F. A.; Letts, J.; Majewski, K.; Rodrigues, A. M.; McCrea, A.; Vaandering, E.

    2015-12-01

    The successful exploitation of multicore processor architectures is a key element of the LHC distributed computing system in the coming era of the LHC Run 2. High-pileup complex-collision events represent a challenge for the traditional sequential programming in terms of memory and processing time budget. The CMS data production and processing framework is introducing the parallel execution of the reconstruction and simulation algorithms to overcome these limitations. CMS plans to execute multicore jobs while still supporting single-core processing for other tasks difficult to parallelize, such as user analysis. The CMS strategy for job management thus aims at integrating single and multicore job scheduling across the Grid. This is accomplished by employing multicore pilots with internal dynamic partitioning of the allocated resources, capable of running payloads of various core counts simultaneously. An extensive test programme has been conducted to enable multicore scheduling with the various local batch systems available at CMS sites, with the focus on the Tier-0 and Tier-1s, responsible during 2015 for the prompt data reconstruction. Scale tests have been run to analyse the performance of this scheduling strategy and ensure an efficient use of the distributed resources. This paper presents the evolution of the CMS job management and resource provisioning systems in order to support this hybrid scheduling model, as well as its deployment and performance tests, which will enable CMS to transition to a multicore production model for the second LHC run.

  4. Using Multi-Core Systems for Rover Autonomy

    Science.gov (United States)

    Clement, Brad; Estlin, Tara; Bornstein, Benjamin; Springer, Paul; Anderson, Robert C.

    2010-01-01

    Task Objectives are: (1) Develop and demonstrate key capabilities for rover long-range science operations using multi-core computing, (a) Adapt three rover technologies to execute on SOA multi-core processor (b) Illustrate performance improvements achieved (c) Demonstrate adapted capabilities with rover hardware, (2) Targeting three high-level autonomy technologies (a) Two for onboard data analysis (b) One for onboard command sequencing/planning, (3) Technologies identified as enabling for future missions, (4)Benefits will be measured along several metrics: (a) Execution time / Power requirements (b) Number of data products processed per unit time (c) Solution quality

  5. Multicore systems on-chip practical software/hardware design

    CERN Document Server

    Abdallah, Abderazek Ben

    2013-01-01

    System-on-chip designs have evolved from fairly simple unicore, single-memory designs to complex heterogeneous multicore SoC architectures consisting of a large number of IP blocks on the same silicon. To meet the high computational demands posed by the latest consumer electronic devices, most current systems are based on this paradigm, which represents a real revolution in many aspects of computing. The attraction of multicore processing for power reduction is compelling. By splitting a set of tasks among multiple processor cores, the operating frequency necessary for each core can be reduced, allowing...
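The frequency-splitting argument above can be put in numbers with the classic CMOS dynamic-power model P = C·V²·f. This is an illustrative sketch, not from the book, and it assumes supply voltage can be scaled roughly linearly with frequency, which real silicon only approximates.

```python
def dynamic_power(cap, volt, freq):
    # classic CMOS dynamic-power model: P = C * V^2 * f
    return cap * volt**2 * freq

# hypothetical single-core baseline (illustrative numbers)
C, V, F = 1.0, 1.2, 2.0e9
single = dynamic_power(C, V, F)

# same aggregate throughput on n cores, each at F/n; assume voltage
# scales linearly with frequency (a simplifying assumption)
n = 4
multi = n * dynamic_power(C, V / n, F / n)

print(multi / single)  # about 1/n**2 = 0.0625 of the single-core power
```

Under these assumptions, quartering the clock on four cores cuts dynamic power to roughly one sixteenth, which is why the "attraction for power reduction is compelling".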

  6. A pipeline VLSI design of fast singular value decomposition processor for real-time EEG system based on on-line recursive independent component analysis.

    Science.gov (United States)

    Huang, Kuan-Ju; Shih, Wei-Yeh; Chang, Jui Chung; Feng, Chih Wei; Fang, Wai-Chi

    2013-01-01

    This paper presents a pipelined VLSI design of a fast singular value decomposition (SVD) processor for a real-time electroencephalography (EEG) system based on on-line recursive independent component analysis (ORICA). Since SVD is used frequently in computations of the real-time EEG system, a low-latency and high-accuracy SVD processor is essential. During the EEG system process, the proposed SVD processor aims to solve the diagonal, inverse and inverse square root matrices of the target matrices in real time. Generally, SVD requires a huge amount of computation in hardware implementation. Therefore, this work proposes a novel design concept of data-flow updating to assist the pipelined VLSI implementation. The SVD processor can greatly improve the feasibility of real-time EEG system applications such as brain-computer interfaces (BCIs). The proposed architecture is implemented using TSMC 90 nm CMOS technology. The sample rate of the raw EEG data is 128 Hz. The core size of the SVD processor is 580×580 μm², and the operating frequency is 20 MHz. It consumes 0.774 mW of power during each execution of the 8-channel EEG system.

  7. Intel Cilk Plus for Complex Parallel Algorithms: "Enormous Fast Fourier Transform" (EFFT) Library

    OpenAIRE

    Asai, Ryo; Vladimirov, Andrey

    2014-01-01

    In this paper we demonstrate the methodology for parallelizing the computation of large one-dimensional discrete fast Fourier transforms (DFFTs) on multi-core Intel Xeon processors. DFFTs based on the recursive Cooley-Tukey method have to control cache utilization, memory bandwidth and vector hardware usage, and at the same time scale across multiple threads or compute nodes. Our method builds on single-threaded Intel Math Kernel Library (MKL) implementation of DFFT, and uses the Intel Cilk P...
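As a minimal illustration of the recursive Cooley-Tukey decomposition on which such DFFT libraries are built (a sketch only; the paper's implementation wraps Intel MKL and parallelizes with Cilk Plus, neither of which appears here):

```python
import cmath

def fft(x):
    # radix-2 recursive Cooley-Tukey DFT; len(x) must be a power of two
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])       # DFT of even-indexed samples
    odd = fft(x[1::2])        # DFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # combine with the twiddle factor e^(-2*pi*i*k/n)
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out
```

The recursion exposes the cache and parallelism structure the paper exploits: each half-size subtransform touches a contiguous working set and can be scheduled on a separate core.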

  8. A GPU-based real time high performance computing service in a fast plant system controller prototype for ITER

    Energy Technology Data Exchange (ETDEWEB)

    Nieto, J., E-mail: jnieto@sec.upm.es [Grupo de Investigacion en Instrumentacion y Acustica Aplicada. Universidad Politecnica de Madrid, Crta. Valencia Km-7, Madrid 28031 Spain (Spain); Arcas, G. de; Ruiz, M. [Grupo de Investigacion en Instrumentacion y Acustica Aplicada. Universidad Politecnica de Madrid, Crta. Valencia Km-7, Madrid 28031 Spain (Spain); Vega, J. [Asociacion EURATOM/CIEMAT para Fusion, Madrid (Spain); Lopez, J.M.; Barrera, E. [Grupo de Investigacion en Instrumentacion y Acustica Aplicada. Universidad Politecnica de Madrid, Crta. Valencia Km-7, Madrid 28031 Spain (Spain); Castro, R. [Asociacion EURATOM/CIEMAT para Fusion, Madrid (Spain); Sanz, D. [Grupo de Investigacion en Instrumentacion y Acustica Aplicada. Universidad Politecnica de Madrid, Crta. Valencia Km-7, Madrid 28031 Spain (Spain); Utzel, N.; Makijarvi, P.; Zabeo, L. [ITER Organization, CS 90 046, 13067 St. Paul lez Durance Cedex (France)

    2012-12-15

    Highlights: ► Implementation of a fast plant system controller (FPSC) for ITER CODAC. ► GPU-based real-time high-performance computing service. ► Performance evaluation with respect to other solutions based on multi-core processors. - Abstract: EURATOM/CIEMAT and the Technical University of Madrid (UPM) are involved in the development of a FPSC (fast plant system controller) prototype for ITER based on the PXIe form factor. The FPSC architecture includes a GPU-based real-time high-performance computing service which has been integrated under EPICS (Experimental Physics and Industrial Control System). In this work we present the design of this service and its performance evaluation with respect to other solutions based on multi-core processors. Plasma pre-processing algorithms, illustrative of the type of tasks that could be required for both control and diagnostics, are used during the performance evaluation.

  9. Improved Thermal Reconstruction Method Based on Dynamic Voronoi Diagram with Non-uniform Sampling on Multicore Processors

    Institute of Scientific and Technical Information of China (English)

    李鑫; 戎蒙恬; 刘涛; 周亮

    2013-01-01

    The low accuracy of hot-spot temperature estimation on microprocessors can lead to a higher probability of false alarms and unnecessary responses, which reduces the reliability of computer systems. In this paper, an improved thermal reconstruction method based on a dynamic Voronoi diagram with non-uniform sampling on multicore processors is proposed. Experimental results indicate that the proposed method significantly outperforms spectral analysis techniques in both average thermal reconstruction error and hot-spot temperature error. It can therefore be better applied in dynamic thermal management techniques to achieve accurate global and local thermal monitoring.
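The Voronoi idea behind the method can be sketched as nearest-sensor interpolation: each die location inherits the reading of the sensor whose Voronoi cell contains it. A toy Python version follows (illustrative names only; the paper's dynamic diagram updates and non-uniform sample placement are omitted):

```python
def voronoi_reconstruct(sensors, grid_w, grid_h):
    # sensors: dict mapping sensor position (x, y) -> temperature reading
    # assign every grid cell the reading of its nearest sensor, i.e. the
    # sensor whose Voronoi cell the grid point falls into
    field = [[0.0] * grid_w for _ in range(grid_h)]
    for y in range(grid_h):
        for x in range(grid_w):
            nearest = min(sensors,
                          key=lambda s: (s[0] - x) ** 2 + (s[1] - y) ** 2)
            field[y][x] = sensors[nearest]
    return field
```

For example, with sensors at two corners of a 4×4 grid, every cell takes on the reading of whichever corner sensor is closer.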

  10. High-Efficient Parallel CAVLC Encoders on Heterogeneous Multicore Architectures

    Directory of Open Access Journals (Sweden)

    H. Y. Su

    2012-04-01

    Full Text Available This article presents two high-efficiency parallel realizations of context-based adaptive variable length coding (CAVLC) on heterogeneous multicore processors. By optimizing the architecture of the CAVLC encoder, three kinds of dependences are eliminated or weakened: the context-based data dependence, the memory-access dependence and the control dependence. The CAVLC pipeline is divided into three stages (two scans, coding, and lag packing) and is implemented on two typical heterogeneous multicore architectures. One is a block-based SIMD parallel CAVLC encoder on the multicore stream processor STORM. The other is a component-oriented SIMT parallel encoder on the massively parallel GPU architecture. Both of them exploit rich data-level parallelism. Experimental results show that, compared with the CPU version, a speedup of more than 70 times is obtained on STORM and over 50 times on the GPU. The STORM implementation achieves real-time processing for 1080p@30fps, and the GPU-based version satisfies the requirements for 720p real-time encoding. The throughput of the presented CAVLC encoders is more than 10 times higher than that of published software encoders on DSP and multicore platforms.

  11. High-throughput Bayesian Network Learning using Heterogeneous Multicore Computers.

    Science.gov (United States)

    Linderman, Michael D; Athalye, Vivek; Meng, Teresa H; Asadi, Narges Bani; Bruggner, Robert; Nolan, Garry P

    2010-06-01

    Aberrant intracellular signaling plays an important role in many diseases. The causal structure of signal transduction networks can be modeled as Bayesian Networks (BNs), and computationally learned from experimental data. However, learning the structure of Bayesian Networks (BNs) is an NP-hard problem that, even with fast heuristics, is too time consuming for large, clinically important networks (20-50 nodes). In this paper, we present a novel graphics processing unit (GPU)-accelerated implementation of a Monte Carlo Markov Chain-based algorithm for learning BNs that is up to 7.5-fold faster than current general-purpose processor (GPP)-based implementations. The GPU-based implementation is just one of several implementations within the larger application, each optimized for a different input or machine configuration. We describe the methodology we use to build an extensible application, assembled from these variants, that can target a broad range of heterogeneous systems, e.g., GPUs, multicore GPPs. Specifically we show how we use the Merge programming model to efficiently integrate, test and intelligently select among the different potential implementations.

  12. Multi-Core Technology for Fault-Tolerant High-Performance Spacecraft Computer Systems

    Science.gov (United States)

    Behr, Peter M.; Haulsen, Ivo; Van Kampenhout, J. Reinier; Pletner, Samuel

    2012-08-01

    The current architectural trends in the field of multi-core processors can provide an enormous increase in processing power by exploiting the parallelism available in many applications. In particular because of their high energy efficiency, it is obvious that multi-core processor-based systems will also be used in future space missions. In this paper we present the system architecture of a powerful optical sensor system based on the eight core multi-core processor P4080 from Freescale. The fault tolerant structure and the highly effective FDIR concepts implemented on different hardware and software levels of the system are described in detail. The space application scenario and thus the main requirements for the sensor system have been defined by a complex tracking sensor application for autonomous landing or docking manoeuvres.

  13. Multicore Challenges and Benefits for High Performance Scientific Computing

    Directory of Open Access Journals (Sweden)

    Ida M.B. Nielsen

    2008-01-01

    Full Text Available Until recently, performance gains in processors were achieved largely by improvements in clock speeds and instruction level parallelism. Thus, applications could obtain performance increases with relatively minor changes by upgrading to the latest generation of computing hardware. Currently, however, processor performance improvements are realized by using multicore technology and hardware support for multiple threads within each core, and taking full advantage of this technology to improve the performance of applications requires exposure of extreme levels of software parallelism. We will here discuss the architecture of parallel computers constructed from many multicore chips as well as techniques for managing the complexity of programming such computers, including the hybrid message-passing/multi-threading programming model. We will illustrate these ideas with a hybrid distributed memory matrix multiply and a quantum chemistry algorithm for energy computation using Møller–Plesset perturbation theory.
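A toy sketch of the block decomposition behind the hybrid distributed-memory matrix multiply mentioned above. For portability this sketch uses a thread pool to stand in for both levels of the hybrid model; in the paper's setting the outer level would be MPI ranks and the inner level per-node threads (all names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def matmul_block(args):
    # one worker computes a contiguous block of C's rows; in the hybrid
    # model this would be an MPI rank, itself multithreading the block
    a_block, b = args
    return [[sum(a_row[k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for a_row in a_block]

def hybrid_matmul(a, b, workers=2):
    # split A's rows into one contiguous block per worker
    chunk = (len(a) + workers - 1) // workers
    blocks = [(a[i:i + chunk], b) for i in range(0, len(a), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as ex:
        parts = list(ex.map(matmul_block, blocks))
    # concatenate the row blocks back into the full result
    return [row for part in parts for row in part]
```

The row-block partitioning is what makes the message-passing layer simple: each rank needs only its slice of A plus a copy of B, and results concatenate without communication between workers.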

  14. A Highly Parallel FPGA Implementation of a 2D-Clustering Algorithm for the ATLAS Fast TracKer (FTK) Processor

    CERN Document Server

    Kimura, N; The ATLAS collaboration; Beretta, M; Gatta, M; Gkaitatzis, S; Iizawa, T; Kordas, K; Korikawa, T; Nikolaidis, N; Petridou, P; Sotiropoulou, C-L; Yorita, K; Volpi, G

    2014-01-01

    The highly parallel 2D-clustering FPGA implementation used for the input system of the ATLAS Fast TracKer (FTK) processor is presented. The input system for the FTK processor will receive data from the Pixel and micro-strip detector read-out drivers (RODs) at 760 Gbps, the full rate of level-1 triggers. Clustering serves two purposes. The first is to reduce the high rate of the received data before further processing. The second is to determine the cluster centroid to obtain the best spatial measurement. For the pixel detectors the clustering is implemented by using a 2D-clustering algorithm that takes advantage of a moving window technique to minimize the logic required for cluster identification. The implementation is fully generic, therefore the detection window size can be optimized for the cluster identification process. Additionally, the implementation can be parallelized by instantiating multiple cores to identify different clusters independently thus exploiting more FPGA resources. This flexibility ma...
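The clustering goal described, grouping adjacent pixel hits and computing each cluster's centroid, can be sketched in software as a connected-component search. This is a toy stand-in for the FPGA moving-window implementation, assuming 8-connectivity between hit pixels (the function name and interface are illustrative):

```python
def cluster_hits(hits):
    # group adjacent (8-connected) pixel hits into clusters and return
    # each cluster's centroid, the spatial measurement the FTK input
    # system needs
    hits = set(hits)
    clusters = []
    while hits:
        stack = [hits.pop()]
        comp = []
        while stack:                      # flood-fill one cluster
            x, y = stack.pop()
            comp.append((x, y))
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    n = (x + dx, y + dy)
                    if n in hits:
                        hits.remove(n)
                        stack.append(n)
        cx = sum(p[0] for p in comp) / len(comp)
        cy = sum(p[1] for p in comp) / len(comp)
        clusters.append((cx, cy))
    return clusters
```

The hardware version achieves the same grouping with a sliding detection window rather than an explicit flood fill, trading generality for bounded logic per clock cycle.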

  15. Multiple core computer processor with globally-accessible local memories

    Energy Technology Data Exchange (ETDEWEB)

    Shalf, John; Donofrio, David; Oliker, Leonid

    2016-09-20

    A multi-core computer processor including a plurality of processor cores interconnected in a Network-on-Chip (NoC) architecture, a plurality of caches, each of the plurality of caches being associated with one and only one of the plurality of processor cores, and a plurality of memories, each of the plurality of memories being associated with a different set of at least one of the plurality of processor cores and each of the plurality of memories being configured to be visible in a global memory address space such that the plurality of memories are visible to two or more of the plurality of processor cores.

  16. High-density multicore fibers

    DEFF Research Database (Denmark)

    Takenaga, K.; Matsuo, S.; Saitoh, K.;

    2016-01-01

    High-density single-mode multicore fibers were designed and fabricated. A heterogeneous 30-core fiber realized a low crosstalk of −55 dB. A quasi-single-mode homogeneous 31-core fiber attained the highest core count as a single-mode multicore fiber....

  17. Fast Data Reconstruction Method for a Fourier Transform Imaging Spectrometer Based on Multi-core CPU

    Institute of Scientific and Technical Information of China (English)

    杨智雄; 余春超; 严敏; 郑为建; 雷正刚; 粟宇路

    2014-01-01

    An imaging spectrometer can acquire a two-dimensional spatial image and a one-dimensional spectrum at the same time, which makes it highly useful in color and spectral measurement, true-color image synthesis, military reconnaissance and other fields. To achieve fast reconstruction of Fourier transform imaging spectrometer data, this paper designs an optimized reconstruction algorithm using OpenMP parallel computing, which was applied to the data processing of the Hyper Spectral Imager on the Chinese 'HJ-1' satellite. The results show that the multi-core parallel computing approach can effectively exploit the hardware resources of a multi-core CPU and significantly improve the efficiency of the spectrum reconstruction process. If the technique is applied to a workstation with more cores, real-time processing of Fourier transform imaging spectrometer data on a single computer becomes feasible.
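The per-pixel work being parallelized is, at its core, a Fourier transform of each pixel's interferogram. A toy sketch follows, with a thread pool standing in for the OpenMP worker threads (illustrative names; a magnitude DFT is used here for simplicity, whereas a production pipeline would also apply apodization and phase correction):

```python
import cmath
from concurrent.futures import ThreadPoolExecutor

def spectrum(interferogram):
    # magnitude DFT of one pixel's interferogram -> its reconstructed spectrum
    n = len(interferogram)
    return [abs(sum(interferogram[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]

def reconstruct_cube(pixels, workers=4):
    # reconstruct many pixels in parallel; each pixel is independent, the
    # same data parallelism OpenMP exploits on a multi-core CPU
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(spectrum, pixels))
```

Because every pixel's transform is independent, the speedup scales with core count until memory bandwidth saturates, which matches the workstation scaling the abstract anticipates.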

  18. A Highly Parallel FPGA Implementation of a 2D-Clustering Algorithm for the ATLAS Fast TracKer (FTK) Processor

    CERN Document Server

    Kimura, N; The ATLAS collaboration; Beretta, M; Gatta, M; Gkaitatzis, S; Iizawa, T; Kordas, K; Korikawa, T; Nikolaidis, S; Petridou, C; Sotiropoulou, C-L; Yorita, K; Volpi, G

    2014-01-01

    The highly parallel 2D-clustering FPGA implementation used for the input system of the Fast TracKer (FTK) processor for the ATLAS experiment at the Large Hadron Collider (LHC) at CERN is presented. After the 2013-2014 shutdown period the LHC is expected to increase its luminosity, which will make efficient online selection of rare events more difficult due to the increase in overlapping collisions. FTK is a highly parallelized hardware system that improves online selection by real-time track finding using silicon inner detector information. The FTK system requires fast and robust clustering of hit positions from the silicon detector on FPGAs. We show the development of the original input boards and the implemented clustering algorithm. For the complicated 2D clustering, a moving-window technique is used to economize the limited FPGA resources. The developed boards and the implementation of the clustering algorithm have sufficient processing power to meet the specification for the silicon inner detector of ATLAS for the maximum LH...

  19. Using the automata processor for fast pattern recognition in high energy physics experiments-A proof of concept

    Science.gov (United States)

    Wang, Michael H. L. S.; Cancelo, Gustavo; Green, Christopher; Guo, Deyuan; Wang, Ke; Zmuda, Ted

    2016-10-01

    We explore the Micron Automata Processor (AP) as a suitable commodity technology that can address the growing computational needs of pattern recognition in High Energy Physics (HEP) experiments. A toy detector model is developed for which an electron track confirmation trigger based on the Micron AP serves as a test case. Although primarily meant for high speed text-based searches, we demonstrate a proof of concept for the use of the Micron AP in a HEP trigger application.

  20. Using the automata processor for fast pattern recognition in high energy physics experiments—A proof of concept

    Energy Technology Data Exchange (ETDEWEB)

    Wang, Michael H.L.S., E-mail: mwang@fnal.gov [Fermi National Accelerator Laboratory, Batavia, IL 60510 (United States); Cancelo, Gustavo; Green, Christopher [Fermi National Accelerator Laboratory, Batavia, IL 60510 (United States); Guo, Deyuan; Wang, Ke [University of Virginia, Charlottesville, VA 22904 (United States); Zmuda, Ted [Fermi National Accelerator Laboratory, Batavia, IL 60510 (United States)

    2016-10-01

    We explore the Micron Automata Processor (AP) as a suitable commodity technology that can address the growing computational needs of pattern recognition in High Energy Physics (HEP) experiments. A toy detector model is developed for which an electron track confirmation trigger based on the Micron AP serves as a test case. Although primarily meant for high speed text-based searches, we demonstrate a proof of concept for the use of the Micron AP in a HEP trigger application.

  1. The EDRO board connected to the Associative Memory: a “Baby” FastTracKer processor for the ATLAS experiment

    CERN Document Server

    Annovi, A; The ATLAS collaboration; Bevacqua, V; Cervigni, F; Crescioli, F; Fabbri, L; Giannetti, P; Giorgi, F; Magalotti, D; Negri, A; Piendibene, M; Sbarra, C; Roda, C; Villa, M; Vitillo, R A; Volpi, G

    2011-01-01

    The FastTracKer (FTK) is a dedicated hardware system able to perform online fast and precise track reconstruction of full events in the Atlas experiment within an average latency of a few dozen microseconds. It consists of two pipelined processors: the Associative Memory (AM), which finds low precision tracks called “roads”, and the Track Fitter (TF), which refines the track quality with high precision fits. The FTK design [1] that works well at the Large Hadron Collider (LHC) Phase I upgrade luminosity requires the best of the available technology for tracking in a high occupancy environment. While the new processor is designed for the most demanding LHC conditions, we will begin with existing prototypes, some developed for the SLIM5 collaboration [2], to exercise the FTK functions in the new Atlas environment. The goal is to learn early about the FTK integration in the Atlas TDAQ. The EDRO board (Event Dispatch and Read-Out) receives on a clustering mezzanine (able to calculate the pixel and SCT cluster...

  2. VERTAF/Multi-Core: A SysML-Based Application Framework for Multi-Core Embedded Software Development

    Institute of Scientific and Technical Information of China (English)

    Chao-Sheng Lin; Chun-Hsien Lu; Shang-Wei Lin; Yean-Ru Chen; Pao-Ann Hsiung

    2011-01-01

    Multi-core processors are rapidly becoming prevalent in personal computing and embedded systems. Nevertheless, the programming environment for multi-core processor-based systems is still quite immature and lacks efficient tools. In this work, we present a new VERTAF/Multi-Core framework and show how software code can be automatically generated from SysML models of multi-core embedded systems. We illustrate how model-driven design based on SysML can be seamlessly integrated with Intel's threading building blocks (TBB) and the quantum framework (QF) middleware. We use a digital video recording system to illustrate the benefits of the framework. Our experiments show how SysML/QF/TBB help in making multi-core embedded system programming model-driven, easy, and efficient.

  3. Silicon photonics for multicore fiber communication

    DEFF Research Database (Denmark)

    Ding, Yunhong; Kamchevska, Valerija; Dalgaard, Kjeld

    2016-01-01

    We review our recent work on silicon photonics for multicore fiber communication, including multicore fiber fan-in/fan-out, multicore fiber switches towards reconfigurable optical add/drop multiplexers. We also present multicore fiber based quantum communication using silicon devices....

  4. Fuzzy logic based power-efficient real-time multi-core system

    CERN Document Server

    Ahmed, Jameel; Najam, Shaheryar; Najam, Zohaib

    2017-01-01

    This book focuses on identifying the performance challenges involved in computer architectures, optimal configuration settings and analysing their impact on the performance of multi-core architectures. Proposing a power- and throughput-aware fuzzy-logic-based reconfiguration for Multi-Processor Systems on Chip (MPSoCs) in both simulation and real-time environments, it is divided into two major parts. The first part deals with the simulation-based power- and throughput-aware fuzzy logic reconfiguration for multi-core architectures, presenting the results of a detailed analysis on the factors impacting the power consumption and performance of MPSoCs. In turn, the second part highlights the real-time implementation of fuzzy-logic-based power-efficient reconfigurable multi-core architectures for Intel and Leon3 processors.

  5. Hybrid Parallelism for Volume Rendering on Large, Multi-core Systems

    Science.gov (United States)

    Howison, M.; Bethel, E. W.; Childs, H.

    2011-10-01

    This work studies the performance and scalability characteristics of "hybrid" parallel programming and execution as applied to raycasting volume rendering - a staple visualization algorithm - on a large, multi-core platform. Historically, the Message Passing Interface (MPI) has become the de-facto standard for parallel programming and execution on modern parallel systems. As the computing industry trends towards multi-core processors, with four- and six-core chips common today, as well as processors capable of running hundreds of concurrent threads (GPUs), we wish to better understand how algorithmic and parallel programming choices impact performance and scalability on large, distributed-memory multi-core systems. Our findings indicate that the hybrid-parallel implementation, at levels of concurrency ranging from 1,728 to 216,000, performs better, uses a smaller absolute memory footprint, and consumes less communication bandwidth than the traditional, MPI-only implementation.

  6. Energy Efficient Multi-Core Processing

    Directory of Open Access Journals (Sweden)

    Charles Leech

    2014-06-01

    Full Text Available This paper evaluates the present state of the art of energy-efficient embedded processor design techniques and demonstrates how small, variable-architecture embedded processors may exploit a run-time minimal architectural synthesis technique to achieve greater energy and area efficiency whilst maintaining performance. The picoMIPS architecture is presented, inspired by the MIPS, as an example of a minimal and energy-efficient processor. The picoMIPS is a variable-architecture RISC microprocessor with an application-specific minimised instruction set. Each implementation contains only the necessary datapath elements in order to maximise area efficiency. Due to the relationship between logic gate count and power consumption, energy efficiency is also maximised; the system is therefore designed to perform a specific task in the most efficient processor-based form. The principles of the picoMIPS processor are illustrated with an example of the discrete cosine transform (DCT) and inverse DCT (IDCT) algorithms implemented in a multi-core context to demonstrate the concept of minimal architecture synthesis and how it can be used to produce an application-specific, energy-efficient processor.
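The DCT/IDCT pair used as the example application can be sketched directly from its definition. This is a naive floating-point reference version, not the picoMIPS hardware mapping; scaling conventions follow the common DCT-II/DCT-III pairing:

```python
import math

def dct2(x):
    # forward DCT-II: X_k = sum_t x_t * cos(pi*k*(2t+1) / 2N)
    n = len(x)
    return [sum(x[t] * math.cos(math.pi * k * (2 * t + 1) / (2 * n))
                for t in range(n))
            for k in range(n)]

def idct2(X):
    # matching inverse (DCT-III with the usual 2/N scaling):
    # x_t = (X_0/2 + sum_{k>=1} X_k * cos(pi*k*(2t+1) / 2N)) * 2/N
    n = len(X)
    return [(X[0] / 2 + sum(X[k] * math.cos(math.pi * k * (2 * t + 1) / (2 * n))
                            for k in range(1, n))) * 2 / n
            for t in range(n)]
```

In the minimal-architecture setting, each of these cosine-weighted sums becomes a fixed sequence of shifts and adds per implementation, which is precisely where trimming the datapath to the application pays off.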

  7. Concurrent Programming in Mac OS X and iOS Unleash Multicore Performance with Grand Central Dispatch

    CERN Document Server

    Nahavandipoor, Vandad

    2011-01-01

    Now that multicore processors are coming to mobile devices, wouldn't it be great to take advantage of all those cores without having to manage threads? This concise book shows you how to use Apple's Grand Central Dispatch (GCD) to simplify programming on multicore iOS devices and Mac OS X. Managing your application's resources on more than one core isn't easy, but it's vital. Apps that use only one core in a multicore environment will slow to a crawl. If you know how to program with Cocoa or Cocoa Touch, this guide will get you started with GCD right away, with many examples to help you write...
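GCD itself is used from Objective-C or Swift; as a language-neutral illustration of its core abstraction, a serial dispatch queue (submitted blocks run one at a time, in submission order, off the caller's own thread) can be mimicked like this. The class and method names are illustrative, not Apple's API:

```python
import queue
import threading

class SerialQueue:
    # rough analogue of a GCD serial dispatch queue: tasks execute one
    # at a time, in FIFO order, on a single background worker thread
    def __init__(self):
        self._q = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            task, done = self._q.get()
            task()          # tasks never overlap: one worker, FIFO order
            done.set()

    def dispatch_sync(self, task):
        # like dispatch_sync: enqueue the block and wait for completion
        done = threading.Event()
        self._q.put((task, done))
        done.wait()
```

The point of the abstraction is that callers reason about queues and ordering rather than locks and threads; a concurrent queue would simply hand tasks to a pool instead of a single worker.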

  8. Multi-core: Adding a New Dimension to Computing

    CERN Document Server

    Amin, Md Tanvir Al

    2010-01-01

    Invention of Transistors in 1948 started a new era in technology, called Solid State Electronics. Since then, sustaining development and advancement in electronics and fabrication techniques has caused the devices to shrink in size and become smaller, paving the quest for increasing density and clock speed. That quest has suddenly come to a halt due to fundamental bounds applied by physical laws. But, demand for more and more computational power is still prevalent in the computing world. As a result, the microprocessor industry has started exploring the technology along a different dimension. Speed of a single work unit (CPU) is no longer the concern, rather increasing the number of independent processor cores packed in a single package has become the new concern. Such processors are commonly known as multi-core processors. Scaling the performance by using multiple cores has gained so much attention from the academia and the industry, that not only desktops, but also laptops, PDAs, cell phones and even embedd...

  9. Spaceborne Processor Array

    Science.gov (United States)

    Chow, Edward T.; Schatzel, Donald V.; Whitaker, William D.; Sterling, Thomas

    2008-01-01

    A Spaceborne Processor Array in Multifunctional Structure (SPAMS) can lower the total mass of the electronic and structural overhead of spacecraft, resulting in reduced launch costs, while increasing the science return through dynamic onboard computing. SPAMS integrates the multifunctional structure (MFS) and the Gilgamesh Memory, Intelligence, and Network Device (MIND) multi-core in-memory computer architecture into a single-system super-architecture. This transforms every inch of a spacecraft into a sharable, interconnected, smart computing element to increase computing performance while simultaneously reducing mass. The MIND in-memory architecture provides a foundation for high-performance, low-power, and fault-tolerant computing. The MIND chip has an internal structure that includes memory, processing, and communication functionality. The Gilgamesh is a scalable system comprising multiple MIND chips interconnected to operate as a single, tightly coupled, parallel computer. The array of MIND components shares a global, virtual name space for program variables and tasks that are allocated at run time to the distributed physical memory and processing resources. Individual processor-memory nodes can be activated or powered down at run time to provide active power management and to configure around faults. A SPAMS system comprises a distributed Gilgamesh array built into MFS, interfaces into instrument and communication subsystems, a mass storage interface, and a radiation-hardened flight computer.

  10. Multicore Programming Challenges

    Science.gov (United States)

    Perrone, Michael

    The computer industry is facing fundamental challenges that are driving a major change in the design of computer processors. Due to restrictions imposed by quantum physics, one historical path to higher computer processor performance - by increased clock frequency - has come to an end. Increasing clock frequency now leads to power consumption costs that are too high to justify. As a result, we have seen in recent years that the processor frequencies have peaked and are receding from their high point. At the same time, competitive market conditions are giving business advantage to those companies that can field new streaming applications, handle larger data sets, and update their models to market conditions faster. The desire for newer, faster and larger is driving continued demand for higher computer performance.

  11. Hardware Implementation of a Genetic Algorithm Based Canonical Signed Digit Multiplierless Fast Fourier Transform Processor for Multiband Orthogonal Frequency Division Multiplexing Ultra Wideband Applications

    Directory of Open Access Journals (Sweden)

    Mahmud Benhamid

    2009-01-01

    Full Text Available Problem statement: Ultra Wide Band (UWB) technology has attracted many researchers' attention due to its advantages and its great potential for future applications. The physical layer standard of the Multi-band Orthogonal Frequency Division Multiplexing (MB-OFDM) UWB system is defined by ECMA International. In this standard, the data sampling rate from the analog-to-digital converter to the physical layer is up to 528 Msamples/s. Therefore, it is a challenge to realize the physical layer, especially the components with high computational complexity, in a Very Large Scale Integration (VLSI) implementation. The Fast Fourier Transform (FFT) block, which plays an important role in the MB-OFDM system, is one of these components. Furthermore, the execution time of this module is only 312.5 ns. Therefore, with the traditional approach, high power consumption and hardware cost would be needed to meet the strict specifications of the UWB system. The objective of this study was to design an Application Specific Integrated Circuit (ASIC) FFT processor for this system. The specification was defined from system analysis and literature research. Approach: Based on algorithm and architecture analysis, a novel Genetic Algorithm (GA) based Canonical Signed Digit (CSD) multiplierless 128-point FFT processor and its inverse (IFFT) for MB-OFDM UWB systems was proposed. The proposed pipelined architecture was based on a modified Radix-2² algorithm that has the same number of multipliers as the conventional Radix-2². However, the multiplication complexity and the ROM memory needed for storing twiddle factor coefficients could be eliminated by replacing the conventional complex multipliers with newly proposed GA-optimized CSD constant multipliers. The design was coded in Verilog HDL and targeted the Xilinx Virtex-II FPGA series. It was fully implemented and tested on real hardware using a Virtex-II FG456 prototype board and logic analyzer
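    The shift-and-add idea behind CSD constant multipliers can be sketched in a few lines. The routine below is a generic CSD recoding (an illustration of the number representation only, not the paper's GA-optimized hardware): a constant is rewritten with digits in {-1, 0, 1} such that no two adjacent digits are non-zero, minimizing the add/subtract terms a multiplierless implementation needs.

```python
def to_csd(n):
    """Recode a positive integer into Canonical Signed Digit form:
    digits in {-1, 0, 1}, least-significant first, with no two
    adjacent non-zero digits."""
    digits = []
    while n != 0:
        if n % 2 == 0:
            digits.append(0)
        else:
            d = 2 - (n % 4)      # n % 4 == 1 -> +1, n % 4 == 3 -> -1
            digits.append(d)
            n -= d               # n - d is divisible by 4, forcing a 0 next
        n //= 2
    return digits

def from_csd(digits):
    """Evaluate a CSD digit list back to an integer."""
    return sum(d * (1 << i) for i, d in enumerate(digits))
```

    For example, to_csd(7) yields [-1, 0, 0, 1], i.e. 7 = 8 - 1, so x*7 becomes (x << 3) - x: one subtraction instead of the two additions that plain binary 111 would require.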

  12. Smart Multicore Embedded Systems

    DEFF Research Database (Denmark)

    This book provides a single-source reference to the state-of-the-art of high-level programming models and compilation tool-chains for embedded system platforms. The authors address challenges faced by programmers developing software to implement parallel applications in embedded systems, where very...... specificities of various embedded systems from different industries. Parallel programming tool-chains are described that take as input parameters both the application and the platform model, then determine relevant transformations and mapping decisions on the concrete platform, minimizing user intervention...... and hiding the difficulties related to the correct and efficient use of memory hierarchy and low level code generation. Describes tools and programming models for multicore embedded systems Emphasizes throughout performance per watt scalability Discusses realistic limits of software parallelization Enables...

  13. Multicore Education through Simulation

    Science.gov (United States)

    Ozturk, O.

    2011-01-01

    A project-oriented course for advanced undergraduate and graduate students is described for simulating multiple processor cores. Simics, a free simulator for academia, was utilized to enable students to explore computer architecture, operating systems, and hardware/software cosimulation. Motivation for including this course in the curriculum is…

  15. Multicore Based Open Loop Motor Controller Embedded System for Permanent Magnet Direct Current Motor

    Directory of Open Access Journals (Sweden)

    K. Baskaran

    2012-01-01

    Full Text Available Problem statement: In the modern electronics world, most applications are developed as microcontroller-based embedded systems. Approach: A multicore processor based motor controller was presented to improve the processing speed of the controller and improve the efficiency of the motor by maintaining constant speed. It was based on the combination of a Cortex processor (software core) and a Field Programmable Gate Array (FPGA, hardware core). This multicore combination helped to design an efficient, low-power motor controller. Results: A functional design of the Cortex processor and FPGA in this system was completed using the Actel Libero IDE and IAR embedded IDE software; the PWM signal to control the motor driver circuit was generated by the proposed processor. All function modules were programmed in Very-High-Speed Integrated Circuit Hardware Description Language (VHDL). The advantages of the proposed system were optimized operational performance and low power use. The multicore processor was used to improve the speed of execution and optimize the performance of the controller. Conclusion: A motor can be controlled by this method without knowing its architectural details. This is a low-cost, low-power controller that is easy to use. The simulation and experiment results verified its validity.

  16. FAST: framework for heterogeneous medical image computing and visualization.

    Science.gov (United States)

    Smistad, Erik; Bozorgi, Mohammadmehdi; Lindseth, Frank

    2015-11-01

    Computer systems are becoming increasingly heterogeneous in the sense that they consist of different processors, such as multi-core CPUs and graphic processing units. As the amount of medical image data increases, it is crucial to exploit the computational power of these processors. However, this is currently difficult due to several factors, such as driver errors, processor differences, and the need for low-level memory handling. This paper presents a novel FrAmework for heterogeneouS medical image compuTing and visualization (FAST). The framework aims to make it easier to simultaneously process and visualize medical images efficiently on heterogeneous systems. FAST uses common image processing programming paradigms and hides the details of memory handling from the user, while enabling the use of all processors and cores on a system. The framework is open-source, cross-platform and available online. Code examples and performance measurements are presented to show the simplicity and efficiency of FAST. The results are compared to the Insight Toolkit (ITK) and the Visualization Toolkit (VTK) and show that the presented framework is faster with up to 20 times speedup on several common medical imaging algorithms. FAST enables efficient medical image computing and visualization on heterogeneous systems. Code examples and performance evaluations have demonstrated that the toolkit is both easy to use and performs better than existing frameworks, such as ITK and VTK.

  17. Development of a Fast, Single-pass, Micron-resolution Beam Position Monitor Signal Processor: Beam Test Results from ATF2

    CERN Document Server

    Apsimon, Robert; Burrows, Philip; Christian, Glenn; Constance, Ben; Dabiri Khah, Hamid; Perry, Colin; Resta Lopez, Javier; Swinson, Christina

    2010-01-01

    We present the design of a stripline beam position monitor (BPM) signal processor with low latency (c. 10ns) and micron-level spatial resolution in single-pass mode. Such a BPM processor has applications in single-pass beamlines such as those at linear colliders and FELs. The processor was deployed and tested at the Accelerator Test Facility (ATF2) extraction line at KEK, Japan. We report the beam test results and processor performance, including response, linearity, spatial resolution and latency.

  18. Modern multicore and manycore architectures: Modelling, optimisation and benchmarking a multiblock CFD code

    Science.gov (United States)

    Hadade, Ioan; di Mare, Luca

    2016-08-01

    Modern multicore and manycore processors exhibit multiple levels of parallelism through a wide range of architectural features such as SIMD for data parallel execution or threads for core parallelism. The exploitation of multi-level parallelism is therefore crucial for achieving superior performance on current and future processors. This paper presents the performance tuning of a multiblock CFD solver on Intel SandyBridge and Haswell multicore CPUs and the Intel Xeon Phi Knights Corner coprocessor. Code optimisations have been applied on two computational kernels exhibiting different computational patterns: the update of flow variables and the evaluation of the Roe numerical fluxes. We discuss at great length the code transformations required for achieving efficient SIMD computations for both kernels across the selected devices including SIMD shuffles and transpositions for flux stencil computations and global memory transformations. Core parallelism is expressed through threading based on a number of domain decomposition techniques together with optimisations pertaining to alleviating NUMA effects found in multi-socket compute nodes. Results are correlated with the Roofline performance model in order to assert their efficiency for each distinct architecture. We report significant speedups for single thread execution across both kernels: 2-5X on the multicore CPUs and 14-23X on the Xeon Phi coprocessor. Computations at full node and chip concurrency deliver a factor of three speedup on the multicore processors and up to 24X on the Xeon Phi manycore coprocessor.

  19. Thread mapping using system-level model for shared memory multicores

    Science.gov (United States)

    Mitra, Reshmi

    Exploring thread-to-core mapping options for a parallel application on a multicore architecture is computationally very expensive. For the same algorithm, the mapping strategy (MS) with the best response time may change with data size and thread counts. The primary challenge is to design a fast, accurate and automatic framework for exploring these MSs for large data-intensive applications. This is to ensure that the users can explore the design space within reasonable machine hours, without thorough understanding on how the code interacts with the platform. Response time is related to the cycles per instructions retired (CPI), taking into account both active and sleep states of the pipeline. This work establishes a hybrid approach, based on Markov Chain Model (MCM) and Model Tree (MT) for system-level steady state CPI prediction. It is designed for shared memory multicore processors with coarse-grained multithreading. The thread status is represented by the MCM states. The program characteristics are modeled as the transition probabilities, representing the system moving between active and suspended thread states. The MT model extrapolates these probabilities for the actual application size (AS) from the smaller AS performance. This aspect of the framework, along with, the use of mathematical expressions for the actual AS performance information, results in a tremendous reduction in the CPI prediction time. The framework is validated using an electromagnetics application. The average performance prediction error for steady state CPI results with 12 different MSs is less than 1%. The total run time of model is of the order of minutes, whereas the actual application execution time is in terms of days.

  20. Power Dissipation Challenges in Multicore Floating-Point Units

    DEFF Research Database (Denmark)

    Liu, Wei; Nannarelli, Alberto

    2010-01-01

    With increased densities on chips and the growing popularity of multicore processors and general-purpose graphics processing units (GPGPUs) power dissipation and energy consumption pose a serious challenge in the design of system-on-chips (SoCs) and a rise in costs for heat removal. In this work......, we analyze the impact of power dissipation in floating-point (FP) units and we consider different alternatives in the implementation of FP-division that lead to substantial energy savings. We compare the implementation of division in a Fused Multiply-Add (FMA) unit based on the Newton...

  1. The Multi-Core Era - Trends and Challenges

    CERN Document Server

    Tröger, Peter

    2008-01-01

    Since the very beginning of hardware development, computer processors were invented with ever-increasing clock frequencies and sophisticated in-build optimization strategies. Due to physical limitations, this 'free lunch' of speedup has come to an end. The following article gives a summary and bibliography for recent trends and challenges in CMP architectures. It discusses how 40 years of parallel computing research need to be considered in the upcoming multi-core era. We argue that future research must be driven from two sides - a better expression of hardware structures, and a domain-specific understanding of software parallelism.

  2. Synthetic Aperture Sequential Beamforming implemented on multi-core platforms

    DEFF Research Database (Denmark)

    Kjeldsen, Thomas; Lassen, Lee; Hemmsen, Martin Christian

    2014-01-01

    This paper compares several computational ap- proaches to Synthetic Aperture Sequential Beamforming (SASB) targeting consumer level parallel processors such as multi-core CPUs and GPUs. The proposed implementations demonstrate that ultrasound imaging using SASB can be executed in real- time...... with a significant headroom for post-processing. The CPU implementations are optimized using Single Instruction Multiple Data (SIMD) instruction extensions and multithreading, and the GPU computations are performed using the APIs, OpenCL and OpenGL. The implementations include refocusing (dynamic focusing) of a set...

  3. Note: High resolution ultra fast high-power pulse generator for inductive load using digital signal processor.

    Science.gov (United States)

    Flaxer, Eli

    2014-08-01

    We present a new design of a compact, ultra-fast, high-resolution, high-power pulse generator for inductive loads, using a power MOSFET, a dedicated gate driver and a digital signal controller. This design is an improved circuit of our previous controller. We demonstrate the performance of this pulse generator as a driver for a new generation of high-pressure supersonic pulsed valves.

  4. Decimal Engine for Energy-Efficient Multicore Processors

    DEFF Research Database (Denmark)

    Nannarelli, Alberto

    2014-01-01

    propose a hybrid BFP/DFP engine to perform BFP division and DFP addition, multiplication and division. The main purpose of this engine is to offload the binary floating-point units for this type of operations and reduce the latency for decimal operations, and power and temperature for the whole die....

  5. A full parallel radix sorting algorithm for multicore processors

    OpenAIRE

    Maus, Arne

    2011-01-01

    The problem addressed in this paper is that we want to sort an integer array a [] of length n on a multi core machine with k cores. Amdahl’s law tells us that the inherent sequential part of any algorithm will in the end dominate and limit the speedup we get from parallelisation of that algorithm. This paper introduces PARL, a parallel left radix sorting algorithm for use on ordinary shared memory multi core machines, that has just one simple statement in its sequential part. It can be seen a...
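    The structure of such an algorithm can be sketched in a few lines. This is a generic sketch, not the paper's PARL (and CPython threads will not give true core parallelism for CPU-bound work), but it shows the decomposition: a single sequential scatter by the leftmost digit, then fully independent per-bucket sorts.

```python
from concurrent.futures import ThreadPoolExecutor

def radix_sort(a, digit_bits=8):
    """Sequential LSD radix sort for non-negative integers."""
    if not a:
        return a
    mask = (1 << digit_bits) - 1
    shift, max_val = 0, max(a)
    while max_val >> shift:
        buckets = [[] for _ in range(1 << digit_bits)]
        for x in a:
            buckets[(x >> shift) & mask].append(x)
        a = [x for b in buckets for x in b]
        shift += digit_bits
    return a

def parallel_radix_sort(a, workers=4, digit_bits=8):
    """Scatter by most-significant digit (the one sequential pass),
    then sort the resulting buckets concurrently and concatenate."""
    if not a:
        return a
    top_shift = max(max(a).bit_length() - digit_bits, 0)
    buckets = [[] for _ in range(1 << digit_bits)]
    for x in a:
        buckets[x >> top_shift].append(x)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return [x for b in pool.map(radix_sort, buckets) for x in b]
```

    Because the buckets are already ordered by their most-significant digit, concatenating the independently sorted buckets yields a globally sorted array.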

  6. Scalable Parallelization of Skyline Computation for Multi-core Processors

    DEFF Research Database (Denmark)

    Chester, Sean; Sidlauskas, Darius; Assent, Ira;

    2015-01-01

    , which is used to minimize dominance tests while maintaining high throughput. The algorithm uses an efficiently-updatable data structure over the shared, global skyline, based on point-based partitioning. Also, we release a large benchmark of optimized skyline algorithms, with which we demonstrate...

  7. Multi-core Architectures and Streaming Applications

    NARCIS (Netherlands)

    Smit, Gerard J.M.; Kokkeler, André B.J.; Wolkotte, Pascal T.; Burgwal, van de Marcel D.; Mandoiu, I.; Kennings, A.

    2008-01-01

    In this paper we focus on algorithms and reconfigurable multi-core architectures for streaming digital signal processing (DSP) applications. The multi-core concept has a number of advantages: (1) depending on the requirements more or fewer cores can be switched on/off, (2) the multi-core structure f

  8. Multicore-Optimized Wavefront Diamond Blocking for Optimizing Stencil Updates

    KAUST Repository

    Malas, T.

    2015-07-02

    The importance of stencil-based algorithms in computational science has focused attention on optimized parallel implementations for multilevel cache-based processors. Temporal blocking schemes leverage the large bandwidth and low latency of caches to accelerate stencil updates and approach theoretical peak performance. A key ingredient is the reduction of data traffic across slow data paths, especially the main memory interface. In this work we combine the ideas of multicore wavefront temporal blocking and diamond tiling to arrive at stencil update schemes that show large reductions in memory pressure compared to existing approaches. The resulting schemes show performance advantages in bandwidth-starved situations, which are exacerbated by the high bytes per lattice update case of variable coefficients. Our thread groups concept provides a controllable trade-off between concurrency and memory usage, shifting the pressure between the memory interface and the CPU. We present performance results on a contemporary Intel processor.
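    The temporal-blocking idea (several time steps per sweep through memory) can be shown with a deliberately simple 1D sketch. This uses overlapped tiles with redundant halo computation rather than the wavefront diamond scheme described above, but it demonstrates the invariant that matters: after t local updates, a tile's interior matches t global sweeps exactly, while each element is streamed only once per t steps.

```python
def step(a):
    # one sweep of a 3-point averaging stencil; boundary cells held fixed
    return ([a[0]]
            + [(a[i - 1] + a[i] + a[i + 1]) / 3.0 for i in range(1, len(a) - 1)]
            + [a[-1]])

def naive(a, t):
    # t full sweeps: reads and writes the whole array every time step
    for _ in range(t):
        a = step(a)
    return a

def blocked(a, t, tile=16):
    # overlapped temporal blocking: load each tile plus a halo of width t,
    # advance it t steps locally, then keep only the uncorrupted interior
    n = len(a)
    out = [0.0] * n
    for s in range(0, n, tile):
        lo, hi = max(s - t, 0), min(s + tile + t, n)
        local = a[lo:hi]
        for _ in range(t):
            local = step(local)
        e = min(s + tile, n)
        out[s:e] = local[s - lo:e - lo]
    return out
```

    The halo width equals t because a boundary error propagates inward one cell per step; the diamond/wavefront schemes in the paper avoid this redundant halo work while keeping the same traffic reduction.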

  9. Understanding and Mitigating Multicore Performance Issues on theAMD Opteron Architecture

    Energy Technology Data Exchange (ETDEWEB)

    Levesque, John; Larkin, Jeff; Foster, Martyn; Glenski, Joe; Geissler, Garry; Whalen, Stephen; Waldecker, Brian; Carter, Jonathan; Skinner, David; He, Helen; Wasserman, Harvey; Shalf, John; Shan,Hongzhang; Strohmaier, Erich

    2007-03-07

    Over the past 15 years, microprocessor performance has doubled approximately every 18 months through increased clock rates and processing efficiency. In the past few years, clock frequency growth has stalled, and microprocessor manufacturers such as AMD have moved towards doubling the number of cores every 18 months in order to maintain historical growth rates in chip performance. This document investigates the ramifications of multicore processor technology on the new Cray XT4 systems based on AMD processor technology. We begin by walking through the AMD single-core, dual-core and upcoming quad-core processor architectures. This is followed by a discussion of methods for collecting performance counter data to understand code performance on the Cray XT3 and XT4 systems. We then use the performance counter data to analyze the impact of multicore processors on the performance of microbenchmarks such as STREAM, application kernels such as the NAS Parallel Benchmarks, and full application codes that comprise the NERSC-5 SSP benchmark suite. We explore compiler options and software optimization techniques that can mitigate the memory bandwidth contention that can reduce computing efficiency on multicore processors. The last section provides a case study of applying the dual-core optimizations to the NAS Parallel Benchmarks to dramatically improve their performance.

  10. Using Multicore Technologies to Speed Up Complex Simulations of Population Evolution

    Directory of Open Access Journals (Sweden)

    Mauricio Guevara-Souza

    2013-01-01

    Full Text Available We explore the use of multicore processing technologies for conducting simulations of population replacement of disease vectors. In our model, a native population of simulated vectors is inoculated with a small exogenous population of vectors that have been infected with the Wolbachia bacteria, which confers immunity to the disease. We conducted a series of computational simulations to study the conditions required by the invading population to take over the native population. Given the computational burden of this study, we decided to take advantage of modern multicore processor technologies to reduce the time required for the simulations. Overall, the results seem promising both in terms of the application and the use of multicore technologies.

  11. Optimization of automatically generated multi-core code for the LTE RACH-PD algorithm

    CERN Document Server

    Pelcat, Maxime; Nezan, Jean François

    2008-01-01

    Embedded real-time applications in communication systems require high processing power. Manual scheduling developed for single-processor applications is not suited to multi-core architectures. The Algorithm Architecture Matching (AAM) methodology optimizes static application implementation on multi-core architectures. The Random Access Channel Preamble Detection (RACH-PD) is an algorithm for non-synchronized access of Long Term Evolution (LTE) wireless networks. LTE aims to improve the spectral efficiency of the next generation cellular system. This paper describes a complete methodology for implementing the RACH-PD. AAM prototyping is applied to the RACH-PD, which is modelled as a Synchronous DataFlow graph (SDF). An efficient implementation of the algorithm onto a multi-core DSP, the TI C6487, is then explained. Benchmarks for the solution are given.

  12. A Heterogeneous Multi-core Architecture with a Hardware Kernel for Control Systems

    DEFF Research Database (Denmark)

    Li, Gang; Guan, Wei; Sierszecki, Krzysztof

    2012-01-01

    . This paper presents a multi-core architecture incorporating a hardware kernel on FPGAs, intended for high performance applications in control engineering domain. First, the hardware kernel is investigated on the basis of a component-based real-time kernel HARTEX (Hard Real-Time Executive for Control Systems......). Second, a heterogeneous multi-core architecture is investigated, focusing on its performance in relation to hard real-time constraints and predictable behavior. Third, the hardware implementation of HARTEX is designated to support the heterogeneous multi-core architecture. This hardware kernel has......Rapid industrialisation has resulted in a demand for improved embedded control systems with features such as predictability, high processing performance and low power consumption. Software kernel implementation on a single processor is becoming more difficult to satisfy those constraints...

  13. Dealing with BIG Data - Exploiting the Potential of Multicore Parallelism and Auto-Tuning

    CERN Document Server

    CERN. Geneva

    2012-01-01

    Physics experiments nowadays produce tremendous amounts of data that require sophisticated analyses in order to gain new insights. At such large scale, scientists are facing non-trivial software engineering problems in addition to the physics problems. Ubiquitous multicore processors and GPGPUs have turned almost any computer into a parallel machine and have pushed compute clusters and clouds to become multicore-based and more heterogenous. These developments complicate the exploitation of various types of parallelism within different layers of hardware and software. As a consequence, manual performance tuning is non-intuitive and tedious due to the large search space spanned by numerous inter-related tuning parameters. This talk addresses these challenges at CERN and discusses how to leverage multicore parallelization techniques in this context. It presents recent advances in automatic performance tuning to algorithmically find sweet spots with good performance. The talk also presents results from empiri...

  14. MPI-hybrid Parallelism for Volume Rendering on Large, Multi-core Systems

    Energy Technology Data Exchange (ETDEWEB)

    Howison, Mark; Bethel, E. Wes; Childs, Hank

    2010-03-20

    This work studies the performance and scalability characteristics of "hybrid" parallel programming and execution as applied to raycasting volume rendering -- a staple visualization algorithm -- on a large, multi-core platform. Historically, the Message Passing Interface (MPI) has become the de-facto standard for parallel programming and execution on modern parallel systems. As the computing industry trends towards multi-core processors, with four- and six-core chips common today and 128-core chips coming soon, we wish to better understand how algorithmic and parallel programming choices impact performance and scalability on large, distributed-memory multi-core systems. Our findings indicate that the hybrid-parallel implementation, at levels of concurrency ranging from 1,728 to 216,000, performs better, uses a smaller absolute memory footprint, and consumes less communication bandwidth than the traditional, MPI-only implementation.
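    The two-level decomposition behind hybrid parallelism can be sketched structurally. Since MPI ranks cannot be spawned in a short snippet, the outer thread pool below merely stands in for the distributed-memory level; in a real hybrid code the outer partitioning would map to MPI ranks, and only the inner level would use shared-memory threads.

```python
from concurrent.futures import ThreadPoolExecutor

def thread_partial(bounds):
    # innermost unit of work: one thread's slice of the reduction
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

def rank_work(bounds, n_threads=2):
    # inner level: shared-memory threads within one "rank"
    lo, hi = bounds
    step = max((hi - lo) // n_threads, 1)
    chunks = [(i, min(i + step, hi)) for i in range(lo, hi, step)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return sum(pool.map(thread_partial, chunks))

def hybrid_sum_squares(n, n_ranks=2):
    # outer level: "ranks" (threads standing in for MPI processes here)
    step = n // n_ranks
    slices = [(r * step, n if r == n_ranks - 1 else (r + 1) * step)
              for r in range(n_ranks)]
    with ThreadPoolExecutor(max_workers=n_ranks) as pool:
        return sum(pool.map(rank_work, slices))
```

    The memory-footprint advantage reported above comes from the inner level sharing one address space per node instead of replicating per-rank state.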

  15. On Designing Multicore-aware Simulators for Biological Systems

    CERN Document Server

    Aldinucci, Marco; Damiani, Ferruccio; Drocco, Maurizio; Torquati, Massimo; Troina, Angelo

    2010-01-01

    The stochastic simulation of biological systems is an increasingly popular technique in bioinformatics. It is often an enlightening technique, which may however be computationally expensive. We discuss the main opportunities to speed it up on multi-core platforms, which pose new challenges for parallelisation techniques. These opportunities are developed in two general families of solutions involving both the single simulation and a bulk of independent simulations (either replicas or runs derived from a parameter sweep). Proposed solutions are tested on the parallelisation of the CWC simulator (Calculus of Wrapped Compartments), which is carried out according to the proposed solutions by way of the FastFlow programming framework, making fast development and efficient execution on multi-cores possible.
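    The "bulk of independent simulations" case is embarrassingly parallel, as this sketch shows with a stand-in stochastic model (a toy birth-death walk, not CWC/FastFlow): each replica gets its own seeded generator, so results are reproducible regardless of scheduling.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def simulate(seed, steps=1000):
    """Stand-in stochastic simulation: a toy birth-death walk."""
    rng = random.Random(seed)          # private generator per replica
    pop = 100
    for _ in range(steps):
        pop += 1 if rng.random() < 0.5 else -1
        if pop <= 0:                   # extinction ends the replica early
            break
    return pop

def run_replicas(n_replicas, workers=4):
    # one independent task per replica; no shared state between them
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(simulate, range(n_replicas)))
```

    A parameter sweep follows the same pattern, mapping over (seed, parameter) pairs instead of seeds alone.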

  16. Fast decision algorithms in low-power embedded processors for quality-of-service based connectivity of mobile sensors in heterogeneous wireless sensor networks.

    Science.gov (United States)

    Jaraíz-Simón, María D; Gómez-Pulido, Juan A; Vega-Rodríguez, Miguel A; Sánchez-Pérez, Juan M

    2012-01-01

    When a mobile wireless sensor is moving along heterogeneous wireless sensor networks, it can be under the coverage of more than one network many times. In these situations, the Vertical Handoff process can happen, where the mobile sensor decides to change its connection from a network to the best network among the available ones according to their quality of service characteristics. A fitness function is used for the handoff decision, being desirable to minimize it. This is an optimization problem which consists of the adjustment of a set of weights for the quality of service. Solving this problem efficiently is relevant to heterogeneous wireless sensor networks in many advanced applications. Numerous works can be found in the literature dealing with the vertical handoff decision, although they all suffer from the same shortfall: a non-comparable efficiency. Therefore, the aim of this work is twofold: first, to develop a fast decision algorithm that explores the entire space of possible combinations of weights, searching for the one that minimizes the fitness function; and second, to design and implement a system on chip architecture based on reconfigurable hardware and embedded processors to achieve several goals necessary for competitive mobile terminals: good performance, low power consumption, low economic cost, and small area integration.
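    The exhaustive weight scan can be sketched as follows. The fitness function here is a hypothetical weighted sum of per-network QoS costs (the paper defines its own function and attributes, not reproduced here); the search simply enumerates every grid combination of weights that sums to one and keeps the minimizer.

```python
from itertools import product

def fitness(weights, qos_costs):
    # hypothetical fitness: weighted sum of normalized QoS attribute costs
    return sum(w * q for w, q in zip(weights, qos_costs))

def best_weights(qos_costs, step=0.1):
    """Scan every grid combination of non-negative weights summing to 1
    and return (best_fitness, best_weight_vector)."""
    n = len(qos_costs)
    levels = [round(i * step, 10) for i in range(int(round(1 / step)) + 1)]
    best = None
    for combo in product(levels, repeat=n):
        if abs(sum(combo) - 1.0) > 1e-9:
            continue                   # keep only valid weight vectors
        f = fitness(combo, qos_costs)
        if best is None or f < best[0]:
            best = (f, combo)
    return best
```

    With three attributes and a 0.1 grid the space has only 11³ points, which is why an exhaustive scan is feasible in embedded hardware.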

  17. Fast Decision Algorithms in Low-Power Embedded Processors for Quality-of-Service Based Connectivity of Mobile Sensors in Heterogeneous Wireless Sensor Networks

    Directory of Open Access Journals (Sweden)

    Juan M. Sánchez-Pérez

    2012-02-01

    Full Text Available When a mobile wireless sensor is moving along heterogeneous wireless sensor networks, it can be under the coverage of more than one network many times. In these situations, the Vertical Handoff process can happen, where the mobile sensor decides to change its connection from a network to the best network among the available ones according to their quality of service characteristics. A fitness function is used for the handoff decision, being desirable to minimize it. This is an optimization problem which consists of the adjustment of a set of weights for the quality of service. Solving this problem efficiently is relevant to heterogeneous wireless sensor networks in many advanced applications. Numerous works can be found in the literature dealing with the vertical handoff decision, although they all suffer from the same shortfall: a non-comparable efficiency. Therefore, the aim of this work is twofold: first, to develop a fast decision algorithm that explores the entire space of possible combinations of weights, searching for the one that minimizes the fitness function; and second, to design and implement a system on chip architecture based on reconfigurable hardware and embedded processors to achieve several goals necessary for competitive mobile terminals: good performance, low power consumption, low economic cost, and small area integration.

  18. Fast and Accurate Processor Hybrid Model Based on ESL

    Institute of Scientific and Technical Information of China (English)

    鲁超; 魏继增; 常轶松

    2012-01-01

    Because an RTL design cannot meet the simulation-speed requirements of System-on-Chip (SoC) development, this paper presents a fast and accurate processor hybrid model based on the Electronic System Level (ESL). Using the ESL design methodology, the proprietary 32-bit embedded microprocessor C*CORE340 is taken as the target, and the Instruction Set Simulator (ISS) and the cache are constructed at different abstraction layers. Experimental results show that the simulation speed of the hybrid model is at least 10 times faster than that of the RTL model, with a simulation accuracy error rate below 10%.

  19. Electronic Structure Calculations and Adaptation Scheme in Multi-core Computing Environments

    Energy Technology Data Exchange (ETDEWEB)

    Seshagiri, Lakshminarasimhan; Sosonkina, Masha; Zhang, Zhao

    2009-05-20

    Multi-core processing environments have become the norm in generic computing and are being considered for adding an extra dimension to the execution of any application. The T2 Niagara processor is a unique environment, consisting of eight cores each capable of running eight threads simultaneously. Applications like the General Atomic and Molecular Electronic Structure System (GAMESS), used for ab initio molecular quantum chemistry calculations, can be good indicators of the performance of such machines and a guideline for both hardware designers and application programmers. In this paper we benchmark GAMESS performance on a T2 Niagara processor for a couple of molecules. We also show the suitability of using a middleware-based adaptation algorithm with GAMESS in such a multi-core environment.

  20. LDRD final report : a lightweight operating system for multi-core capability class supercomputers.

    Energy Technology Data Exchange (ETDEWEB)

    Kelly, Suzanne Marie; Hudson, Trammell B. (OS Research); Ferreira, Kurt Brian; Bridges, Patrick G. (University of New Mexico); Pedretti, Kevin Thomas Tauke; Levenhagen, Michael J.; Brightwell, Ronald Brian

    2010-09-01

    The two primary objectives of this LDRD project were to create a lightweight kernel (LWK) operating system(OS) designed to take maximum advantage of multi-core processors, and to leverage the virtualization capabilities in modern multi-core processors to create a more flexible and adaptable LWK environment. The most significant technical accomplishments of this project were the development of the Kitten lightweight kernel, the co-development of the SMARTMAP intra-node memory mapping technique, and the development and demonstration of a scalable virtualization environment for HPC. Each of these topics is presented in this report by the inclusion of a published or submitted research paper. The results of this project are being leveraged by several ongoing and new research projects.

  1. PERFORMANCE EVALUATION OF DIRECT PROCESSOR ACCESS FOR NON DEDICATED SERVER

    Directory of Open Access Journals (Sweden)

    P. S. BALAMURUGAN

    2010-10-01

    Full Text Available The objective of this paper is to design a coprocessor for a desktop machine that enables the machine to act as a non-dedicated server: the coprocessor acts as the server processor while the multi-core processor serves as the desktop processor. With this methodology, a client machine can act simultaneously as a non-dedicated server and a client. Such machines can be used in autonomous networks. This design leads to a cost-effective machine that can act in parallel as a non-dedicated server and a client, or be switched to act as either client or server.

  2. High performance parallelism pearls 2 multicore and many-core programming approaches

    CERN Document Server

    Jeffers, Jim

    2015-01-01

    High Performance Parallelism Pearls Volume 2 offers another set of examples that demonstrate how to leverage parallelism. Similar to Volume 1, the techniques included here explain how to use processors and coprocessors with the same programming - illustrating the most effective ways to combine Xeon Phi coprocessors with Xeon and other multicore processors. The book includes examples of successful programming efforts, drawn from across industries and domains such as biomed, genetics, finance, manufacturing, imaging, and more. Each chapter in this edited work includes detailed explanations of t

  3. Pseudo-Random Number Generators for Vector Processors and Multicore Processors

    DEFF Research Database (Denmark)

    Fog, Agner

    2015-01-01

    Large scale Monte Carlo applications need a good pseudo-random number generator capable of utilizing both the vector processing capabilities and multiprocessing capabilities of modern computers in order to get the maximum performance. The requirements for such a generator are discussed. New ways...

  4. A multi-core CPU pipeline architecture for virtual environments.

    Science.gov (United States)

    Acosta, Eric; Liu, Alan; Sieck, Jennifer; Muniz, Gilbert; Bowyer, Mark; Armonda, Rocco

    2009-01-01

    Physically-based virtual environments (VEs) provide realistic interactions and behaviors for computer-based medical simulations. Limited CPU resources have traditionally forced VEs to be simplified for real-time performance. Multi-core processors greatly increase the computational capacity of computers and are quickly becoming standard. However, developing non-application specific methods to fully utilize all available CPU cores for processing VEs is difficult. The paper describes a pipeline VE architecture designed for multi-core CPU systems. The architecture enables development of VEs that leverage the computational resources of all CPU cores for VE simulation. A VE's workload is dynamically distributed across the available CPU cores. A VE can be developed once and scale efficiently with the number of cores. The described pipeline architecture makes it possible to develop complex physically-based VEs for medical simulations. Initial results for a craniotomy simulator being developed have shown super-linear and near-linear speedups when tested with up to four cores.
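The craniotomy simulator's code is not public; as a generic illustration of the dynamic-distribution idea — worker threads pulling a simulation step's work items from a shared queue so load balances itself across however many cores the pool is sized for — a sketch with illustrative names might look like:

```python
# Hypothetical sketch of dynamically distributing one VE step's workload.
import threading
import queue

def update_element(x):
    # Stand-in for one physically-based update on a VE element.
    return x * 0.5 + 1.0

def simulate_step(state, n_workers=4):
    """Run one simulation step, dynamically distributing elements."""
    work = queue.Queue()
    for idx, val in enumerate(state):
        work.put((idx, val))
    out = [None] * len(state)

    def worker():
        while True:
            try:
                idx, val = work.get_nowait()  # pull next available work item
            except queue.Empty:
                return
            out[idx] = update_element(val)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```

In CPython, true multi-core speedup would require processes or a GIL-releasing compute kernel; the sketch only shows the dynamic-distribution structure, which is what lets a VE "be developed once and scale with the number of cores."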

  5. Efficient Aho-Corasick String Matching on Emerging Multicore Architectures

    Energy Technology Data Exchange (ETDEWEB)

    Tumeo, Antonino; Villa, Oreste; Secchi, Simone; Chavarría-Miranda, Daniel

    2013-12-12

    String matching algorithms are critical to several scientific fields. Besides text processing and databases, emerging applications such as DNA and protein sequence analysis, data mining, information security software, antivirus, and machine learning all exploit string matching algorithms [3]. All these applications usually process large quantities of textual data and require high performance and/or predictable execution times. Among all the string matching algorithms, one of the most studied, especially for text processing and security applications, is the Aho-Corasick algorithm. Aho-Corasick is an exact, multi-pattern string matching algorithm which performs the search in time linearly proportional to the length of the input text, independently of pattern set size. However, depending on the implementation, when the number of patterns increases, memory occupation may rise drastically. In turn, this can lead to significant variability in performance, due to memory access times and caching effects. This is a significant concern for many mission-critical applications and modern high performance architectures. For example, security applications such as Network Intrusion Detection Systems (NIDS) must be able to scan network traffic against very large dictionaries in real time. Modern Ethernet links reach up to 10 Gbps, and malicious threats already number well over 1 million, growing exponentially [28]. When performing the search, a NIDS should not slow down the network, or let network packets pass unchecked. Nevertheless, on current state-of-the-art cache-based processors, there may be large performance variability when dealing with big dictionaries and inputs that have different frequencies of matching patterns. In particular, when few patterns are matched and they are all in the cache, the procedure is fast. Instead, when they are not in the cache, often because many patterns are matched and the caches are
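To make the algorithm's properties concrete — search time linear in the text length, memory that grows with the pattern set — here is a minimal textbook Aho-Corasick automaton (a standard formulation, not the chapter's optimized implementation):

```python
# Minimal Aho-Corasick: trie + failure links built by BFS, then a single
# linear pass over the text that reports every (position, pattern) match.
from collections import deque

def build_automaton(patterns):
    goto = [{}]   # goto[state] maps a character to the next state
    fail = [0]    # failure link per state
    out = [[]]    # patterns recognized when reaching each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({}); fail.append(0); out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    bfs = deque(goto[0].values())        # depth-1 states keep fail = 0
    while bfs:
        s = bfs.popleft()
        for ch, nxt in goto[s].items():
            bfs.append(nxt)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] += out[fail[nxt]]   # inherit matches ending here
    return goto, fail, out

def search(text, automaton):
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]          # follow failure links on mismatch
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits
```

Each text character advances the automaton exactly once (failure-link hops amortize out), which is the linear-time guarantee; the per-state dictionaries are also where the memory footprint — and hence the cache behavior the chapter analyzes — comes from.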

  6. Hybrid programming model for implicit PDE simulations on multicore architectures

    KAUST Repository

    Kaushik, Dinesh K.

    2011-01-01

    The complexity of programming modern multicore processor based clusters is rapidly rising, with GPUs adding further demand for fine-grained parallelism. This paper analyzes the performance of the hybrid (MPI+OpenMP) programming model in the context of an implicit unstructured mesh CFD code. At the implementation level, the effects of cache locality, update management, work division, and synchronization frequency are studied. The hybrid model presents interesting algorithmic opportunities as well: the linear system solver converges more quickly than in the pure MPI case, since the parallel preconditioner stays stronger when the hybrid model is used. This implies significant savings in the cost of communication and synchronization (explicit and implicit). Even though OpenMP-based parallelism is easier to implement (within a subdomain assigned to one MPI process for simplicity), getting good performance requires attention to data partitioning issues similar to those in the message-passing case. © 2011 Springer-Verlag.

  7. Developing Parallel Application on Multi-core Mobile Phone

    Directory of Open Access Journals (Sweden)

    Dhuha Basheer Abdullah

    2013-12-01

    Full Text Available One cannot imagine daily life today without mobile devices such as mobile phones or PDAs. They tend to become your mobile computer, offering all the features one might need on the way. As a result, devices are less expensive and include a huge number of high-end technological components, which also makes them attractive for scientific research. Today, multi-core mobile phones are taking all the attention. Relying on the principles of task and data parallelism, we propose in this paper a real-time mobile lane departure warning system (M-LDWS) based on a carefully designed parallel programming framework on a quad-core mobile phone, and show how to increase processor utilization to improve the system's runtime.

  8. Energy-aware Thread and Data Management in Heterogeneous Multi-core, Multi-memory Systems

    Energy Technology Data Exchange (ETDEWEB)

    Su, Chun-Yi [Virginia Polytechnic Inst. and State Univ. (Virginia Tech), Blacksburg, VA (United States)

    2014-12-16

    By 2004, microprocessor design focused on multicore scaling—increasing the number of cores per die in each generation—as the primary strategy for improving performance. These multicore processors typically equip multiple memory subsystems to improve data throughput. In addition, these systems employ heterogeneous processors such as GPUs and heterogeneous memories like non-volatile memory to improve performance, capacity, and energy efficiency. With the increasing volume of hardware resources and system complexity caused by heterogeneity, future systems will require intelligent ways to manage hardware resources. Early research to improve performance and energy efficiency on heterogeneous, multi-core, multi-memory systems focused on tuning a single primitive or at best a few primitives in the systems. The key limitation of past efforts is their lack of a holistic approach to resource management that balances the tradeoff between performance and energy consumption. In addition, the shift from simple, homogeneous systems to these heterogeneous, multicore, multi-memory systems requires in-depth understanding of efficient resource management for scalable execution, including new models that capture the interchange between performance and energy, smarter resource management strategies, and novel low-level performance/energy tuning primitives and runtime systems. Tuning an application to control available resources efficiently has become a daunting challenge; managing resources in automation is still a dark art since the tradeoffs among programming, energy, and performance remain insufficiently understood. In this dissertation, I have developed theories, models, and resource management techniques to enable energy-efficient execution of parallel applications through thread and data management in these heterogeneous multi-core, multi-memory systems. I study the effect of dynamic concurrent throttling on the performance and energy of multi-core, non-uniform memory access

  9. A Fault Detection Mechanism in a Data-flow Scheduled Multithreaded Processor

    NARCIS (Netherlands)

    Fu, J.; Yang, Q.; Poss, R.; Jesshope, C.R.; Zhang, C.

    2014-01-01

    This paper designs and implements Redundant Multi-Threading (RMT) in a Data-flow scheduled MultiThreaded (DMT) multicore processor, called Data-flow scheduled Redundant Multi-Threading (DRMT). It also presents Asynchronous Output Comparison (AOC), which allows RMT techniques to avoid fault detection
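DRMT is a hardware design, but the redundant-execution-plus-asynchronous-comparison idea can be sketched in software: two copies of a computation run concurrently, and a comparator consumes both output streams, flagging any mismatch as a detected fault. This is an illustrative analogy (all names invented), not the paper's mechanism:

```python
# Software analogy of redundant multithreading with output comparison:
# two threads compute the same stream; the comparator pairs their outputs.
import threading
import queue

def redundant_run(fn, inputs, fault=None):
    """Run fn twice over inputs; return indices where the copies disagree."""
    q0, q1 = queue.Queue(), queue.Queue()

    def worker(out_q, inject):
        for i, x in enumerate(inputs):
            y = fn(x)
            if inject is not None and i == inject:
                y = y + 1  # simulated transient fault in one copy
            out_q.put(y)

    t0 = threading.Thread(target=worker, args=(q0, fault))
    t1 = threading.Thread(target=worker, args=(q1, None))
    t0.start(); t1.start()

    mismatches = []
    for i in range(len(inputs)):
        a, b = q0.get(), q1.get()  # comparator blocks until both copies emit
        if a != b:
            mismatches.append(i)
    t0.join(); t1.join()
    return mismatches
```

The comparison is asynchronous in the sense that neither copy stalls for the other at each step; outputs are buffered in queues and checked as they arrive, which is the latency-hiding benefit AOC targets.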

  10. Multicore in production: advantages and limits of the multiprocess approach in the ATLAS experiment

    Science.gov (United States)

    Binet, S.; Calafiura, P.; Jha, M. K.; Lavrijsen, W.; Leggett, C.; Lesny, D.; Severini, H.; Smith, D.; Snyder, S.; Tatarkhanov, M.; Tsulaia, V.; VanGemmeren, P.; Washbrook, A.

    2012-06-01

    The shared memory architecture of multicore CPUs provides HEP developers with the opportunity to reduce the memory footprint of their applications by sharing memory pages between the cores in a processor. ATLAS pioneered the multi-process approach to parallelize HEP applications. Using Linux fork() and the Copy On Write mechanism we implemented a simple event task farm, which allowed us to achieve sharing of almost 80% of memory pages among event worker processes for certain types of reconstruction jobs with negligible CPU overhead. By leaving the task of managing shared memory pages to the operating system, we have been able to parallelize large reconstruction and simulation applications originally written to be run in a single thread of execution with little to no change to the application code. The process of validating AthenaMP for production took ten months of concentrated effort and is expected to continue for several more months. Besides validating the software itself, an important and time-consuming aspect of running multicore applications in production was to configure the ATLAS distributed production system to handle multicore jobs. This entailed defining multicore batch queues, where the unit resource is not a core, but a whole computing node; monitoring the output of many event workers; and adapting the job definition layer to handle computing resources with different event throughputs. We will present scalability and memory usage studies, based on data gathered both on dedicated hardware and at the CERN Computer Center.
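Setting AthenaMP's actual sources aside, the fork()-and-Copy-On-Write task-farm pattern the abstract describes can be sketched in a few lines (POSIX-only; `run_task_farm` and the round-robin event split are illustrative, not ATLAS code):

```python
# Sketch of a fork()-based event task farm: the parent initializes large
# read-only state, then forks workers that share its pages copy-on-write,
# so only pages a worker modifies cost additional memory.
import os

def run_task_farm(n_workers, events, process_event):
    """Fork n_workers; worker w handles events[w::n_workers]."""
    pids = []
    for w in range(n_workers):
        pid = os.fork()
        if pid == 0:
            # Child: inherited pages (e.g. geometry, conditions data)
            # remain shared with the parent until written.
            for ev in events[w::n_workers]:
                process_event(ev)
            os._exit(0)
        pids.append(pid)
    for pid in pids:
        os.waitpid(pid, 0)  # reap all event workers
```

Because the operating system manages page sharing, a single-threaded application can be parallelized this way with little to no change to its code — which is the point the abstract makes about the ~80% page sharing achieved in reconstruction jobs.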

  11. StochKit-FF: Efficient Systems Biology on Multicore Architectures

    CERN Document Server

    Aldinucci, Marco; Liò, Pietro; Sorathiya, Anil; Torquati, Massimo

    2010-01-01

    The stochastic modelling of biological systems is an informative and, in some cases, very adequate technique, which may however be more expensive than other modelling approaches, such as differential equations. We present StochKit-FF, a parallel version of StochKit, a reference toolkit for stochastic simulations. StochKit-FF is based on the FastFlow programming toolkit for multicores and exploits the novel concept of selective memory. We evaluate StochKit-FF on a model of HIV infection dynamics, with the aim of extracting information from efficiently run experiments, here in terms of average and variance and, in the longer term, of more structured data.
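StochKit-FF itself is C++ on FastFlow; as a language-neutral sketch of what "many stochastic trajectories reduced to average and variance" means, here is a Gillespie simulation of a simple birth-death process with the per-run results combined the way a parallel reduction would combine them. The model and its parameters are illustrative, not the HIV model from the paper:

```python
# Gillespie stochastic simulation of a birth-death process, with an
# ensemble reduction to mean and variance of the final population.
import random

def gillespie_birth_death(x0, birth, death, t_end, rng):
    """Simulate X(t): birth at rate `birth`, death at rate `death * x`."""
    x, t = x0, 0.0
    while t < t_end:
        a_birth = birth
        a_death = death * x
        a_total = a_birth + a_death
        if a_total == 0:
            break
        t += rng.expovariate(a_total)         # time to the next reaction
        if t >= t_end:
            break
        if rng.random() * a_total < a_birth:  # choose which reaction fires
            x += 1
        else:
            x -= 1
    return x

def ensemble_stats(n_runs, seed=0):
    """Run n_runs independent trajectories and reduce to (mean, variance)."""
    rng = random.Random(seed)
    finals = [gillespie_birth_death(10, 5.0, 0.5, 2.0, rng)
              for _ in range(n_runs)]
    mean = sum(finals) / n_runs
    var = sum((f - mean) ** 2 for f in finals) / n_runs
    return mean, var
```

In a multicore setting each trajectory is independent, so runs can be farmed out to workers and only the small (mean, variance) accumulators need to be merged — the kind of reduction StochKit-FF's selective memory supports.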

  12. Robust motion estimation on a low-power multi-core DSP

    Science.gov (United States)

    Igual, Francisco D.; Botella, Guillermo; García, Carlos; Prieto, Manuel; Tirado, Francisco

    2013-12-01

    This paper addresses the efficient implementation of a robust gradient-based optical flow model in a low-power platform based on a multi-core digital signal processor (DSP). The aim of this work was to carry out a feasibility study on the use of these devices in autonomous systems such as robot navigation, biomedical assistance, or tracking, with not only power restrictions but also real-time requirements. We consider the C6678 DSP from Texas Instruments (Dallas, TX, USA) as the target platform of our implementation. The interest of this research is particularly relevant in optical flow scope because this system can be considered as an alternative solution for mid-range video resolutions when a combination of in-processor parallelism with optimizations such as efficient memory-hierarchy exploitation and multi-processor parallelization are applied.

  13. Cluster Algorithm Special Purpose Processor

    Science.gov (United States)

    Talapov, A. L.; Shchur, L. N.; Andreichenko, V. B.; Dotsenko, Vl. S.

    We describe a Special Purpose Processor, realizing the Wolff algorithm in hardware, which is fast enough to study the critical behaviour of 2D Ising-like systems containing more than one million spins. The processor has been checked to produce correct results for a pure Ising model and for Ising model with random bonds. Its data also agree with the Nishimori exact results for spin glass. Only minor changes of the SPP design are necessary to increase the dimensionality and to take into account more complex systems such as Potts models.
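The SPP realizes the Wolff algorithm in hardware; for reference, here is a compact software version of the same single-cluster update for a 2D Ising model with periodic boundaries (a standard textbook formulation, not the SPP design):

```python
# One Wolff single-cluster update for the 2D ferromagnetic Ising model.
import math
import random

def wolff_step(spins, L, beta, rng):
    """Grow one Wolff cluster from a random seed and flip it.

    spins: dict mapping (i, j) -> +1/-1 on an L x L periodic lattice.
    Returns the cluster size.
    """
    p_add = 1.0 - math.exp(-2.0 * beta)   # bond-activation probability
    seed = (rng.randrange(L), rng.randrange(L))
    s = spins[seed]
    cluster = {seed}
    stack = [seed]
    while stack:
        i, j = stack.pop()
        for n in (((i + 1) % L, j), ((i - 1) % L, j),
                  (i, (j + 1) % L), (i, (j - 1) % L)):
            # Add aligned neighbors to the cluster with probability p_add.
            if n not in cluster and spins[n] == s and rng.random() < p_add:
                cluster.add(n)
                stack.append(n)
    for site in cluster:
        spins[site] = -s                  # flip the whole cluster at once
    return len(cluster)
```

Flipping whole clusters rather than single spins is what defeats critical slowing down, which is why a hardware realization of this update can probe the critical behaviour of million-spin systems efficiently.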

  14. Cluster algorithm special purpose processor

    Energy Technology Data Exchange (ETDEWEB)

    Talapov, A.L.; Shchur, L.N.; Andreichenko, V.B.; Dotsenko, V.S. (Landau Inst. for Theoretical Physics, GSP-1 117940 Moscow V-334 (USSR))

    1992-08-10

    In this paper, the authors describe a Special Purpose Processor, realizing the Wolff algorithm in hardware, which is fast enough to study the critical behaviour of 2D Ising-like systems containing more than one million spins. The processor has been checked to produce correct results for a pure Ising model and for Ising model with random bonds. Its data also agree with the Nishimori exact results for spin glass. Only minor changes of the SPP design are necessary to increase the dimensionality and to take into account more complex systems such as Potts models.

  15. Accelerating 3D Elastic Wave Equations on Knights Landing based Intel Xeon Phi processors

    Science.gov (United States)

    Sourouri, Mohammed; Birger Raknes, Espen

    2017-04-01

    In advanced imaging methods like reverse-time migration (RTM) and full waveform inversion (FWI), the elastic wave equation (EWE) is numerically solved many times to create the seismic image or the elastic parameter model update. Thus, it is essential to optimize the solution time for solving the EWE, as this has a major impact on the total computational cost of running RTM or FWI. From a computational point of view, applications implementing EWEs face two major challenges. The first is the amount of memory-bound computation involved; the second is the execution of such computations over very large datasets. So far, multi-core processors have not been able to tackle these two challenges, which eventually led to the adoption of accelerators such as Graphics Processing Units (GPUs). Compared to conventional CPUs, GPUs are densely populated with many floating-point units and fast memory, a type of architecture that has proven to map well to many scientific computations. Despite their architectural advantages, full-scale adoption of accelerators has yet to materialize. First, accelerators require a significant programming effort imposed by programming models such as CUDA or OpenCL. Second, accelerators come with a limited amount of memory and require explicit data transfers between the CPU and the accelerator over the slow PCI bus. The second generation of the Xeon Phi processor, based on the Knights Landing (KNL) architecture, promises the computational capabilities of an accelerator but requires the same programming effort as traditional multi-core processors. The high computational performance is realized through many integrated cores (the number of cores, tiles, and memory varies with the model) organized in tiles that are connected via a 2D mesh-based interconnect. Contrary to accelerators, KNL is a self-hosted system, meaning explicit data transfers over the PCI bus are no longer required. However, like most

  16. Cache Energy Optimization Techniques For Modern Processors

    Energy Technology Data Exchange (ETDEWEB)

    Mittal, Sparsh [ORNL

    2013-01-01

    Modern multicore processors employ large last-level caches; for example, Intel's E7-8800 processor uses a 24 MB L3 cache. Further, with each CMOS technology generation, leakage energy has been increasing dramatically, and hence leakage is expected to become a major source of energy dissipation, especially in last-level caches (LLCs). Conventional cache energy saving schemes either aim at saving dynamic energy or are based on properties specific to first-level caches, and thus have limited utility for last-level caches. Further, several other techniques require offline profiling or per-application tuning and hence are not suitable for production systems. In this book, we present novel cache leakage energy saving schemes for single-core and multicore systems; desktop, QoS, real-time, and server systems. We also present cache energy saving techniques for caches designed with both conventional SRAM devices and emerging non-volatile devices such as STT-RAM (spin-transfer torque RAM). We present software-controlled, hardware-assisted techniques that use dynamic cache reconfiguration to configure the cache to the most energy-efficient configuration while keeping the performance loss bounded. To profile and test a large number of potential configurations, we utilize low-overhead micro-architectural components which can be easily integrated into modern processor chips. We adopt a system-wide approach to saving energy, ensuring that cache reconfiguration does not increase the energy consumption of other components of the processor. We have compared our techniques with the state of the art and found that ours outperform it in terms of energy efficiency and other relevant metrics. The techniques presented in this book have important applications in improving the energy efficiency of higher-end embedded, desktop, QoS, real-time, and server processors and multitasking systems. This book is intended to be a valuable guide for both

  17. Research of real-time wavefront reconstruction and control based on multi-core DSP

    Science.gov (United States)

    Wang, Zhui; Zhou, Luchun; Li, Mei; Zhang, Haotian

    2014-09-01

    In an Adaptive Optics system, the Real-Time Processor is as important as the human brain. Processing latency is a key index of Real-Time Processors. In this paper, we propose a new processing method that significantly reduces processing latency by combining multi-core parallel processing in both space and time. In addition, by comparing the operating speed of the CPU with the I/O speed of memory, we propose a reasonable memory allocation scheme. Experimental results show a processing latency of 59.7 us per frame using the multi-core DSP TMS320C6678 as the processing platform. The experiment was conducted on a system with 968 sub-apertures and 913 actuators.

  18. Streamline Integration using MPI-Hybrid Parallelism on a Large Multi-Core Architecture

    Energy Technology Data Exchange (ETDEWEB)

    Camp, David; Garth, Christoph; Childs, Hank; Pugmire, Dave; Joy, Kenneth I.

    2010-11-01

    Streamline computation in a very large vector field data set represents a significant challenge due to the non-local and data-dependent nature of streamline integration. In this paper, we conduct a study of the performance characteristics of hybrid parallel programming and execution as applied to streamline integration on a large, multi-core platform. With multi-core processors now prevalent in clusters and supercomputers, there is a need to understand the impact of these hybrid systems in order to make the best implementation choice. We use two MPI-based distribution approaches based on established parallelization paradigms, parallelize-over-seeds and parallelize-over-blocks, and present a novel MPI-hybrid algorithm for each approach to compute streamlines. Our findings indicate that the work sharing between cores in the proposed MPI-hybrid parallel implementation results in much improved performance and consumes less communication and I/O bandwidth than a traditional, non-hybrid distributed implementation.

  19. Streamline integration using MPI-hybrid parallelism on a large multicore architecture.

    Science.gov (United States)

    Camp, David; Garth, Christoph; Childs, Hank; Pugmire, Dave; Joy, Kenneth I

    2011-11-01

    Streamline computation in a very large vector field data set represents a significant challenge due to the nonlocal and data-dependent nature of streamline integration. In this paper, we conduct a study of the performance characteristics of hybrid parallel programming and execution as applied to streamline integration on a large, multicore platform. With multicore processors now prevalent in clusters and supercomputers, there is a need to understand the impact of these hybrid systems in order to make the best implementation choice. We use two MPI-based distribution approaches based on established parallelization paradigms, parallelize over seeds and parallelize over blocks, and present a novel MPI-hybrid algorithm for each approach to compute streamlines. Our findings indicate that the work sharing between cores in the proposed MPI-hybrid parallel implementation results in much improved performance and consumes less communication and I/O bandwidth than a traditional, nonhybrid distributed implementation.
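The parallelize-over-seeds approach can be illustrated with a toy tracer: each task integrates one complete streamline through a shared vector field, so seeds are embarrassingly parallel as long as the field is globally accessible. Here an analytic rotation field stands in for the large data set, and a thread pool stands in for MPI ranks; all names are illustrative:

```python
# Toy parallelize-over-seeds streamline tracer: RK4 integration of each
# seed's full streamline, with seeds distributed across a worker pool.
import math
from concurrent.futures import ThreadPoolExecutor

def velocity(p):
    # Analytic stand-in for the vector field: rigid rotation about the origin.
    x, y = p
    return (-y, x)

def rk4_step(p, h):
    def add(a, b, s):
        return (a[0] + s * b[0], a[1] + s * b[1])
    k1 = velocity(p)
    k2 = velocity(add(p, k1, h / 2))
    k3 = velocity(add(p, k2, h / 2))
    k4 = velocity(add(p, k3, h))
    return (p[0] + h / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]),
            p[1] + h / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]))

def streamline(seed, steps=100, h=0.05):
    pts = [seed]
    for _ in range(steps):
        pts.append(rk4_step(pts[-1], h))
    return pts

def trace_all(seeds, n_workers=4):
    """Parallelize-over-seeds: each task integrates one full streamline."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(streamline, seeds))
```

The paper's hard part, absent from this toy, is that a real field is too large for every worker to hold: parallelize-over-seeds must fetch blocks on demand as a streamline wanders, while parallelize-over-blocks instead hands the particle off between block owners, which is the trade-off the hybrid algorithms address.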

  20. The Coupling Waves of Multicore-Fiber

    Institute of Scientific and Technical Information of China (English)

    2003-01-01

    Multicore fiber as a gain medium for fiber lasers is introduced. The cores are coupled via evanescent waves. Analysis of the coupling waves agrees with numerical simulations and experimental results.

  1. Multi-core fiber undersea transmission systems

    DEFF Research Database (Denmark)

    Nooruzzaman, Md; Morioka, Toshio

    2017-01-01

    Various potential architectures of branching units for multi-core fiber undersea transmission systems are presented. It is also investigated how different architectures of branching unit influence the number of fibers and those of inline components.

  2. The Bulk Multicore Architecture for Improved Programmability

    Science.gov (United States)

    2009-12-01

    algorithm, forcing the same order of chunk commits as in the recording step. This design, which we call PicoLog, typically incurs a performance cost... PicoLog. Data-race detection at production-run speed: the Bulk Multicore can support an efficient data-race detector based on the "happens-before..." relation. [Figure: the Bulk Multicore (a), with a possible OrderOnly execution log (b) and a PicoLog execution log (c).] Contributed articles, December 2009, Vol. 52.

  3. Static and Dynamic Frequency Scaling on Multicore CPUs

    Energy Technology Data Exchange (ETDEWEB)

    Bao, Wenlei [The Ohio State University, Columbus, Ohio; Hong, Changwan [The Ohio State University, Columbus, Ohio; Chunduri, Sudheer [IBM Research India, S. Cass Avenue Lemont, IL; Krishnamoorthy, Sriram [Pacific Northwest National Laboratory, Richland, WA; Pouchet, Louis-Noël [Colorado State University, Fort Collins, CO; Rastello, Fabrice [University Grenoble Alpes, Grenoble France; Sadayappan, P. [The Ohio State University, Columbus, Ohio

    2016-12-28

    Dynamic voltage and frequency scaling (DVFS) adapts CPU power consumption by modifying a processor's operating frequency (and the associated voltage). Typical approaches employing DVFS involve default strategies such as running at the lowest or the highest frequency, or observing the CPU's runtime behavior and dynamically adapting the voltage/frequency configuration based on CPU usage. In this paper, we argue that many previous approaches suffer from inherent limitations, such as not accounting for the processor-specific impact of frequency changes on energy for different workload types. We first propose a lightweight runtime-based approach to automatically adapt the frequency based on the CPU workload that is agnostic of the processor characteristics. We then show that further improvements can be achieved for affine kernels in the application, using a compile-time characterization instead of run-time monitoring to select the frequency and the number of CPU cores to use. Our framework relies on a one-time energy characterization of CPU-specific DVFS profiles followed by a compile-time categorization of loop-based code segments in the application. These are combined to determine a priori the frequency and the number of cores to use to execute the application so as to optimize energy or energy-delay product, outperforming the runtime approach. Extensive evaluation on 60 benchmarks and five multi-core CPUs shows that our approach systematically outperforms the powersave Linux governor, while improving overall performance.
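The paper's characterization data is not public; the following only sketches the shape of a workload-aware frequency policy — clock down memory-bound intervals, where higher frequency buys little, and clock up compute-bound ones. The thresholds and frequency table are invented for illustration:

```python
# Illustrative workload-aware DVFS policy: map simple performance-counter
# signals (IPC, memory-stall fraction) to a target frequency.
FREQS_MHZ = [1200, 1800, 2400, 3000]  # hypothetical available P-states

def pick_frequency(instr_per_cycle, mem_stall_fraction):
    """Memory-bound intervals gain little from high frequency: clock down."""
    if mem_stall_fraction > 0.6:
        return FREQS_MHZ[0]          # heavily memory-bound: lowest frequency
    if mem_stall_fraction > 0.3:
        return FREQS_MHZ[1]          # moderately memory-bound
    if instr_per_cycle > 1.5:
        return FREQS_MHZ[3]          # compute-bound: highest frequency
    return FREQS_MHZ[2]
```

A runtime governor would evaluate such a rule per interval; the paper's contribution is replacing the runtime monitoring with a compile-time categorization of affine loop nests, so the frequency and core count can be fixed before the kernel even starts.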

  4. Design for Testability Features of Godson-3 Multicore Microprocessor

    Institute of Scientific and Technical Information of China (English)

    Zi-Chu Qi; Hui Liu; Xiang-Ku Li; Wei-Wu Hu

    2011-01-01

    This paper describes the design-for-testability (DFT) challenges and techniques of the Godson-3 microprocessor, a scalable multicore processor based on the scalable mesh of crossbar (SMOC) on-chip network that targets high-end applications. Advanced techniques are adopted to make the DFT design scalable and to achieve low-power, low-cost test with limited IO resources. For scalable and flexible test access, a highly elaborate test access mechanism (TAM) is implemented to support multiple test instructions and test modes. Taking advantage of the multiple identical cores embedded in the processor, scan partitioning and on-chip comparison are employed to reduce test power and test time. Test compression is also utilized to decrease test time. To further reduce test power, clock control logic is designed with the ability to turn off the clocks of partitions not under test. In addition, scan collars for the caches are designed to perform functional test with low-speed ATE for speed-binning purposes, with low complexity and good correlation results.

  5. Fast Method for Compression in MPEG Audio Layer III by Digital Signal Processor

    Institute of Scientific and Technical Information of China (English)

    窦维蓓; 阳学仕; 董在望

    1999-01-01

    The MPEG Audio Layer III compression algorithm, specified by the ISO 11172-3 standard, is an efficient, high-fidelity audio coding algorithm. Because the Layer III algorithm is highly complex and computationally demanding, this paper proposes acceleration techniques for the key computations of the Layer III compression algorithm on a Digital Signal Processor (DSP) for real-time applications.

  6. Safety-critical Java on a time-predictable processor

    DEFF Research Database (Denmark)

    Korsholm, Stephan E.; Schoeberl, Martin; Puffitsch, Wolfgang

    2015-01-01

    For real-time systems the whole execution stack needs to be time-predictable and analyzable for the worst-case execution time (WCET). This paper presents a time-predictable platform for safety-critical Java. The platform consists of (1) the Patmos processor, which is a time-predictable processor; (2) a C compiler for Patmos with support for WCET analysis; (3) the HVM, which is a Java-to-C compiler; (4) the HVM-SCJ implementation, which supports SCJ Levels 0, 1, and 2 (for both single and multicore platforms); and (5) a WCET analysis tool. We show that real-time Java programs translated to C and compiled to a Patmos binary can be analyzed by the AbsInt aiT WCET analysis tool. To the best of our knowledge the presented system is the second WCET-analyzable real-time Java system, and the first one on top of a RISC processor.

  7. Safety-Critical Java on a Time-predictable Processor

    DEFF Research Database (Denmark)

    Korsholm, Stephan Erbs; Schoeberl, Martin; Puffitsch, Wolfgang

    2015-01-01

    For real-time systems the whole execution stack needs to be time-predictable and analyzable for the worst-case execution time (WCET). This paper presents a time-predictable platform for safety-critical Java. The platform consists of (1) the Patmos processor, which is a time-predictable processor; (2) a C compiler for Patmos with support for WCET analysis; (3) the HVM, which is a Java-to-C compiler; (4) the HVM-SCJ implementation, which supports SCJ Levels 0, 1, and 2 (for both single and multicore platforms); and (5) a WCET analysis tool. We show that real-time Java programs translated to C and compiled to a Patmos binary can be analyzed by the AbsInt aiT WCET analysis tool. To the best of our knowledge the presented system is the second WCET-analyzable real-time Java system, and the first one on top of a RISC processor.

  8. Solving Matrix Equations on Multi-Core and Many-Core Architectures

    Directory of Open Access Journals (Sweden)

    Peter Benner

    2013-11-01

    Full Text Available We address the numerical solution of Lyapunov, algebraic and differential Riccati equations, via the matrix sign function, on platforms equipped with general-purpose multicore processors and, optionally, one or more graphics processing units (GPUs. In particular, we review the solvers for these equations, as well as the underlying methods, analyze their concurrency and scalability and provide details on their parallel implementation. Our experimental results show that this class of hardware provides sufficient computational power to tackle large-scale problems, which only a few years ago would have required a cluster of computers.
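
    The matrix sign function at the heart of these solvers admits a very compact statement: Newton's iteration X_{k+1} = (X_k + X_k^{-1})/2 converges quadratically to sign(A) for any A with no purely imaginary eigenvalues. A dependency-free Python sketch, with a Gauss-Jordan inverse standing in for the BLAS/LAPACK and GPU kernels the paper actually benchmarks (the test matrix is illustrative):

```python
# Newton iteration for the matrix sign function:
#     X_{k+1} = (X_k + X_k^{-1}) / 2,   X_0 = A
# Production solvers use a scaled variant on top of tuned BLAS/GPU kernels;
# this pure-Python version only illustrates the numerical scheme.

def mat_inv(a):
    """Gauss-Jordan inverse of a small dense matrix (list of lists)."""
    n = len(a)
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))  # partial pivoting
        aug[col], aug[piv] = aug[piv], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col]:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def sign_newton(a, iters=50):
    """Iterate X <- (X + X^{-1}) / 2 until (effectively) converged."""
    x = [row[:] for row in a]
    for _ in range(iters):
        xinv = mat_inv(x)
        x = [[(x[i][j] + xinv[i][j]) / 2 for j in range(len(x))]
             for i in range(len(x))]
    return x

# sign() of a matrix with eigenvalues 3 and -2 maps them to +1 and -1:
S = sign_newton([[3.0, 0.0], [1.0, -2.0]])
```

A useful sanity check is that sign(A) is involutory: S @ S should equal the identity to machine precision.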

  9. Evaluating the scalability of HEP software and multi-core hardware

    CERN Document Server

    Jarp, S; Leduc, J; Nowak, A

    2011-01-01

    As researchers have reached the practical limits of processor performance improvements by frequency scaling, it is clear that the future of computing lies in the effective utilization of parallel and multi-core architectures. Since this significant change in computing is well underway, it is vital for HEP programmers to understand the scalability of their software on modern hardware and the opportunities for potential improvements. This work aims to quantify the benefit of new mainstream architectures to the HEP community through practical benchmarking on recent hardware solutions, including the usage of parallelized HEP applications.

  10. Exploiting multicore compute resources in the CMS experiment

    Science.gov (United States)

    Ramírez, J. E.; Pérez-Calero Yzquierdo, A.; Hernández, J. M.; CMS Collaboration

    2016-10-01

    CMS has developed a strategy to efficiently exploit the multicore architecture of the compute resources accessible to the experiment. A coherent use of the multiple cores available in a compute node yields substantial gains in terms of resource utilization. The implemented approach makes use of the multithreading support of the event processing framework and the multicore scheduling capabilities of the resource provisioning system. Multicore slots are acquired and provisioned by means of multicore pilot agents which internally schedule and execute single and multicore payloads. Multicore scheduling and multithreaded processing are currently used in production for online event selection and prompt data reconstruction. More workflows are being adapted to run in multicore mode. This paper presents a review of the experience gained in the deployment and operation of the multicore scheduling and processing system, the current status and future plans.

  11. Compiling the functional data-parallel language SaC for Microgrids of Self-Adaptive Virtual Processors

    NARCIS (Netherlands)

    Grelck, C.; Herhut, S.; Jesshope, C.; Joslin, C.; Lankamp, M.; Scholz, S.-B.; Shafarenko, A.

    2009-01-01

    We present preliminary results from compiling the high-level, functional and data-parallel programming language SaC into a novel multi-core design: Microgrids of Self-Adaptive Virtual Processors (SVPs). The side-effect free nature of SaC in conjunction with its data-parallel foundation make it an id

  12. Fault Tolerance Middleware for a Multi-Core System

    Science.gov (United States)

    Some, Raphael R.; Springer, Paul L.; Zima, Hans P.; James, Mark; Wagner, David A.

    2012-01-01

    Fault Tolerance Middleware (FTM) provides a framework to run on a dedicated core of a multi-core system and handles detection of single-event upsets (SEUs), and the responses to those SEUs, occurring in an application running on multiple cores of the processor. This software was written expressly for a multi-core system and can support different kinds of fault strategies, such as introspection, algorithm-based fault tolerance (ABFT), and triple modular redundancy (TMR). It focuses on providing fault tolerance for the application code, and represents the first step in a plan to eventually include fault tolerance in message passing and the FTM itself. In the multi-core system, the FTM resides on a single, dedicated core, separate from the cores used by the application. This is done in order to isolate the FTM from application faults and to allow it to swap out any application core for a substitute. The structure of the FTM consists of an interface to a fault tolerant strategy module, a responder module, a fault manager module, an error factory, and an error mapper that determines the severity of the error. In the present reference implementation, the only fault tolerant strategy implemented is introspection. The introspection code waits for an application node to send an error notification to it. It then uses the error factory to create an error object, and at this time, a severity level is assigned to the error. The introspection code uses its built-in knowledge base to generate a recommended response to the error. Responses might include ignoring the error, logging it, rolling back the application to a previously saved checkpoint, swapping in a new node to replace a bad one, or restarting the application. The original error and recommended response are passed to the top-level fault manager module, which invokes the response. The responder module also notifies the introspection module of the generated response. This provides additional information to the
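
    The introspection flow described above (error notification, error factory, severity assignment, knowledge-base response) can be sketched in a few lines. All names, severity levels, and responses below are illustrative placeholders, not the actual FTM interfaces:

```python
# Toy sketch of the FTM introspection path: notification -> error object
# with severity -> recommended response. The tables are invented examples.
SEVERITY = {"seu_corrected": 1, "seu_uncorrected": 2, "node_dead": 3}

RESPONSES = {1: "log", 2: "rollback_to_checkpoint", 3: "swap_in_spare_node"}

def make_error(kind, core):
    """Error factory: build an error object and assign its severity."""
    return {"kind": kind, "core": core, "severity": SEVERITY[kind]}

def recommend(error):
    """Knowledge base: map severity to a recovery action for the responder."""
    return RESPONSES[error["severity"]]

err = make_error("seu_uncorrected", core=5)
action = recommend(err)
```

In the real middleware the recommended response is passed to a top-level fault manager that invokes it; here the mapping alone conveys the structure.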

  13. Multicore-based 3D-DWT video encoder

    Science.gov (United States)

    Galiano, Vicente; López-Granado, Otoniel; Malumbres, Manuel P.; Migallón, Hector

    2013-12-01

    Three-dimensional wavelet transform (3D-DWT) encoders are good candidates for applications like professional video editing, video surveillance, multi-spectral satellite imaging, etc. where a frame must be reconstructed as quickly as possible. In this paper, we present a new 3D-DWT video encoder based on a fast run-length coding engine. Furthermore, we present several multicore optimizations to speed-up the 3D-DWT computation. An exhaustive evaluation of the proposed encoder (3D-GOP-RL) has been performed, and we have compared the evaluation results with other video encoders in terms of rate/distortion (R/D), coding/decoding delay, and memory consumption. Results show that the proposed encoder obtains good R/D results for high-resolution video sequences with nearly in-place computation using only the memory needed to store a group of pictures. After applying the multicore optimization strategies over the 3D DWT, the proposed encoder is able to compress a full high-definition video sequence in real-time.
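
    The run-length coding engine itself is not spelled out in the abstract, but the generic idea for zero-dominated wavelet subbands is easy to sketch: emit (zero_run, value) pairs instead of raw coefficients. A minimal, hypothetical encoder/decoder pair:

```python
def rle_encode(coeffs):
    """Run-length code a stream of quantized wavelet coefficients:
    emit (zero_run_length, value) pairs; None marks trailing zeros."""
    out, run = [], 0
    for c in coeffs:
        if c == 0:
            run += 1
        else:
            out.append((run, c))
            run = 0
    if run:
        out.append((run, None))
    return out

def rle_decode(pairs):
    """Invert rle_encode: expand each zero run, then append the value."""
    out = []
    for run, val in pairs:
        out.extend([0] * run)
        if val is not None:
            out.append(val)
    return out

data = [0, 0, 5, 0, 0, 0, -3, 7, 0]
assert rle_decode(rle_encode(data)) == data
```

Because each subband (or group of pictures) can be coded independently, this stage parallelizes naturally across cores, which is what the multicore optimizations exploit.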

  14. Multicore Processing and ARTEMIS - An incentive to develop the European Multiprocessor research

    DEFF Research Database (Denmark)

    Seceleanu, Tiberius; Tenhunen, Hannu; Jerraya, Ahmed

    2006-01-01

    Even though multiprocessor architectures have been developed for a long time now, the approach was mostly focused on multi-chip realizations. Clustering computers or microprocessors on the same board was the solution to manage complex applications vs. performance requirements. It is only...... in the recent period that technological advances allow for a change of this paradigm towards on-chip distributed platforms, or multi-core, or multi-processor system-on-chip (MPSoC). A multiprocessor architecture may be defined as: on-chip clusters of heterogeneous functionality modules, cooperating...... to a traditional SoC view, concurrency at all levels plays a deterministic role, while problems such as power consumption, addressable separately in the nodes of a DS, must be considered in a unitary manner. Thus, distinct research and development issues must be defined for MPSoC, building on the indispensable experience...

  15. Highly scalable linear solvers on thousands of processors.

    Energy Technology Data Exchange (ETDEWEB)

    Domino, Stefan Paul (Sandia National Laboratories, Albuquerque, NM); Karlin, Ian (University of Colorado at Boulder, Boulder, CO); Siefert, Christopher (Sandia National Laboratories, Albuquerque, NM); Hu, Jonathan Joseph; Robinson, Allen Conrad (Sandia National Laboratories, Albuquerque, NM); Tuminaro, Raymond Stephen

    2009-09-01

    In this report we summarize research into new parallel algebraic multigrid (AMG) methods. We first provide an introduction to parallel AMG. We then discuss our research in parallel AMG algorithms for very large scale platforms. We detail significant improvements to a matrix-matrix multiplication kernel in the AMG setup phase. We present a smoothed aggregation AMG algorithm with fewer communication synchronization points, and discuss its links to domain decomposition methods. Finally, we discuss a multigrid smoothing technique that utilizes two message passing layers for use on multicore processors.

  16. High performance deformable image registration algorithms for manycore processors

    CERN Document Server

    Shackleford, James; Sharp, Gregory

    2013-01-01

    High Performance Deformable Image Registration Algorithms for Manycore Processors develops highly data-parallel image registration algorithms suitable for use on modern multi-core architectures, including graphics processing units (GPUs). Focusing on deformable registration, we show how to develop data-parallel versions of the registration algorithm suitable for execution on the GPU. Image registration is the process of aligning two or more images into a common coordinate frame and is a fundamental step to be able to compare or fuse data obtained from different sensor measurements. E

  17. Median and Morphological Specialized Processors for a Real-Time Image Data Processing

    Directory of Open Access Journals (Sweden)

    Kazimierz Wiatr

    2002-01-01

    Full Text Available This paper presents considerations on selecting a multiprocessor MISD architecture for fast implementation of vision image processing. Drawing on the author's earlier experience with real-time systems, the implementation of specialized hardware processors based on programmable FPGA devices in a pipeline architecture is proposed. In particular, the following processors are presented: a median filter and a morphological processor. The structure of a universal reconfigurable processor is proposed as well. Experimental results are presented as delays at the LCA implementation level for the median filter, morphological processor, convolution processor, look-up-table processor, logic processor and histogram processor. These times are compared with the delays of a general-purpose processor and a DSP processor.
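
    The median filter stage is simple to state in software; the FPGA processor computes the same 3x3 window median with a pipelined sorting network. A plain-Python reference sketch (interior pixels only, for illustration):

```python
def median3x3(img):
    """3x3 median filter over the interior of an image given as a list of
    rows. Border pixels are copied unchanged; the median is the 5th of the
    9 sorted window values."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = sorted(img[y + dy][x + dx]
                            for dy in (-1, 0, 1) for dx in (-1, 0, 1))
            out[y][x] = window[4]
    return out

# A single impulse ("salt" noise) is removed while the background survives:
noisy = [[0, 0, 0], [0, 255, 0], [0, 0, 0]]
clean = median3x3(noisy)
```

The hardware version replaces the `sorted` call with a fixed network of compare-exchange elements, which is why it sustains one pixel per clock.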

  18. Markov Decision Process Based Energy-Efficient On-Line Scheduling for Slice-Parallel Video Decoders on Multicore Systems

    CERN Document Server

    Mastronarde, Nicholas; Atienza, David; Frossard, Pascal; van der Schaar, Mihaela

    2011-01-01

    We consider the problem of energy-efficient on-line scheduling for slice-parallel video decoders on multicore systems. We assume that each of the processors are Dynamic Voltage Frequency Scaling (DVFS) enabled such that they can independently trade off performance for power, while taking the video decoding workload into account. In the past, scheduling and DVFS policies in multi-core systems have been formulated heuristically due to the inherent complexity of the on-line multicore scheduling problem. The key contribution of this report is that we rigorously formulate the problem as a Markov decision process (MDP), which simultaneously takes into account the on-line scheduling and per-core DVFS capabilities; the separate power consumption of the processor cores and caches; and the loss tolerant and dynamic nature of the video decoder's traffic. In particular, we model the video traffic using a Direct Acyclic Graph (DAG) to capture the precedence constraints among frames in a Group of Pictures (GOP) structure, ...

  19. Extending Automatic Parallelization to Optimize High-Level Abstractions for Multicore

    Energy Technology Data Exchange (ETDEWEB)

    Liao, C; Quinlan, D J; Willcock, J J; Panas, T

    2008-12-12

    Automatic introduction of OpenMP for sequential applications has attracted significant attention recently because of the proliferation of multicore processors and the simplicity of using OpenMP to express parallelism for shared-memory systems. However, most previous research has only focused on C and Fortran applications operating on primitive data types. C++ applications using high-level abstractions, such as STL containers and complex user-defined types, are largely ignored due to the lack of research compilers that are readily able to recognize high-level object-oriented abstractions and leverage their associated semantics. In this paper, we automatically parallelize C++ applications using ROSE, a multiple-language source-to-source compiler infrastructure which preserves the high-level abstractions and gives us access to their semantics. Several representative parallelization candidate kernels are used to explore semantic-aware parallelization strategies for high-level abstractions, combined with extended compiler analyses. Those kernels include an array-based computation loop, a loop with task-level parallelism, and a domain-specific tree traversal. Our work extends the applicability of automatic parallelization to modern applications using high-level abstractions and exposes more opportunities to take advantage of multicore processors.

  20. Optimization of the coherence function estimation for multi-core central processing unit

    Science.gov (United States)

    Cheremnov, A. G.; Faerman, V. A.; Avramchuk, V. S.

    2017-02-01

    The paper considers the use of parallel processing on a multi-core central processing unit for optimization of the coherence function evaluation arising in digital signal processing. The coherence function, along with other methods of spectral analysis, is commonly used for vibration diagnosis of rotating machinery and its particular nodes. An algorithm is given for evaluating the function for signals represented by digital samples. The algorithm is analyzed for its software implementation and computational problems. Optimization measures are described, including algorithmic, architecture and compiler optimization, and their results are assessed for multi-core processors from different manufacturers. Thus, the speed-up of parallel execution with respect to sequential execution was studied, and results are presented for Intel Core i7-4720HQ and AMD FX-9590 processors. The results show comparatively high efficiency of the optimization measures taken. In particular, acceleration indicators and average CPU utilization have been significantly improved, showing a high degree of parallelism in the constructed calculation functions. The developed software underwent state registration and will be used as a part of a software and hardware solution for rotating machinery fault diagnosis and pipeline leak location with the acoustic correlation method.
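
    The quantity being optimized is the magnitude-squared coherence C_xy(f) = |Pxy(f)|^2 / (Pxx(f) Pyy(f)), with the spectra averaged over signal segments. A small pure-Python sketch (naive O(n^2) DFT instead of an FFT, non-overlapping rectangular segments) shows the structure; each frequency bin is independent, which is exactly what makes the evaluation parallelizable across cores:

```python
import cmath
import random

def dft(x):
    """Naive DFT, used here only to keep the sketch dependency-free."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]

def coherence(x, y, seg_len):
    """Magnitude-squared coherence |Pxy|^2 / (Pxx * Pyy), averaging
    periodograms over non-overlapping segments (Welch-style, no window)."""
    nseg = len(x) // seg_len
    pxx = [0.0] * seg_len
    pyy = [0.0] * seg_len
    pxy = [0j] * seg_len
    for s in range(nseg):
        X = dft(x[s * seg_len:(s + 1) * seg_len])
        Y = dft(y[s * seg_len:(s + 1) * seg_len])
        for f in range(seg_len):
            pxx[f] += abs(X[f]) ** 2
            pyy[f] += abs(Y[f]) ** 2
            pxy[f] += X[f] * Y[f].conjugate()
    # Tiny epsilon guards against division by zero in empty bins.
    return [abs(pxy[f]) ** 2 / (pxx[f] * pyy[f] + 1e-30) for f in range(seg_len)]

random.seed(1)
sig = [random.uniform(-1, 1) for _ in range(256)]
coh = coherence(sig, sig, 32)   # a signal is fully coherent with itself
```

A production version would use windowed, overlapped segments and an FFT, and would split the segment or frequency loops across threads or processes.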

  1. A Synthesizable Multicore Platform for Microwave Imaging

    DEFF Research Database (Denmark)

    Schleuniger, Pascal; Karlsson, Sven

    2014-01-01

    of one cycle per network hop and attain a high clock frequency by pipelining the feedback loop to manage contention. We implement a multicore configuration with 48 cores and achieve a clock frequency as high as 300 MHz with a peak switching data rate of 9.6 Gbits/s per link on state-of-the-art FPGAs....

  2. Reconfigurable Multicore Architectures for Streaming Applications

    NARCIS (Netherlands)

    Smit, Gerardus Johannes Maria; Kokkeler, Andre B.J.; Rauwerda, G.K.; Jacobs, J.W.M.; Nicolescu, G.; Mosterman, P.J.

    2009-01-01

    This chapter addresses reconfigurable heterogenous and homogeneous multicore system-on-chip (SoC) platforms for streaming digital signal processing applications, also called DSP applications. In streaming DSP applications, computations can be specified as a data flow graph with streams of data items

  3. Distributed radiofrequency signal processing using multicore fibers

    Science.gov (United States)

    Garcia, S.; Gasulla, I.

    2016-11-01

    Next generation fiber-wireless communication paradigms will require new technologies to address the current limitations to massive capacity, connectivity and flexibility. Multicore optical fibers, which were conceived for high-capacity digital communications, can bring numerous advantages to fiber-wireless radio access architectures. Besides radio over fiber parallel distribution and multiple antenna connectivity, multicore fibers can implement, at the same time, a variety of broadband processing functionalities for microwave and millimeter-wave signals. This approach leads to the novel concept of "fiber-distributed signal processing". In particular, we capitalize on the spatial parallelism inherent to multicore fibers to implement a broadband tunable true time delay line, which is the basis of multiple processing applications such as signal filtering, arbitrary waveform generation and squint-free radio beamsteering. We present the design of trench-assisted heterogeneous multicore fibers composed of cores featuring individual spectral group delays and chromatic dispersion profiles. Besides fulfilling the requirements for true time delay line operation, the MCFs are optimized in terms of higher-order dispersion, crosstalk and bend sensitivity. Microwave photonics signal processing will benefit from the performance stability, 2D operation versatility and compactness brought by the reported fiber-integrated solution.

  4. Multicore in Production: Advantages and Limits of the Multiprocess Approach

    CERN Document Server

    Binet, S; The ATLAS collaboration; Lavrijsen, W; Leggett, Ch; Lesny, D; Jha, M K; Severini, H; Smith, D; Snyder, S; Tatarkhanov, M; Tsulaia, V; van Gemmeren, P; Washbrook, A

    2011-01-01

    The shared memory architecture of multicore CPUs provides HENP developers with the opportunity to reduce the memory footprint of their applications by sharing memory pages between the cores in a processor. ATLAS pioneered the multi-process approach to parallelizing HENP applications. Using Linux fork() and the Copy On Write mechanism, we implemented a simple event task farm which allows sharing up to 50% of memory pages among event worker processes with negligible CPU overhead. By leaving the task of managing shared memory pages to the operating system, we have been able to run large reconstruction and simulation applications, originally written to be run in a single thread of execution, in parallel with little to no change to the application code. In spite of this, the process of validating athena multi-process for production took ten months of concentrated effort and is expected to continue for several more months. In general terms, we had two classes of problems in the multi-process port: merging the output fil...
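
    The fork()-based event task farm can be sketched in a few lines of POSIX Python. This is a generic illustration of the pattern (squaring an integer stands in for event reconstruction; chunking and pipe transport are invented details), not the ATLAS code: pages holding the parent's code, geometry and calibration data are shared copy-on-write with every worker.

```python
import os
import pickle

def process_event(e):
    # Stand-in for per-event reconstruction work.
    return e * e

def fork_farm(events, n_workers=4):
    """Minimal fork()/copy-on-write event task farm: each child processes a
    slice of the events and sends its results back through a pipe."""
    chunks = [events[i::n_workers] for i in range(n_workers)]
    pipes = []
    for chunk in chunks:
        r, w = os.pipe()
        pid = os.fork()
        if pid == 0:                     # child: work, report, exit
            os.close(r)
            os.write(w, pickle.dumps([process_event(e) for e in chunk]))
            os._exit(0)
        os.close(w)                      # parent keeps only the read end
        pipes.append((pid, r))
    results = []
    for pid, r in pipes:
        buf = b""
        while True:
            part = os.read(r, 65536)
            if not part:                 # EOF once the child has exited
                break
            buf += part
        os.close(r)
        os.waitpid(pid, 0)
        results.extend(pickle.loads(buf))
    return results

merged = sorted(fork_farm(list(range(100))))
```

Real deployments add checkpointing, output merging (the hard part noted above), and larger transport buffers, but the memory-sharing benefit comes entirely from fork() itself.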

  5. Scalable High-Performance Parallel Design for Network Intrusion Detection Systems on Many-Core Processors

    OpenAIRE

    Jiang, Hayang; Xie, Gaogang; Salamatian, Kavé; Mathy, Laurent

    2013-01-01

    Network Intrusion Detection Systems (NIDSes) face significant challenges coming from the relentless network link speed growth and increasing complexity of threats. Both hardware accelerated and parallel software-based NIDS solutions, based on commodity multi-core and GPU processors, have been proposed to overcome these challenges. ...

  6. Embedded Processor Laboratory

    Data.gov (United States)

    Federal Laboratory Consortium — The Embedded Processor Laboratory provides the means to design, develop, fabricate, and test embedded computers for missile guidance electronics systems in support...

  7. Broadband monitoring simulation with massively parallel processors

    Science.gov (United States)

    Trubetskov, Mikhail; Amotchkina, Tatiana; Tikhonravov, Alexander

    2011-09-01

    Modern efficient optimization techniques, namely needle optimization and gradual evolution, enable one to design optical coatings of any type. Even more, these techniques allow obtaining multiple solutions with close spectral characteristics. It is important, therefore, to develop software tools that allow one to choose a practically optimal solution from a wide variety of possible theoretical designs. A practically optimal solution provides the highest production yield when the optical coating is manufactured. Computational manufacturing is a low-cost tool for choosing a practically optimal solution. The theory of probability predicts that reliable production yield estimations require many hundreds or even thousands of computational manufacturing experiments. As a result, reliable estimation of the production yield may require too much computational time. The most time-consuming operation is calculation of the discrepancy function used by a broadband monitoring algorithm. This function is formed by a sum of terms over a wavelength grid. These terms can be computed simultaneously in different threads of computation, which opens great opportunities for parallelization. Multi-core and multi-processor systems can provide speed-ups of several times. Additional potential for further acceleration of computations is connected with using Graphics Processing Units (GPU). A modern GPU consists of hundreds of massively parallel processors and is capable of performing floating-point operations efficiently.
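
    Because the discrepancy function is a sum of independent per-wavelength terms, the parallelization is an embarrassingly parallel map-reduce. A sketch with a toy term (the transmittance model and wavelength grid are invented for illustration; CPython threads only illustrate the decomposition, since true multicore speed-up needs processes or GPU kernels):

```python
import math
from concurrent.futures import ThreadPoolExecutor

def term(w):
    """One per-wavelength term of the discrepancy function: squared
    difference between a toy 'measured' and 'computed' transmittance at
    wavelength w (nm). Real terms come from thin-film calculations."""
    measured = 0.9
    computed = 0.9 + 0.05 * math.sin(w / 50.0)
    return (measured - computed) ** 2

def discrepancy(wavelengths, n_workers=4):
    """Split the wavelength grid across workers and reduce with a sum.
    On a GPU, each term would map to one thread instead."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(term, wavelengths))

grid = range(400, 800)
parallel_sum = discrepancy(grid)
```

The same value is obtained sequentially, so correctness of the decomposition is trivial to verify, and the yield estimation loop simply repeats this for each simulated manufacturing run.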

  8. FAST

    DEFF Research Database (Denmark)

    Zuidmeer-Jongejan, Laurian; Fernandez-Rivas, Montserrat; Poulsen, Lars K.

    2012-01-01

    ABSTRACT: The FAST project (Food Allergy Specific Immunotherapy) aims at the development of safe and effective treatment of food allergies, targeting prevalent, persistent and severe allergy to fish and peach. Classical allergen-specific immunotherapy (SIT), using subcutaneous injections with aqueous food extracts, may be effective but has proven to be accompanied by too many anaphylactic side-effects. FAST aims to develop a safe alternative by replacing food extracts with hypoallergenic recombinant major allergens as the active ingredients of SIT. Both severe fish and peach allergy are caused...... in-depth serological and cellular immune analyses will be performed, allowing identification of novel biomarkers for monitoring treatment efficacy. FAST aims at improving the quality of life of food allergic patients by providing a safe and effective treatment that will significantly lower their threshold...

  9. The UA1 trigger processor

    CERN Document Server

    Grayer, G H

    1981-01-01

    Experiment UA1 is a large multipurpose spectrometer at the CERN proton-antiproton collider. The principal trigger is formed on the basis of the energy deposition in calorimeters. A trigger decision taken in under 2.4 microseconds can avoid dead-time losses due to the bunched nature of the beam. To achieve this, fast 8-bit charge-to-digital converters have been built, followed by two identical digital processors tailored to the experiment. The outputs of groups of the 2440 photomultipliers in the calorimeters are summed to form a total of 288 input channels to the ADCs. A look-up table in RAM is used to convert the digitised photomultiplier signals to energy in one processor, and to transverse energy in the other. Each processor forms four sums from a chosen combination of input channels, and also counts the number of clusters with electromagnetic or hadronic energy above pre-determined levels. Up to twelve combinations of these conditions, together with external information, may be combined in coincidence or in...
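
    The look-up-table arithmetic is the key trick: instead of multiplying in the trigger path, each 8-bit ADC count directly indexes a precomputed RAM table holding energy (one processor) or transverse energy E sin(theta) (the other). A sketch with invented calibration values; the gain and channel angle are illustrative, not the experiment's constants:

```python
import math

GAIN = 0.25   # hypothetical GeV per ADC count

def build_tables(theta):
    """Precompute the two 256-entry RAM tables for one channel at polar
    angle theta: energy and transverse energy per ADC count."""
    energy_lut = [GAIN * adc for adc in range(256)]
    et_lut = [GAIN * adc * math.sin(theta) for adc in range(256)]
    return energy_lut, et_lut

# One channel at 90 degrees; the trigger then just sums table look-ups.
e_lut, et_lut = build_tables(theta=math.pi / 2)
hits = [10, 200, 37]                        # ADC counts from three bunches
total_e = sum(e_lut[adc] for adc in hits)
total_et = sum(et_lut[adc] for adc in hits)
```

Because a table read plus an adder pipeline runs at fixed latency, this structure is what lets the decision complete within the 2.4 microsecond budget.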

  10. Array processors in chemistry

    Energy Technology Data Exchange (ETDEWEB)

    Ostlund, N.S.

    1980-01-01

    The field of attached scientific processors ("array processors") is surveyed, and an attempt is made to indicate their present and possible future use in computational chemistry. The current commercial products from Floating Point Systems, Inc., Datawest Corporation, and CSP, Inc. are discussed.

  11. Hardware/Software Co-design for Heterogeneous Multi-core Platforms: The hArtes Toolchain

    CERN Document Server

    2012-01-01

    This book describes the results and outcome of the FP6 project, known as hArtes, which focuses on the development of an integrated tool chain targeting a heterogeneous multi-core platform comprising a general-purpose processor (ARM or PowerPC), a DSP (the Diopsis) and an FPGA. The tool chain takes existing source code and proposes transformations and mappings such that legacy code can easily be ported to a modern, multi-core platform. Benefits of the hArtes approach, described in this book, include: Uses a familiar programming paradigm: hArtes proposes a familiar programming paradigm which is compatible with widely used programming practice, irrespective of the target platform. Enables users to view multiple cores as a single processor: the hArtes approach abstracts away the heterogeneity as well as the multi-core aspect of the underlying hardware, so the developer can view the platform as consisting of a single, general-purpose processor. Facilitates easy porting of existing applications: hArtes provid...

  12. Implementing High Performance Lexical Analyzer using CELL Broadband Engine Processor

    Directory of Open Access Journals (Sweden)

    P.J.SATHISH KUMAR

    2011-09-01

    Full Text Available The lexical analyzer is the first phase of the compiler and commonly the most time consuming. The compilation of large programs is still far from optimized in today's compilers. With modern processors moving more towards improving parallelization and multithreading, older compilers can no longer deliver performance gains as technology advances. Any multicore architecture relies on improving parallelism rather than on improving single-core performance. A compiler that is completely parallel and optimized is yet to be developed and would require significant effort to create. On careful analysis we find that the performance of a compiler is majorly affected by the lexical analyzer's scanning and tokenizing phases. This effort is directed towards the creation of a completely parallelized lexical analyzer designed to run on the Cell/B.E. processor that utilizes its multicore functionalities to achieve high performance gains in a compiler. Each SPE reads a block of data from the input and tokenizes it independently. To prevent dependence among SPEs, a scheme for dynamically extending static block limits is incorporated. Each SPE is given a range which it initially scans and then finalizes its input buffer to a set of complete tokens from the range dynamically. This ensures that the SPEs parallelize independently and dynamically, with the PPE scheduling the load for each SPE. The initially static assignment of the code blocks is made dynamic as soon as one SPE commits. This aids SPE load distribution and balancing. The PPE maintains the output buffer until all SPEs of a single stage commit and move to the next stage before it is written out to the file, to maintain order of execution. The approach can be extended easily to other multicore architectures as well. Tokenization is performed by high-speed string searching over the keyword dictionary of the language, using the Aho-Corasick algorithm.
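
    The Aho-Corasick matcher named above finds every dictionary keyword in a single pass over the input, which is what makes per-SPE tokenization linear in the block size. A compact reference sketch (trie plus failure links), independent of the Cell-specific buffering:

```python
from collections import deque

def build_automaton(keywords):
    """Build an Aho-Corasick automaton: goto trie, failure links, and the
    set of keywords recognized at each state."""
    goto, fail, out = [{}], [0], [set()]
    for w in keywords:
        s = 0
        for ch in w:
            if ch not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(w)
    q = deque(goto[0].values())          # depth-1 states fail to the root
    while q:
        s = q.popleft()
        for ch, t in goto[s].items():
            q.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0) if goto[f].get(ch, 0) != t else 0
            out[t] |= out[fail[t]]       # inherit matches via the fail link
    return goto, fail, out

def search(text, automaton):
    """Single left-to-right pass; report (start_index, keyword) matches."""
    goto, fail, out = automaton
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for w in out[s]:
            hits.append((i - len(w) + 1, w))
    return hits

ac = build_automaton(["he", "she", "his", "hers"])
matches = sorted(search("ushers", ac))
```

The dynamic block-limit scheme then only needs each worker to extend its scan past a static boundary until the automaton returns to a token boundary.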

  13. The Multicore-Aware Data Transfer Middleware Project

    Energy Technology Data Exchange (ETDEWEB)

    Wu, Wenji [Fermilab; Zhang, L. [Fermilab; DeMar, P. [Fermilab; Li, Tan [Stony Brook U.; Ren, Y. [Stony Brook U.; Jin, S. [stony Brook U.; Yu, D. [Brookhaven

    2014-10-30

    Existing data movement tools are still bound by major inefficiencies when running on multicore systems. To address these inefficiencies and limitations, DOE’s Advanced Scientific Computing Research (ASCR) office has funded Fermilab and Brookhaven National Laboratory to collaboratively work on the Multicore-Aware Data Transfer Middleware (MDTM) project. MDTM aims to accelerate data movement toolkits on multicore systems. A prototype version of MDTM is currently undergoing evaluation and enhancement.

  14. MPC Related Computational Capabilities of ARMv7A Processors

    DEFF Research Database (Denmark)

    Frison, Gianluca; Jørgensen, John Bagterp

    2015-01-01

    In recent years, the mass market of mobile devices has pushed the demand for increasingly fast but cheap processors. ARM, the world leader in this sector, has developed the Cortex-A series of processors with focus on computationally intensive applications. If properly programmed, these processors...... are powerful enough to solve the complex optimization problems arising in MPC in real-time, while keeping the traditional low-cost and low-power consumption. This makes these processors ideal candidates for use in embedded MPC. In this paper, we investigate the floating-point capabilities of Cortex A7, A9...

  15. Multicore-periphery structure in networks

    CERN Document Server

    Yan, Bowen

    2016-01-01

    Many real-world networked systems exhibit a multicore-periphery structure, i.e., multiple cores, each of which contains densely connected elements, surrounded by sparsely connected elements that define the periphery. Identification of the multicore-periphery structure can provide a new handle on structures and functions of various complex networks, such as cognitive and biological networks, food webs, social networks, and communication and transportation networks. However, no quantitative method yet exists to identify the multicore-periphery structure embedded in networks. Prior studies on core-periphery structure focused on either dichotomous or continuous division of a network into a single core and a periphery, whereas community detection algorithms do not discern the periphery from dense cohesive communities. Herein, we introduce a method to identify the optimal partition of a network into multiple dense cores and a loosely-connected periphery, and test the method on a well-known social network and the ...

  16. Multi-core operations setup in Modern PCs under Linux

    OpenAIRE

    相山, 長和; Aiyama, Toshikazu

    2012-01-01

    Most modern personal computers have a multi-core central processing unit, and the graphics processing unit is also multi-core. We present here a detailed installation process of an environment which enables us to exploit the multiple cores of each architecture. Special attention is given to 64-bit operation and double precision under Linux. The main purpose of this paper is to show a framework for multi-threaded and multi-core operations in both the central processing unit and the graphical processing ...

  17. MultIMA - Multi-Core in Integrated Modular Avionics

    Science.gov (United States)

    Silva, Claudio; Tatibana, Cassia

    2014-08-01

    Multi-core technologies are the natural trend towards fulfilling recent space applications requirements. However, the adoption of multi-core implies increased complexity that must be addressed by application redesign or the implementation of explicit supporting mechanisms. GMV investigates multi-core and Integrated Modular Avionics as cooperative vehicles to achieve reliable support for future safety critical applications. In this paper, we describe the main challenges met in our investigations and how multi-core solutions were implemented in GMV's IMA simulator (SIMA) and operating system (AIR).

  18. Scientific computing with multicore and accelerators

    CERN Document Server

    Kurzak, Jakub; Dongarra, Jack

    2010-01-01

    Dense Linear Algebra: Implementing Matrix Multiplication on the Cell B.E., Wesley Alvaro, Jakub Kurzak, and Jack Dongarra; Implementing Matrix Factorizations on the Cell BE, Jakub Kurzak and Jack Dongarra; Dense Linear Algebra for Hybrid GPU-Based Systems, Stanimire Tomov and Jack Dongarra; BLAS for GPUs, Rajib Nath, Stanimire Tomov, and Jack Dongarra. Sparse Linear Algebra: Sparse Matrix-Vector Multiplication on Multicore and Accelerators, Samuel Williams, Nathan B

  19. Benchmarking a DSP processor

    OpenAIRE

    Lennartsson, Per; Nordlander, Lars

    2002-01-01

    This Master thesis describes the benchmarking of a DSP processor. Benchmarking means measuring the performance in some way. In this report, we have focused on the number of instruction cycles needed to execute certain algorithms. The algorithms we have used in the benchmark are all very common in signal processing today. The results we have reached in this thesis have been compared to benchmarks for other processors, performed by Berkeley Design Technology, Inc. The algorithms were programm...

  20. A generic and compositional framework for multicore response time analysis

    NARCIS (Netherlands)

    Altmeyer, S.; Davis, R.I.; Indrusiak, L.; Maiza, C.; Nelis, V.; Reineke, J.

    2015-01-01

    In this paper, we introduce a Multicore Response Time Analysis (MRTA) framework. This framework is extensible to different multicore architectures, with various types and arrangements of local memory, and different arbitration policies for the common interconnects. We instantiate the framework for s

  1. Real time processor for array speckle interferometry

    Science.gov (United States)

    Chin, Gordon; Florez, Jose; Borelli, Renan; Fong, Wai; Miko, Joseph; Trujillo, Carlos

    1989-01-01

    The authors are constructing a real-time processor to acquire image frames, perform array flat-fielding, execute a 64 x 64 element two-dimensional complex FFT (fast Fourier transform) and average the power spectrum, all within the 25 ms coherence time for speckles at near-IR (infrared) wavelength. The processor will be a compact unit controlled by a PC with real-time display and data storage capability. This will provide the ability to optimize observations and obtain results on the telescope rather than waiting several weeks before the data can be analyzed and viewed with offline methods. The image acquisition and processing, design criteria, and processor architecture are described.

  2. T-L Plane Abstraction-Based Energy-Efficient Real-Time Scheduling for Multi-Core Wireless Sensors.

    Science.gov (United States)

    Kim, Youngmin; Lee, Ki-Seong; Pham, Ngoc-Son; Lee, Sun-Ro; Lee, Chan-Gun

    2016-07-08

    Energy efficiency is considered as a critical requirement for wireless sensor networks. As more wireless sensor nodes are equipped with multi-cores, there are emerging needs for energy-efficient real-time scheduling algorithms. The T-L plane-based scheme is known to be an optimal global scheduling technique for periodic real-time tasks on multi-cores. Unfortunately, there has been a scarcity of studies on extending T-L plane-based scheduling algorithms to exploit energy-saving techniques. In this paper, we propose a new T-L plane-based algorithm enabling energy-efficient real-time scheduling on multi-core sensor nodes with dynamic power management (DPM). Our approach addresses the overhead of processor mode transitions and reduces fragmentations of the idle time, which are inherent in T-L plane-based algorithms. Our experimental results show the effectiveness of the proposed algorithm compared to other energy-aware scheduling methods on T-L plane abstraction.

  3. An Energy-Aware Runtime Management of Multi-Core Sensory Swarms

    Directory of Open Access Journals (Sweden)

    Sungchan Kim

    2017-08-01

    Full Text Available In sensory swarms, minimizing energy consumption under a performance constraint is one of the key objectives. One possible approach to this problem is to monitor the application workload, which is subject to change at runtime, and to adjust the system configuration adaptively to satisfy the performance goal. As today’s sensory swarms are usually implemented using multi-core processors with adjustable clock frequency, we propose to monitor the CPU workload periodically and adjust the task-to-core allocation or clock frequency in an energy-efficient way in response to the workload variations. In doing so, we present an online heuristic that determines the most energy-efficient adjustment that satisfies the performance requirement. The proposed method is based on a simple yet effective energy model that is built upon performance prediction using IPC (instructions per cycle) measured online and a power equation derived empirically. The use of IPC accounts for the memory intensity of a given workload, enabling the accurate prediction of execution time. Hence, the model allows us to rapidly and accurately estimate the effect of the two control knobs, clock frequency adjustment and core allocation. The experiments show that the proposed technique delivers considerable energy saving of up to 45% compared to the state-of-the-art multi-core energy management technique.
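    The prediction step described above can be sketched in a few lines: execution time is estimated as instructions / (IPC × frequency), and the controller picks the lowest clock frequency that still meets the deadline. All function names and the numbers below are illustrative, not the paper's actual model:

    ```python
    # Minimal sketch of IPC-based frequency selection (illustrative, not the
    # paper's exact heuristic): predict execution time from a measured IPC and
    # choose the lowest frequency setting that still meets the deadline.

    def predicted_time(instructions, ipc, freq_hz):
        """Execution time = instructions / (IPC * frequency)."""
        return instructions / (ipc * freq_hz)

    def lowest_sufficient_freq(instructions, ipc, deadline_s, freqs_hz):
        """Return the lowest frequency whose predicted time meets the deadline."""
        for f in sorted(freqs_hz):
            if predicted_time(instructions, ipc, f) <= deadline_s:
                return f
        return max(freqs_hz)  # deadline unreachable: fall back to fastest setting

    freqs = [600e6, 800e6, 1.0e9, 1.2e9]  # illustrative DVFS operating points
    f = lowest_sufficient_freq(instructions=1.2e9, ipc=1.5, deadline_s=1.0,
                               freqs_hz=freqs)
    print(f)  # lowest frequency meeting the 1 s deadline
    ```

    Because IPC already folds in memory stalls, the same instruction count at a lower IPC yields a longer predicted time and hence a higher chosen frequency.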

  4. CRBLASTER: a fast parallel-processing program for cosmic ray rejection

    Science.gov (United States)

    Mighell, Kenneth J.

    2008-08-01

    Many astronomical image-analysis programs are based on algorithms that can be described as embarrassingly parallel, where the analysis of one subimage generally does not affect the analysis of another subimage. Yet few parallel-processing astrophysical image-analysis programs exist that can easily take full advantage of today's fast multi-core servers costing a few thousand dollars. A major reason for the shortage of state-of-the-art parallel-processing astrophysical image-analysis codes is that the writing of parallel codes has been perceived to be difficult. I describe a new fast parallel-processing image-analysis program called crblaster which does cosmic ray rejection using van Dokkum's L.A.Cosmic algorithm. crblaster is written in C using the industry standard Message Passing Interface (MPI) library. Processing a single 800×800 HST WFPC2 image takes 1.87 seconds using 4 processes on an Apple Xserve with two dual-core 3.0-GHz Intel Xeons; the efficiency of the program running with the 4 processors is 82%. The code can be used as a software framework for easy development of parallel-processing image-analysis programs using embarrassingly parallel algorithms; the biggest required modification is the replacement of the core image processing function with an alternative image-analysis function based on a single-processor algorithm. I describe the design, implementation and performance of the program.
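    The embarrassingly parallel pattern described above can be sketched as follows (in Python with a thread pool for brevity; crblaster itself is C using MPI processes). The doubling "filter" is a stand-in for a real single-processor algorithm such as cosmic ray rejection:

    ```python
    # Sketch of the split / map / reassemble pattern: partition an image into
    # horizontal strips, filter each strip independently, then stitch the
    # results back together. Workers never touch each other's strips, which is
    # what makes the problem embarrassingly parallel.
    from concurrent.futures import ThreadPoolExecutor

    def filter_strip(strip):
        # Stand-in per-pixel operation; real code would run L.A.Cosmic here.
        return [[px * 2 for px in row] for row in strip]

    def process_image(image, nworkers=4):
        step = (len(image) + nworkers - 1) // nworkers
        strips = [image[i:i + step] for i in range(0, len(image), step)]
        with ThreadPoolExecutor(max_workers=nworkers) as ex:
            done = list(ex.map(filter_strip, strips))
        return [row for strip in done for row in strip]

    img = [[1, 2], [3, 4], [5, 6], [7, 8]]
    print(process_image(img, nworkers=2))
    ```

    Swapping in a different single-processor `filter_strip` is the only change needed to reuse the framework, mirroring the modification the abstract describes.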

  5. Real-time Vision using FPGAs, GPUs and Multi-core CPUs

    DEFF Research Database (Denmark)

    Kjær-Nielsen, Anders

    the introduction and evolution of a wide variety of powerful hardware architectures have made the developed theory more applicable in performance demanding and real-time applications. Three different architectures have dominated the field due to their parallel capabilities that are often desired when dealing...... processors in the vision community. The introduction of programming languages like CUDA from NVIDIA has made it easier to utilize the high parallel processing powers of the GPU for general purpose computing and thereby realistic to use based on the effort involved with development. The increased clock......-linear filtering processes on FPGAs that has been used for preprocessing images in the context of a bigger Early Cognitive Vision (ECV) system. With the introduction of GPUs for general purpose computing the preprocessing was re-implemented on this architecture and used together with a multi-core CPU to form...

  6. Multi-Core DSP Based Parallel Architecture for FMCW SAR Real-Time Imaging

    Directory of Open Access Journals (Sweden)

    C. F. Gu

    2015-12-01

    Full Text Available This paper presents an efficient parallel processing architecture using a multi-core Digital Signal Processor (DSP) to improve the capability of real-time imaging for Frequency Modulated Continuous Wave Synthetic Aperture Radar (FMCW SAR). With the application of the proposed processing architecture, the imaging algorithm is modularized, and each module is efficiently realized by the proposed processing architecture. In each module, the data processing of the different cores is executed in parallel, and the data transmission and data processing of each core are carried out synchronously, so that the processing time for SAR imaging is reduced significantly. In particular, the time of the corner turning operation, which is very time-consuming, becomes negligible in the computationally intensive case. The proposed parallel architecture is applied to a compact Ku-band FMCW SAR prototype to achieve real-time imagery with 34 cm x 51 cm (range x azimuth) resolution.

  7. Accelerating patch-based directional wavelets with multicore parallel computing in compressed sensing MRI.

    Science.gov (United States)

    Li, Qiyue; Qu, Xiaobo; Liu, Yunsong; Guo, Di; Lai, Zongying; Ye, Jing; Chen, Zhong

    2015-06-01

    Compressed sensing MRI (CS-MRI) is a promising technology to accelerate magnetic resonance imaging. Both improving the image quality and reducing the computation time are important for this technology. Recently, a patch-based directional wavelet (PBDW) has been applied in CS-MRI to improve edge reconstruction. However, this method is time consuming since it involves extensive computations, including geometric direction estimation and numerous iterations of wavelet transform. To accelerate the computations of PBDW, we propose a general parallelization of patch-based processing by taking advantage of multicore processors. Additionally, two pertinent optimizations, excluding smooth patches and pre-arranged insertion sort, that make use of sparsity in MR images are also proposed. Simulation results demonstrate that the acceleration factor with the parallel architecture of PBDW approaches the number of central processing unit cores, and that the pertinent optimizations are also effective in making further accelerations. The proposed approaches allow compressed sensing MRI reconstruction to be accomplished within several seconds.
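    The two ideas above, parallelizing over patches and skipping smooth patches, can be sketched roughly as follows. The patch size, variance threshold, and the trivial "transform" are illustrative stand-ins, not the paper's actual PBDW computation:

    ```python
    # Sketch: process patches in parallel and short-circuit "smooth" patches
    # whose variance falls below a threshold, so the costly direction
    # estimation + wavelet step runs only on structured patches.
    from concurrent.futures import ThreadPoolExecutor

    def variance(patch):
        flat = [v for row in patch for v in row]
        mean = sum(flat) / len(flat)
        return sum((v - mean) ** 2 for v in flat) / len(flat)

    def transform_patch(patch, smooth_threshold=1e-3):
        if variance(patch) < smooth_threshold:  # smooth patch: skip costly work
            return patch
        # Stand-in for direction estimation + directional wavelet transform.
        return [[v + 1 for v in row] for row in patch]

    def transform_all(patches, workers=4):
        # Patches are independent, so they map cleanly onto a worker pool.
        with ThreadPoolExecutor(max_workers=workers) as ex:
            return list(ex.map(transform_patch, patches))

    smooth = [[5, 5], [5, 5]]   # zero variance: returned untouched
    edgy = [[0, 9], [9, 0]]     # high variance: gets the full transform
    print(transform_all([smooth, edgy]))
    ```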

  8. Efficient Co-Simulation of Multicore Systems

    DEFF Research Database (Denmark)

    Brock-Nannestad, Laust; Karlsson, Sven

    2011-01-01

    the hardware state of a multicore design while it is running on an FPGA. With minimal changes to the design and using only the built-in JTAG programming and debug- ging facilities, we describe how to transfer the state from an FPGA to a simulator. We also show how the state can be transferred back from...... the simulator to FPGA. Given that the design runs in real-time on the FPGA, the end result is speed improvements of orders of magnitude over traditional pure software simulation....

  9. Widefield lensless endoscopy with a multicore fiber

    CERN Document Server

    Tsvirkun, Viktor; Bouwmans, Géraud; Katz, Ori; Andresen, Esben Ravn; Rigneault, Hervé

    2016-01-01

    We demonstrate pixelation-free real-time widefield endoscopic imaging through an aperiodic multicore fiber (MCF) without any distal opto-mechanical elements or proximal scanners. Exploiting the memory effect in MCFs the images in our system are directly obtained without any post-processing using a static wavefront correction obtained from a single calibration procedure. Our approach allows for video-rate 3D widefield imaging of incoherently illuminated objects with imaging speed not limited by the wavefront shaping device refresh rate.

  10. Hardware multiplier processor

    Science.gov (United States)

    Pierce, Paul E.

    1986-01-01

    A hardware processor is disclosed which in the described embodiment is a memory mapped multiplier processor that can operate in parallel with a 16 bit microcomputer. The multiplier processor decodes the address bus to receive specific instructions so that in one access it can write and automatically perform single or double precision multiplication involving a number written to it, with or without addition or subtraction with a previously stored number. It can also, on a single read command, automatically round and scale a previously stored number. The multiplier processor includes two concatenated 16 bit multiplier registers, two concatenated 16 bit multipliers, and four 16 bit product registers connected to an internal 16 bit data bus. A high level address decoder determines when the multiplier processor is being addressed, and first and second low level address decoders generate control signals. In addition, certain low order address lines are used to carry uncoded control signals. First and second control circuits coupled to the decoders generate further control signals and generate a plurality of clocking pulse trains in response to the decoded and address control signals.
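    A toy software model may help illustrate the idea of address-decoded operations; the address map and register names below are invented for illustration, not the ones disclosed in the patent:

    ```python
    # Toy model of a memory-mapped multiplier: the "address" written to selects
    # the operation, so a single bus write both delivers an operand and
    # triggers a multiply or multiply-accumulate against a stored number.
    class MappedMultiplier:
        WRITE_MUL = 0x00  # write operand: product = operand * stored
        WRITE_MAC = 0x04  # write operand: product += operand * stored

        def __init__(self, stored=0):
            self.stored = stored   # previously stored number
            self.product = 0       # product register

        def write(self, addr, value):
            # The decoded address acts as the (uncoded) control signal.
            if addr == self.WRITE_MUL:
                self.product = value * self.stored
            elif addr == self.WRITE_MAC:
                self.product += value * self.stored

        def read(self):
            return self.product

    m = MappedMultiplier(stored=3)
    m.write(MappedMultiplier.WRITE_MUL, 5)  # product = 5 * 3 = 15
    m.write(MappedMultiplier.WRITE_MAC, 2)  # product += 2 * 3 -> 21
    print(m.read())  # 21
    ```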

  11. Signal processor packaging design

    Science.gov (United States)

    McCarley, Paul L.; Phipps, Mickie A.

    1993-10-01

    The Signal Processor Packaging Design (SPPD) program was a technology development effort to demonstrate that a miniaturized, high throughput programmable processor could be fabricated to meet the stringent environment imposed by high speed kinetic energy guided interceptor and missile applications. This successful program culminated with the delivery of two very small processors, each about the size of a large pin grid array package. Rockwell International's Tactical Systems Division in Anaheim, California developed one of the processors, and the other was developed by Texas Instruments' (TI) Defense Systems and Electronics Group (DSEG) of Dallas, Texas. The SPPD program was sponsored by the Guided Interceptor Technology Branch of the Air Force Wright Laboratory's Armament Directorate (WL/MNSI) at Eglin AFB, Florida and funded by SDIO's Interceptor Technology Directorate (SDIO/TNC). These prototype processors were subjected to rigorous tests of their image processing capabilities, and both successfully demonstrated the ability to process 128 X 128 infrared images at a frame rate of over 100 Hz.

  12. The Distributed Network Processor: a novel off-chip and on-chip interconnection network architecture

    CERN Document Server

    Biagioni, Andrea; Lonardo, Alessandro; Paolucci, Pier Stanislao; Perra, Mersia; Rossetti, Davide; Sidore, Carlo; Simula, Francesco; Tosoratto, Laura; Vicini, Piero

    2012-01-01

    One of the most demanding challenges for the designers of parallel computing architectures is to deliver an efficient network infrastructure providing low latency, high bandwidth communications while preserving scalability. Besides off-chip communications between processors, recent multi-tile (i.e. multi-core) architectures face the challenge for an efficient on-chip interconnection network between processor's tiles. In this paper, we present a configurable and scalable architecture, based on our Distributed Network Processor (DNP) IP Library, targeting systems ranging from single MPSoCs to massive HPC platforms. The DNP provides inter-tile services for both on-chip and off-chip communications with a uniform RDMA style API, over a multi-dimensional direct network with a (possibly) hybrid topology.

  13. Data Parallel Bin-Based Indexing for Answering Queries on Multi-Core Architectures

    Energy Technology Data Exchange (ETDEWEB)

    Gosink, Luke; Wu, Kesheng; Bethel, E. Wes; Owens, John D.; Joy, Kenneth I.

    2009-06-02

    The multi-core trend in CPUs and general purpose graphics processing units (GPUs) offers new opportunities for the database community. The increase of cores at exponential rates is likely to affect virtually every server and client in the coming decade, and presents database management systems with a huge, compelling disruption that will radically change how processing is done. This paper presents a new parallel indexing data structure for answering queries that takes full advantage of the increasing thread-level parallelism emerging in multi-core architectures. In our approach, our Data Parallel Bin-based Index Strategy (DP-BIS) first bins the base data, and then partitions and stores the values in each bin as a separate, bin-based data cluster. In answering a query, the procedures for examining the bin numbers and the bin-based data clusters offer the maximum possible level of concurrency; each record is evaluated by a single thread and all threads are processed simultaneously in parallel. We implement and demonstrate the effectiveness of DP-BIS on two multi-core architectures: a multi-core CPU and a GPU. The concurrency afforded by DP-BIS allows us to fully utilize the thread-level parallelism provided by each architecture--for example, our GPU-based DP-BIS implementation simultaneously evaluates over 12,000 records with an equivalent number of concurrently executing threads. In comparing DP-BIS's performance across these architectures, we show that the GPU-based DP-BIS implementation requires significantly less computation time to answer a query than the CPU-based implementation. We also demonstrate in our analysis that DP-BIS provides better overall performance than the commonly utilized CPU and GPU-based projection index. Finally, due to data encoding, we show that DP-BIS accesses significantly smaller amounts of data than index strategies that operate solely on a column's base data; this smaller data footprint is critical for parallel processors
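    A much-simplified serial sketch of the bin-based strategy (bin boundaries and query values are illustrative): interior bins of a range query are answered from bin numbers alone, and only the boundary bins' data clusters need to be examined, which is what exposes the record-per-thread parallelism described above.

    ```python
    # Sketch of bin-based indexing: bin the base data once, then answer a range
    # query by counting interior bins wholesale and scanning only boundary bins.
    import bisect

    def build_index(values, edges):
        bins = [[] for _ in range(len(edges) + 1)]
        for v in values:
            bins[bisect.bisect_right(edges, v)].append(v)  # bin-based cluster
        return bins

    def range_count(bins, edges, lo, hi):
        lo_bin = bisect.bisect_right(edges, lo)
        hi_bin = bisect.bisect_right(edges, hi)
        # Interior bins: every record qualifies, no data access needed.
        count = sum(len(bins[b]) for b in range(lo_bin + 1, hi_bin))
        # Boundary bins: examine the stored values (candidate check); in the
        # parallel version each record here gets its own thread.
        boundary = (lo_bin, hi_bin) if lo_bin != hi_bin else (lo_bin,)
        for b in boundary:
            count += sum(1 for v in bins[b] if lo <= v <= hi)
        return count

    edges = [10, 20, 30]
    data = [5, 12, 15, 22, 28, 35]
    bins = build_index(data, edges)
    print(range_count(bins, edges, 12, 28))  # values in [12, 28]
    ```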

  14. The Berkeley Out-of-Order Machine (BOOM): An Industry-Competitive, Synthesizable, Parameterized RISC-V Processor

    Science.gov (United States)

    2015-06-13

    Illinois Verilog Model (IVM) is a 4-issue, out-of-order core designed to study transient faults [13]. The Santa Cruz Out-of-Order RISC Engine (SCOORE...ISCA-35, 2008. [8] B. H. Dwiel et al., "FPGA modeling of diverse superscalar processors," in Performance Analysis of Systems and Software (ISPASS), 2012...al., "Rationale for a 3D heterogeneous multi-core processor," in Computer Design (ICCD), 2013 IEEE 31st International Conference on, Oct 2013, pp. 154

  15. The Milstar Advanced Processor

    Science.gov (United States)

    Tjia, Khiem-Hian; Heely, Stephen D.; Morphet, John P.; Wirick, Kevin S.

    The Milstar Advanced Processor (MAP) is a 'drop-in' replacement for its predecessor which preserves existing interfaces with other Milstar satellite processors and minimizes the impact of such upgrading on already-developed application software. In addition to flight software development, and hardware development that involves the application of VHSIC technology to the electrical design, the MAP project is developing two sophisticated and similar test environments. High density RAM and ROM are employed by the MAP memory array. Attention is given to the fine-pitch VHSIC design techniques and lead designs used, as well as the role of TQM and concurrent engineering in the development of the MAP manufacturing process.

  16. Modules for Pipelined Mixed Radix FFT Processors

    Directory of Open Access Journals (Sweden)

    Anatolij Sergiyenko

    2016-01-01

    Full Text Available A set of soft IP cores for the Winograd r-point fast Fourier transform (FFT) is considered. The cores are designed by the method of spatial SDF mapping into hardware, which minimizes hardware volume at the cost of slowing the algorithm down by a factor of r. Their clock frequency is equal to the data sampling frequency. The cores are intended for high-speed pipelined FFT processors implemented in FPGAs.

  17. Cross talk analysis in multicore optical fibers by supermode theory.

    Science.gov (United States)

    Szostkiewicz, Lukasz; Napierala, Marek; Ziolowicz, Anna; Pytel, Anna; Tenderenda, Tadeusz; Nasilowski, Tomasz

    2016-08-15

    We discuss the theoretical aspects of core-to-core power transfer in multicore fibers relying on supermode theory. Based on a dual core fiber model, we investigate the consequences of this approach, such as the influence of initial excitation conditions on cross talk. Supermode interpretation of power coupling proves to be intuitive and thus may lead to new concepts of multicore fiber-based devices. As a conclusion, we propose a definition of a uniform cross talk parameter that describes multicore fiber design.
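    For the ideal symmetric dual-core case, the standard coupled-mode result (a textbook simplification, not the paper's full supermode treatment) gives the power transferred to the second core as P2(z) = sin²(κz); the coupling coefficient below is an arbitrary illustrative value:

    ```python
    # Coupled-mode sketch for an ideal symmetric dual-core fiber: with all
    # power launched into core 1, the fraction in core 2 after propagating a
    # distance z is sin^2(kappa * z), so crosstalk grows and returns
    # periodically with the beat length pi / kappa.
    import math

    def core_powers(kappa, z):
        """Power fractions (core 1, core 2) for unit power launched in core 1."""
        p2 = math.sin(kappa * z) ** 2
        return 1.0 - p2, p2

    kappa = 0.5                                 # coupling coefficient, 1/m (illustrative)
    p1, p2 = core_powers(kappa, math.pi / (2 * kappa))  # half a beat length
    print(round(p1, 6), round(p2, 6))           # complete transfer to core 2
    ```

    The dependence on launch conditions mentioned in the abstract shows up here as the choice of initial power split; exciting a single supermode instead would yield no beating at all.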

  18. Evaluation of a Multicore-Optimized Implementation for Tomographic Reconstruction

    Science.gov (United States)

    Agulleiro, Jose-Ignacio; Fernández, José Jesús

    2012-01-01

    Tomography allows elucidation of the three-dimensional structure of an object from a set of projection images. In life sciences, electron microscope tomography is providing invaluable information about the cell structure at a resolution of a few nanometres. Here, large images are required to combine wide fields of view with high resolution requirements. The computational complexity of the algorithms along with the large image size then turns tomographic reconstruction into a computationally demanding problem. Traditionally, high-performance computing techniques have been applied to cope with such demands on supercomputers, distributed systems and computer clusters. In the last few years, the trend has turned towards graphics processing units (GPUs). Here we present a detailed description and a thorough evaluation of an alternative approach that relies on exploitation of the power available in modern multicore computers. The combination of single-core code optimization, vector processing, multithreading and efficient disk I/O operations succeeds in providing fast tomographic reconstructions on standard computers. The approach turns out to be competitive with the fastest GPU-based solutions thus far. PMID:23139768

  19. Evaluation of a multicore-optimized implementation for tomographic reconstruction.

    Directory of Open Access Journals (Sweden)

    Jose-Ignacio Agulleiro

    Full Text Available Tomography allows elucidation of the three-dimensional structure of an object from a set of projection images. In life sciences, electron microscope tomography is providing invaluable information about the cell structure at a resolution of a few nanometres. Here, large images are required to combine wide fields of view with high resolution requirements. The computational complexity of the algorithms along with the large image size then turns tomographic reconstruction into a computationally demanding problem. Traditionally, high-performance computing techniques have been applied to cope with such demands on supercomputers, distributed systems and computer clusters. In the last few years, the trend has turned towards graphics processing units (GPUs). Here we present a detailed description and a thorough evaluation of an alternative approach that relies on exploitation of the power available in modern multicore computers. The combination of single-core code optimization, vector processing, multithreading and efficient disk I/O operations succeeds in providing fast tomographic reconstructions on standard computers. The approach turns out to be competitive with the fastest GPU-based solutions thus far.

  20. Interactive Digital Signal Processor

    Science.gov (United States)

    Mish, W. H.

    1985-01-01

    The Interactive Digital Signal Processor, IDSP, consists of a set of time series analysis "operators" based on various algorithms commonly used for digital signal analysis. Processing of a digital signal time series to extract information is usually achieved by applying a number of fairly standard operations. IDSP is also an excellent teaching tool for demonstrating the application of time series operators to artificially generated signals.
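    The operator idea can be illustrated by chaining two generic time-series operators on an artificially generated signal; the operators below are generic examples, not IDSP's actual operator set:

    ```python
    # Two standard time-series operators (mean removal and a moving average)
    # applied in sequence to a synthetic signal, in the spirit of composing
    # fairly standard operations to extract information.
    import math

    def detrend(x):
        """Remove the mean of the series."""
        mean = sum(x) / len(x)
        return [v - mean for v in x]

    def smooth(x, width=3):
        """Centered moving average with shrinking windows at the edges."""
        half = width // 2
        return [sum(x[max(0, i - half):i + half + 1]) /
                len(x[max(0, i - half):i + half + 1]) for i in range(len(x))]

    # Artificial test signal: a sine with a constant offset.
    signal = [math.sin(2 * math.pi * i / 16) + 1.0 for i in range(64)]
    d = detrend(signal)
    out = smooth(d)
    print(len(out), round(sum(d) / len(d), 6))  # same length; mean removed
    ```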

  1. Beyond processor sharing

    NARCIS (Netherlands)

    Aalto, S.; Ayesta, U.; Borst, S.C.; Misra, V.; Núñez Queija, R.

    2007-01-01

    While the (Egalitarian) Processor-Sharing (PS) discipline offers crucial insights in the performance of fair resource allocation mechanisms, it is inherently limited in analyzing and designing differentiated scheduling algorithms such as Weighted Fair Queueing and Weighted Round-Robin. The Discrimin

  2. ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors

    DEFF Research Database (Denmark)

    Liu, Weifeng; Vinter, Brian

    2014-01-01

    -heap, an efficient heap data structure that introduces an implicit bridge structure and properly apportions workloads to the two types of cores. We implement a batch k-selection algorithm and conduct experiments on simulated AMP environments composed of real CPUs and GPUs. In our experiments on two representative...... platforms, the ad-heap obtains up to 1.5x and 3.6x speedup over the optimal AMP scheduling method that executes the fastest d-heaps on the standalone CPUs and GPUs in parallel....
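    For background, a plain d-ary max-heap (the baseline d-heap that the ad-heap is compared against) can be sketched as follows; wider nodes trade a shallower tree for more comparisons per level, which is what maps well onto wide SIMD/GPU lanes. The choice d = 4 here is arbitrary:

    ```python
    # Minimal d-ary max-heap: parent of i is (i-1)//d, children of i are
    # d*i+1 .. d*i+d. Larger d means fewer levels but wider sift-down steps.
    class DHeap:
        def __init__(self, d=4):
            self.d, self.a = d, []

        def push(self, v):
            self.a.append(v)
            i = len(self.a) - 1
            while i > 0 and self.a[(i - 1) // self.d] < self.a[i]:
                p = (i - 1) // self.d
                self.a[i], self.a[p] = self.a[p], self.a[i]
                i = p

        def pop(self):
            a, d = self.a, self.d
            top, last = a[0], a.pop()
            if a:
                a[0] = last
                i = 0
                while True:
                    kids = range(d * i + 1, min(d * i + d + 1, len(a)))
                    big = max(kids, key=a.__getitem__, default=None)
                    if big is None or a[big] <= a[i]:
                        break
                    a[i], a[big] = a[big], a[i]
                    i = big
            return top

    h = DHeap(d=4)
    for v in [3, 1, 4, 1, 5, 9, 2, 6]:
        h.push(v)
    print([h.pop() for _ in range(4)])  # four largest values, descending
    ```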

  3. Application of Advanced Multi-Core Processor Technologies to Oceanographic Research

    Science.gov (United States)

    2014-09-30

    relatively few units manufactured, it is now common to see field-programmable gate arrays (FPGAs) added to designs, sometimes replacing general-purpose...Objective-C, Ruby, C#.NET, JavaScript, and C and are capable of running on Apple iOS, Macintosh OSX, Android, Microsoft Windows, Linux and any modern

  4. Domain Expert-Directed Program Optimizations for Accelerated Performance on Heterogeneous Multi-core Processors

    Science.gov (United States)

    2013-12-01

    modulo) operation as used in Ed25519 (A ← A + B·C mod 2^255 − 19) in the radix-2^51 redundant representation, and will complete the verification of the same...Taiwan University this year. He is a young up-and-coming expert on formal verification and languages, and his impressive résumé includes joint work
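    The field operation named in this fragment can be checked with plain Python integers (a naive sketch; verified fast implementations keep values as five 51-bit limbs throughout rather than converting back and forth, and the operand values below are arbitrary):

    ```python
    # Naive reference for A <- A + B*C mod 2^255 - 19, plus a round-trip
    # through the radix-2^51 limb representation (five 51-bit limbs cover
    # exactly 255 bits, so any reduced value decomposes losslessly).
    P = 2**255 - 19  # the Curve25519/Ed25519 field prime

    def mul_add_mod(a, b, c):
        return (a + b * c) % P

    def limbs_to_int(limbs):
        """Recombine radix-2^51 limbs into a field element."""
        return sum(l << (51 * i) for i, l in enumerate(limbs)) % P

    a, b, c = 7, 2**200 + 3, 2**100 + 5   # arbitrary operands
    r = mul_add_mod(a, b, c)
    # Decompose r into five 51-bit limbs and recombine:
    limbs = [(r >> (51 * i)) & ((1 << 51) - 1) for i in range(5)]
    print(limbs_to_int(limbs) == r)  # True: the limb round-trip is exact
    ```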

  5. Investigation of Large Scale Cortical Models on Clustered Multi-Core Processors

    Science.gov (United States)

    2013-02-01

    [Table: acceleration platforms (x86, Cell, GPGPU) examined for the HTM [22], Dean [25], Izhikevich [26], Hodgkin-Huxley [27], and Morris-Lecar [28] models.] ...examined. These are the Hodgkin-Huxley [27], Izhikevich [26], Wilson [29], and Morris-Lecar [28] models. The Hodgkin-Huxley model is considered to be...and inactivation of Na currents). Table 2 compares the computation properties of the four models. The Hodgkin-Huxley model utilizes exponential

  6. ATCA/AXIe compatible board for fast control and data acquisition in nuclear fusion experiments

    Energy Technology Data Exchange (ETDEWEB)

    Batista, A.J.N., E-mail: toquim@ipfn.ist.utl.pt [Associacao EURATOM/IST, Instituto de Plasmas e Fusao Nuclear - Laboratorio Associado, Instituto Superior Tecnico - Universidade Tecnica de Lisboa, Lisboa (Portugal); Leong, C.; Bexiga, V. [INESC-ID, Lisboa (Portugal); Rodrigues, A.P.; Combo, A.; Carvalho, B.B.; Fortunato, J.; Correia, M. [Associacao EURATOM/IST, Instituto de Plasmas e Fusao Nuclear - Laboratorio Associado, Instituto Superior Tecnico - Universidade Tecnica de Lisboa, Lisboa (Portugal); Teixeira, J.P.; Teixeira, I.C. [INESC-ID, Lisboa (Portugal); Sousa, J.; Goncalves, B.; Varandas, C.A.F. [Associacao EURATOM/IST, Instituto de Plasmas e Fusao Nuclear - Laboratorio Associado, Instituto Superior Tecnico - Universidade Tecnica de Lisboa, Lisboa (Portugal)

    2012-12-15

    Highlights: • High performance board for fast control and data acquisition. • Large IO channel number per board with galvanic isolation. • Optimized for high reliability and availability. • Targeted for nuclear fusion experiments with long duration discharges. • To be used on the ITER Fast Plant System Controller prototype. - Abstract: An in-house development of an Advanced Telecommunications Computing Architecture (ATCA) board for fast control and data acquisition, with Input/Output (IO) processing capability, is presented. The architecture, compatible with the ATCA (PICMG 3.4) and ATCA eXtensions for Instrumentation (AXIe) specifications, comprises a passive Rear Transition Module (RTM) for IO connectivity to ease hot-swap maintenance and simultaneously to increase cabling life cycle. The board complies with ITER Fast Plant System Controller (FPSC) guidelines for rear IO connectivity and redundancy, in order to provide high levels of reliability and availability to the control and data acquisition systems of nuclear fusion devices with long duration plasma discharges. Simultaneously digitized data from all Analog to Digital Converters (ADC) of the board can be filtered/decimated in a Field Programmable Gate Array (FPGA), decreasing data throughput, increasing resolution, and sent through Peripheral Component Interconnect (PCI) Express to multi-core processors in the ATCA shelf hub slots. Concurrently the multi-core processors can update the board Digital to Analog Converters (DAC) in real-time. Full-duplex point-to-point communication links between all FPGAs, of peer boards inside the shelf, allow the implementation of distributed algorithms and Multi-Input Multi-Output (MIMO) systems. Support for several timing and synchronization solutions is also provided. Some key features are onboard ADC or DAC modules with galvanic isolation.

  7. Dynamic Voltage-Frequency and Workload Joint Scaling Power Management for Energy Harvesting Multi-Core WSN Node SoC

    Science.gov (United States)

    Li, Xiangyu; Xie, Nijie; Tian, Xinyue

    2017-01-01

    This paper proposes a scheduling and power management solution for an energy harvesting heterogeneous multi-core WSN node SoC such that the system continues to operate perennially and uses the harvested energy efficiently. The solution consists of a heterogeneous multi-core system oriented task scheduling algorithm and a low-complexity dynamic workload scaling and configuration optimization algorithm suitable for lightweight platforms. Moreover, considering that the power consumption of most WSN applications exhibits data-dependent behavior, we introduce a branch handling mechanism into the solution as well. The experimental results show that the proposed algorithm can operate in real-time on a lightweight embedded processor (MSP430), that it enables a system to do more valuable work, and that it makes use of more than 99.9% of the power budget. PMID:28208730

  8. Dynamic Voltage-Frequency and Workload Joint Scaling Power Management for Energy Harvesting Multi-Core WSN Node SoC

    Directory of Open Access Journals (Sweden)

    Xiangyu Li

    2017-02-01

    Full Text Available This paper proposes a scheduling and power management solution for an energy harvesting heterogeneous multi-core WSN node SoC such that the system continues to operate perennially and uses the harvested energy efficiently. The solution consists of a heterogeneous multi-core system oriented task scheduling algorithm and a low-complexity dynamic workload scaling and configuration optimization algorithm suitable for lightweight platforms. Moreover, considering that the power consumption of most WSN applications exhibits data-dependent behavior, we introduce a branch handling mechanism into the solution as well. The experimental results show that the proposed algorithm can operate in real-time on a lightweight embedded processor (MSP430), that it enables a system to do more valuable work, and that it makes use of more than 99.9% of the power budget.

  9. Dynamic Voltage-Frequency and Workload Joint Scaling Power Management for Energy Harvesting Multi-Core WSN Node SoC.

    Science.gov (United States)

    Li, Xiangyu; Xie, Nijie; Tian, Xinyue

    2017-02-08

    This paper proposes a scheduling and power management solution for an energy harvesting heterogeneous multi-core WSN node SoC such that the system operates perennially and uses the harvested energy efficiently. The solution consists of a task scheduling algorithm oriented to heterogeneous multi-core systems and a low-complexity dynamic workload scaling and configuration optimization algorithm suitable for lightweight platforms. Moreover, considering that the power consumption of most WSN applications is data dependent, we also introduce a branch handling mechanism into the solution. The experimental results show that the proposed algorithm can operate in real time on a lightweight embedded processor (MSP430), and that it lets the system perform more useful work while using more than 99.9% of the power budget.
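The budget-driven workload scaling described above can be illustrated with a minimal sketch under an assumed task/energy model; the task names, numbers, and the greedy value-density policy here are illustrative assumptions, not the paper's algorithm:

```python
# Hypothetical sketch of workload scaling under a harvested-energy budget.
# The greedy value-density heuristic is cheap enough for an MSP430-class MCU;
# it is an illustrative stand-in, not the authors' scheduler.

def scale_workload(tasks, energy_budget):
    """Select tasks maximizing delivered value within the energy budget.

    tasks: list of (name, value, energy_cost) tuples.
    Returns (selected_names, energy_used).
    """
    # Rank tasks by value per unit of energy, highest first.
    ranked = sorted(tasks, key=lambda t: t[1] / t[2], reverse=True)
    selected, used = [], 0.0
    for name, value, cost in ranked:
        if used + cost <= energy_budget:
            selected.append(name)
            used += cost
    return selected, used

tasks = [("sense", 5.0, 1.0), ("compress", 3.0, 2.0), ("transmit", 8.0, 4.0)]
sel, used = scale_workload(tasks, energy_budget=5.5)
```

With these illustrative numbers, "sense" and "transmit" fit the budget while "compress" is deferred to a later harvesting cycle.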

  10. The Central Trigger Processor (CTP)

    CERN Multimedia

    Franchini, Matteo

    2016-01-01

    The Central Trigger Processor (CTP) receives trigger information from the calorimeter and muon trigger processors, as well as from other trigger sources. It makes the Level-1 Accept decision (L1A) based on a trigger menu.

  11. A Domain Specific DSP Processor

    OpenAIRE

    Tell, Eric

    2001-01-01

    This thesis describes the design of a domain specific DSP processor. The thesis is divided into two parts. The first part gives some theoretical background, describes the different steps of the design process (both for DSP processors in general and for this project) and motivates the design decisions made for this processor. The second part is a nearly complete design specification. The intended use of the processor is as a platform for hardware acceleration units. Support for this has howe...

  12. Processor register error correction management

    Science.gov (United States)

    Bose, Pradip; Cher, Chen-Yong; Gupta, Meeta S.

    2016-12-27

    Processor register protection management is disclosed. In embodiments, a method of processor register protection management can include determining a sensitive logical register for executable code generated by a compiler, generating an error-correction table identifying the sensitive logical register, and storing the error-correction table in a memory accessible by a processor. The processor can be configured to generate a duplicate register of the sensitive logical register identified by the error-correction table.
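The duplicate-register scheme in this abstract can be modeled with a small software sketch (class and method names here are hypothetical, not from the patent): writes to registers listed in the error-correction table are mirrored to a shadow copy, and reads of sensitive registers compare the two copies to detect corruption.

```python
# Illustrative model of compiler-identified sensitive-register duplication.
# Names and the detection policy are assumptions for illustration only.

class RegisterFile:
    def __init__(self, sensitive):
        self.sensitive = set(sensitive)   # plays the role of the error-correction table
        self.regs, self.shadow = {}, {}

    def write(self, reg, value):
        self.regs[reg] = value
        if reg in self.sensitive:
            self.shadow[reg] = value      # duplicate sensitive registers on write

    def read(self, reg):
        value = self.regs[reg]
        if reg in self.sensitive and self.shadow[reg] != value:
            # A real design could correct from the duplicate; here we just flag it.
            raise RuntimeError("soft error detected in " + reg)
        return value

rf = RegisterFile(sensitive=["r1"])
rf.write("r1", 42)
rf.regs["r1"] = 41          # simulate a bit flip in the primary copy
```

A subsequent `rf.read("r1")` raises, whereas non-sensitive registers carry no duplication overhead.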

  13. Dual-core Itanium Processor

    CERN Multimedia

    2006-01-01

    Intel’s first dual-core Itanium processor, code-named "Montecito", is a major release of Intel's Itanium 2 processor family, which implements the Intel Itanium architecture on a dual-core processor with two cores per die (integrated circuit). Itanium 2 is much more powerful than its predecessor, with lower power consumption and thermal dissipation.

  14. Automated Parallel Computing Tools for Multicore Machines and Clusters Project

    Data.gov (United States)

    National Aeronautics and Space Administration — We propose to improve productivity of high performance computing for applications on multicore computers and clusters. These machines built from one or more chips...

  15. A Compilation Method and Realization for Heterogeneous Multi-core Systems%一种异构多核系统的编译方法及实现

    Institute of Scientific and Technical Information of China (English)

    刘丹丹; 杨灿美; 倪素萍; 杜学亮

    2015-01-01

    Heterogeneous multi-core processors for accelerating computation in dedicated domains have made rapid progress in recent years. A heterogeneous multi-core processor integrates a number of processor cores with different architectures. Because of this heterogeneity, its programming differs greatly from that of traditional homogeneous multi-core processors: the programmer must write and compile program code separately for each processor-core architecture, which increases the difficulty of software development. Based on an analysis of heterogeneous multi-core processor architecture and the program execution model, we propose a compilation method for heterogeneous multi-core systems and present its implementation. It removes the difficulty of writing and compiling code separately, supports unified programming across heterogeneous cores, and shields the heterogeneity of the underlying hardware, providing convenience for upper-layer application developers.

  16. Single shot polarimetry imaging of multicore fiber

    CERN Document Server

    Sivankutty, Siddharth; Bouwmans, Géraud; Brown, Thomas G; Alonso, Miguel A; Rigneault, Hervé

    2016-01-01

    We report an experimental test of single-shot polarimetry applied to the problem of real-time monitoring of the output polarization states in each core within a multicore fiber bundle. The technique uses a stress-engineered optical element together with an analyzer and provides a point spread function whose shape unambiguously reveals the polarization state of a point source. We implement this technique to monitor, simultaneously and in real time, the output polarization states of up to 180 single-mode fiber cores in both conventional and polarization-maintaining bundles. We also demonstrate that the technique can be used to fully characterize the polarization properties of each individual fiber core, including eigen-polarization states, phase delay and diattenuation.

  17. Nonlinear Light Dynamics in Multi-Core Structures

    Science.gov (United States)

    2017-02-27

    …in space and time that can be generated in continuous-discrete optical media such as multi-core optical fiber or waveguide arrays; localisation dynamics in a continuous… gives another practical possibility to localize and control light both in space and time. The combination of these two features leads to a rich variety

  18. New Generation Processor Architecture Research

    Institute of Scientific and Technical Information of China (English)

    Chen Hongsong(陈红松); Hu Mingzeng; Ji Zhenzhou

    2003-01-01

    With the rapid development of microelectronics and hardware, ever faster microprocessors and new architectures must continue to be developed to meet tomorrow's computing needs. New processor microarchitectures are needed to push performance further and to use higher transistor counts effectively. At the same time, depending on the intended use, processors are optimized in different respects, such as high performance, low power consumption, small chip area and high security. SOC (System on Chip) and SCMP (Single Chip Multi-Processor) constitute the main processor system architectures.

  19. Stereoscopic Optical Signal Processor

    Science.gov (United States)

    Graig, Glenn D.

    1988-01-01

    The optical signal processor produces a two-dimensional cross correlation of images from a stereoscopic video camera in real time. The cross correlation is used to identify an object, determine distance, or measure movement. Left and right cameras modulate beams from a light source for correlation in a video detector. A switch in position 1 produces information about the range of the object viewed by the cameras. Position 2 gives information about movement. Position 3 helps to identify the object.

  20. Fast and Accurate Semiautomatic Segmentation of Individual Teeth from Dental CT Images.

    Science.gov (United States)

    Kang, Ho Chul; Choi, Chankyu; Shin, Juneseuk; Lee, Jeongjin; Shin, Yeong-Gil

    2015-01-01

    In this paper, we propose a fast and accurate semiautomatic method to effectively distinguish individual teeth from the tooth sockets in dental CT images. Threshold parameter values and tooth shapes are propagated to the neighboring slice, based on the teeth separated in reference images. After the propagation of threshold values and tooth shapes, the histogram of the current slice is analyzed. The individual teeth are automatically separated and segmented using seeded region growing. The newly generated separation information is then iteratively propagated to the neighboring slice. Our method was validated on ten sets of dental CT scans, and the results were compared with manually segmented results and conventional methods. The average error in absolute volume measurement was 2.29 ± 0.56%, which was more accurate than conventional methods. With multicore processors, the method ran 2.4 times faster than on a single-core processor. The proposed method identified individual teeth accurately, demonstrating that it can give dentists substantial assistance during dental surgery.
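The seeded region growing step this method relies on can be sketched as follows. This is a simplified 2-D, 4-connectivity version with a single fixed intensity threshold; the paper itself propagates thresholds and tooth shapes between CT slices, which this sketch does not model:

```python
# Minimal seeded region growing on a 2-D intensity grid (illustrative only).
from collections import deque

def region_grow(image, seed, threshold):
    """Grow a region from `seed`, absorbing 4-connected neighbors whose
    intensity is within `threshold` of the seed intensity."""
    h, w = len(image), len(image[0])
    base = image[seed[0]][seed[1]]
    region, queue = {seed}, deque([seed])
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if (0 <= ny < h and 0 <= nx < w and (ny, nx) not in region
                    and abs(image[ny][nx] - base) <= threshold):
                region.add((ny, nx))
                queue.append((ny, nx))
    return region

# Toy "slice": bright tooth-like blob in the top-left corner.
img = [[9, 9, 1],
       [9, 8, 1],
       [1, 1, 1]]
tooth = region_grow(img, seed=(0, 0), threshold=2)
```

Here the grown region captures the four bright pixels and excludes the dark background, which is the behavior the per-slice segmentation step depends on.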

  1. Fast and Accurate Semiautomatic Segmentation of Individual Teeth from Dental CT Images

    Directory of Open Access Journals (Sweden)

    Ho Chul Kang

    2015-01-01

    Full Text Available In this paper, we propose a fast and accurate semiautomatic method to effectively distinguish individual teeth from the tooth sockets in dental CT images. Threshold parameter values and tooth shapes are propagated to the neighboring slice, based on the teeth separated in reference images. After the propagation of threshold values and tooth shapes, the histogram of the current slice is analyzed. The individual teeth are automatically separated and segmented using seeded region growing. The newly generated separation information is then iteratively propagated to the neighboring slice. Our method was validated on ten sets of dental CT scans, and the results were compared with manually segmented results and conventional methods. The average error in absolute volume measurement was 2.29±0.56%, which was more accurate than conventional methods. With multicore processors, the method ran 2.4 times faster than on a single-core processor. The proposed method identified individual teeth accurately, demonstrating that it can give dentists substantial assistance during dental surgery.

  2. Mahali: Space Weather Monitoring Using Multicore Mobile Devices

    Science.gov (United States)

    Pankratius, V.; Lind, F. D.; Coster, A. J.; Erickson, P. J.; Semeter, J. L.

    2013-12-01

    Analysis of Total Electron Content (TEC) measurements derived from Global Positioning System (GPS) signals has led to revolutionary new data products for space weather monitoring and ionospheric research. However, the current sensor network is sparse, especially over the oceans and in regions like Africa and Siberia, and the full potential of dense, global, real-time TEC monitoring remains to be realized. The Mahali project will prototype a revolutionary architecture that uses mobile devices, such as phones and tablets, to form a global space weather monitoring network. Mahali exploits the existing GPS infrastructure - more specifically, delays in multi-frequency GPS signals observed at the ground - to acquire a vast set of global TEC projections, with the goal of imaging multi-scale variability in the global ionosphere at unprecedented spatial and temporal resolution. With connectivity available worldwide, mobile devices are excellent candidates to establish crowd sourced global relays that feed multi-frequency GPS sensor data into a cloud processing environment. Once the data is within the cloud, it is relatively straightforward to reconstruct the structure of the space environment, and its dynamic changes. This vision is made possible owing to advances in multicore technology that have transformed mobile devices into parallel computers with several processors on a chip. For example, local data can be pre-processed, validated with other sensors nearby, and aggregated when transmission is temporarily unavailable. Intelligent devices can also autonomously decide the most practical way of transmitting data with in any given context, e.g., over cell networks or Wifi, depending on availability, bandwidth, cost, energy usage, and other constraints. In the long run, Mahali facilitates data collection from remote locations such as deserts or on oceans. For example, mobile devices on ships could collect time-tagged measurements that are transmitted at a later point in

  3. Architecture-level performance/power tradeoff in network processor design

    Institute of Scientific and Technical Information of China (English)

    CHEN Hong-song; JI Zhen-zhou; HU Ming-zeng

    2007-01-01

    Network processors are used in core network nodes to flexibly process packet streams. As performance increases, the power consumption of network processors rises rapidly, and power and cooling become a bottleneck. Architecture-level power-conscious design must go beyond low-level circuit design; architectural power and performance tradeoffs should be considered together. Simulation is an efficient method for designing a modern network processor before fabricating the chip. To achieve a tradeoff between performance and power, a processor simulator is used to design the network processor architecture. Using the NetBench and CommBench benchmarks and the SimpleScalar processor simulator, the performance and power of a network processor are quantitatively evaluated. A new tradeoff evaluation metric is proposed to analyze network processor architecture. Based on the high-performance Intel IXP 2800 network processor configuration, optimized values for instruction fetch width and speed, instruction issue width, and instruction window size are analyzed and selected. Simulation results show that the tradeoff design method makes the use of the network processor more effective. The optimal key parameters of a network processor are important in architecture-level design and are meaningful for next-generation network processor design.

  4. LASIP-III, a generalized processor for standard interface files. [For creating binary files from BCD input data and printing binary file data in BCD format (devised for fast reactor physics codes)

    Energy Technology Data Exchange (ETDEWEB)

    Bosler, G.E.; O'Dell, R.D.; Resnik, W.M.

    1976-03-01

    The LASIP-III code was developed for processing Version III standard interface data files which have been specified by the Committee on Computer Code Coordination. This processor performs two distinct tasks, namely, transforming free-field format, BCD data into well-defined binary files and providing for printing and punching data in the binary files. While LASIP-III is exported as a complete free-standing code package, techniques are described for easily separating the processor into two modules, viz., one for creating the binary files and one for printing the files. The two modules can be separated into free-standing codes or they can be incorporated into other codes. Also, the LASIP-III code can be easily expanded for processing additional files, and procedures are described for such an expansion. 2 figures, 8 tables.

  5. Multi-core processing and scheduling performance in CMS

    Science.gov (United States)

    Hernández, J. M.; Evans, D.; Foulkes, S.

    2012-12-01

    Commodity hardware is going many-core. We might soon be unable to satisfy the per-core job memory needs in the current single-core processing model in High Energy Physics. In addition, an ever increasing number of independent and incoherent jobs running on the same physical hardware without sharing resources might significantly affect processing performance. It will be essential to utilize the multi-core architecture effectively. CMS has incorporated support for multi-core processing in the event processing framework and the workload management system. Multi-core processing jobs share common data in memory, such as the code libraries, detector geometry and conditions data, resulting in much lower memory usage than standard single-core independent jobs. Exploiting this new processing model requires a new model of computing resource allocation, departing from the standard single-core allocation for a job. The experiment job management system needs control over a larger quantum of resources, since multi-core aware jobs require the scheduling of multiple cores simultaneously. CMS is exploring the approach of using whole nodes as the unit in the workload management system, where all cores of a node are allocated to a multi-core job. Whole-node scheduling allows for optimization of data/workflow management (e.g. I/O caching, local merging), but efficient utilization of all scheduled cores is challenging. Dedicated whole-node queues have been set up at all Tier-1 centers for exploring multi-core processing workflows in CMS. We present an evaluation of the performance of scheduling and executing multi-core workflows in whole-node queues compared to the standard single-core processing workflows.

  6. Efficient provisioning for multi-core applications with LSF

    Science.gov (United States)

    Dal Pra, Stefano

    2015-12-01

    Tier-1 sites providing computing power for HEP experiments are usually tightly designed for high-throughput performance. This is pursued by reducing the variety of supported use cases and tuning for performance the most important ones, which so far have been single-core jobs. Moreover, the usual workload is saturation: each available core in the farm is in use and queued jobs wait for their turn to run. Enabling multi-core jobs thus requires dedicating a number of hosts on which to run them, and waiting for them to free the needed number of cores. This drain time introduces a loss of computing power driven by the number of unusable empty cores. As demand for multi-core capable resources has grown, a Task Force has been constituted in WLCG with the goal of defining a simple and efficient multi-core resource provisioning model. This paper details the work done at the INFN Tier-1 to enable multi-core support in the LSF batch system, with the intent of minimizing the average number of unused cores. The adopted strategy is to dedicate to multi-core jobs a dynamic set of nodes, whose size is mainly driven by the number of pending multi-core requests and the fair-share priority of the submitting user. The node status transition, from single- to multi-core and vice versa, is driven by a finite state machine implemented in a custom multi-core director script running in the cluster. After describing and motivating both the implementation and the details specific to the LSF batch system, performance results are reported. Factors with positive and negative impact on overall efficiency are discussed, and solutions to minimize the negative ones are proposed.
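The finite state machine driving such node transitions can be illustrated with a toy model. The state names and transition rules below are assumptions for illustration, not the INFN multi-core director implementation:

```python
# Toy FSM for switching a batch node between single-core and multi-core duty.
# A node first drains its running single-core jobs, then serves multi-core
# work, and returns to the single-core pool when multi-core demand vanishes.

class NodeFSM:
    def __init__(self):
        self.state = "single"

    def step(self, pending_mcore_jobs, running_score_jobs):
        if self.state == "single" and pending_mcore_jobs > 0:
            self.state = "draining"       # stop accepting new single-core jobs
        elif self.state == "draining" and running_score_jobs == 0:
            self.state = "multicore"      # whole node free: run multi-core work
        elif self.state == "multicore" and pending_mcore_jobs == 0:
            self.state = "single"         # demand gone: return node to the pool
        return self.state

node = NodeFSM()
```

Driving the FSM with a sequence of (pending multi-core, running single-core) observations walks the node through drain, multi-core service, and back, which is the cycle whose drain-time cost the paper tries to minimize.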

  7. 基于Octeon处理器三层转发的功能实现%Function Implementation of Layer 3 forwarding strategy based on the Octeon processor

    Institute of Scientific and Technical Information of China (English)

    孙倩

    2013-01-01

    To meet the demands that ever-increasing wireless network bandwidth places on data forwarding performance in a WLAN AC (access controller) system, the Octeon multi-core network processor is studied, and a processing architecture that separates network traffic between cores along data-plane and control-plane lines is proposed. The control plane generates the forwarding table and synchronizes it to the data plane. The data plane implements a fast IP address lookup method based on an optimized LC-Trie (level-compressed trie), effectively improving Layer 3 forwarding performance.
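The data-plane lookup rests on longest-prefix matching over a trie; a plain binary-trie sketch conveys the idea (an LC-Trie adds level and path compression on top of this structure; the prefixes and next-hop names below are illustrative):

```python
# Longest-prefix match with an uncompressed binary trie (illustrative sketch).
# Prefixes are given as bit strings; each node is a dict with optional
# children "0"/"1" and an optional "hop" entry marking a stored prefix.

class PrefixTrie:
    def __init__(self):
        self.root = {}

    def insert(self, prefix_bits, next_hop):
        node = self.root
        for b in prefix_bits:
            node = node.setdefault(b, {})
        node["hop"] = next_hop

    def lookup(self, addr_bits):
        node, best = self.root, None
        for b in addr_bits:
            if "hop" in node:
                best = node["hop"]        # remember longest match so far
            if b not in node:
                break
            node = node[b]
        else:
            best = node.get("hop", best)  # address fully consumed
        return best

fib = PrefixTrie()
fib.insert("10", "if0")       # shorter covering prefix
fib.insert("1010", "if1")     # longer, more specific prefix
```

A lookup walks at most one bit per level and falls back to the last covering prefix seen, which is exactly the behavior level compression accelerates.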

  8. Garbage Collection for Multicore NUMA Machines

    CERN Document Server

    Auhagen, Sven; Fluet, Matthew; Reppy, John

    2011-01-01

    Modern high-end machines feature multiple processor packages, each of which contains multiple independent cores and integrated memory controllers connected directly to dedicated physical RAM. These packages are connected via a shared bus, creating a system with a heterogeneous memory hierarchy. Since this shared bus has less bandwidth than the sum of the links to memory, aggregate memory bandwidth is higher when parallel threads all access memory local to their processor package than when they access memory attached to a remote package. This bandwidth limitation has traditionally limited the scalability of modern functional language implementations, which seldom scale well past 8 cores, even on small benchmarks. This work presents a garbage collector integrated with our strict, parallel functional language implementation, Manticore, and shows that it scales effectively on both a 48-core AMD Opteron machine and a 32-core Intel Xeon machine.

  9. Breadboard Signal Processor for Arraying DSN Antennas

    Science.gov (United States)

    Jongeling, Andre; Sigman, Elliott; Chandra, Kumar; Trinh, Joseph; Soriano, Melissa; Navarro, Robert; Rogstad, Stephen; Goodhart, Charles; Proctor, Robert; Jourdan, Michael

    2008-01-01

    A recently developed breadboard version of an advanced signal processor for arraying many antennas in NASA's Deep Space Network (DSN) can accept inputs in a 500-MHz-wide frequency band from six antennas. The next breadboard version is expected to accept inputs from 16 antennas, and a following developed version is expected to be designed according to an architecture that will be scalable to accept inputs from as many as 400 antennas. These and similar signal processors could also be used for combining multiple wide-band signals in non-DSN applications, including very-long-baseline interferometry and telecommunications. This signal processor performs the functions of a wide-band FX correlator and a beam-forming signal combiner. [The term "FX" signifies that the digital samples of two given signals are fast Fourier transformed (F), then the fast Fourier transforms of the two signals are multiplied (X) prior to accumulation.] In this processor, the signals from the various antennas are broken up into channels in the frequency domain (see figure). In each frequency channel, the data from each antenna are correlated against the data from each other antenna; this is done for all antenna baselines (that is, for all antenna pairs). The results of the correlations are used to obtain calibration data to align the antenna signals in both phase and delay. Data from the various antenna frequency channels are also combined and calibration corrections are applied. The frequency-domain data thus combined are then synthesized back to the time domain for passing on to a telemetry receiver.

  10. Distributed processor allocation for launching applications in a massively connected processors complex

    Science.gov (United States)

    Pedretti, Kevin

    2008-11-18

    A compute processor allocator architecture for allocating compute processors to run applications in a multiple processor computing apparatus is distributed among a subset of processors within the computing apparatus. Each processor of the subset includes a compute processor allocator. The compute processor allocators can share a common database of information pertinent to compute processor allocation. A communication path permits retrieval of information from the database independently of the compute processor allocators.

  11. muBLASTP: database-indexed protein sequence search on multicore CPUs.

    Science.gov (United States)

    Zhang, Jing; Misra, Sanchit; Wang, Hao; Feng, Wu-Chun

    2016-11-04

    The Basic Local Alignment Search Tool (BLAST) is a fundamental program in the life sciences that searches databases for sequences that are most similar to a query sequence. Currently, the BLAST algorithm uses a query-indexed approach. Although many approaches suggest that sequence search with a database index can achieve much higher throughput (e.g., BLAT, SSAHA, and CAFE), they cannot deliver the same level of sensitivity as query-indexed BLAST, i.e., NCBI BLAST, or they can only support nucleotide sequence search, e.g., MegaBLAST. Because of the different challenges and characteristics of query indexing and database indexing, existing techniques for query-indexed search cannot be applied directly to database-indexed search. muBLASTP, a novel database-indexed BLAST for protein sequence search, returns hits identical to those of NCBI BLAST. On Intel Haswell multicore CPUs, for a single query, single-threaded muBLASTP achieves up to a 4.41-fold speedup for the alignment stages, and up to a 1.75-fold end-to-end speedup over single-threaded NCBI BLAST. For a batch of queries, multithreaded muBLASTP achieves up to a 5.7-fold speedup for the alignment stages, and up to a 4.56-fold end-to-end speedup over multithreaded NCBI BLAST. With a newly designed index structure for the protein database and associated optimizations in the BLASTP algorithm, we re-factored BLASTP for modern multicore processors, achieving much higher throughput with an acceptable memory footprint for the database index.
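The core idea of database indexing can be sketched with a toy k-mer index: the database is scanned once to map every k-length word to its positions, after which any query finds its seed hits by table lookup instead of rescanning the database. The word size and sequences below are illustrative, and muBLASTP's actual index and alignment stages are far more elaborate:

```python
# Toy database index for seed finding (illustrative sketch, not muBLASTP).

def build_index(db_seqs, k=3):
    """Map each k-mer in the database to a list of (seq_id, position)."""
    index = {}
    for sid, seq in enumerate(db_seqs):
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], []).append((sid, i))
    return index

def seed_hits(index, query, k=3):
    """Return (seq_id, query_pos, db_pos) for every exact k-mer match."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for sid, dpos in index.get(query[qpos:qpos + k], []):
            hits.append((sid, qpos, dpos))
    return hits

db = ["MKVLAW", "GGMKVL"]     # illustrative "protein" sequences
idx = build_index(db)
hits = seed_hits(idx, "AMKVG")
```

The index is built once and amortized over all queries, which is where the batch-query throughput gains of database indexing come from.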

  12. A Systolic Array RLS Processor

    OpenAIRE

    Asai, T.; Matsumoto, T.

    2000-01-01

    This paper presents the outline of the systolic array recursive least-squares (RLS) processor prototyped primarily with the aim of broadband mobile communication applications. To execute the RLS algorithm effectively, this processor uses an orthogonal triangularization technique known in matrix algebra as QR decomposition for parallel pipelined processing. The processor board comprises 19 application-specific integrated circuit chips, each with approximately one million gates. Thirty-two bit ...

  13. AMD's 64-bit Opteron processor

    CERN Document Server

    CERN. Geneva

    2003-01-01

    This talk concentrates on issues that relate to obtaining peak performance from the Opteron processor. Compiler options, memory layout, MPI issues in multi-processor configurations and the use of a NUMA kernel will be covered. A discussion of recent benchmarking projects and results will also be included. Biographies: David Rich directs AMD's efforts in high performance computing and also in the use of Opteron processors...

  14. Emerging Trends in Embedded Processors

    Directory of Open Access Journals (Sweden)

    Gurvinder Singh

    2014-05-01

    Full Text Available An embedded processor is simply a microprocessor that has been "embedded" into a device. Embedded systems are an important part of human life. For illustration, one cannot visualize life without a mobile phone for personal communication. Embedded systems are used in many places, such as healthcare, automotive, daily life, and different offices and industries. Embedded processors open a new research area in the field of hardware design.

  15. High core count single-mode multicore fiber for dense space division multiplexing

    DEFF Research Database (Denmark)

    Aikawa, K.; Sasaki, Y.; Amma, Y.;

    2016-01-01

    Multicore fibers and few-mode fibers have the potential to realize dense space-division multiplexing systems. Several dense space-division multiplexing transmission experiments over multicore fibers and few-mode fibers have been demonstrated so far. Multicore fibers, including recent resul...

  16. Effective particle magnetic moment of multi-core particles

    Energy Technology Data Exchange (ETDEWEB)

    Ahrentorp, Fredrik; Astalan, Andrea; Blomgren, Jakob; Jonasson, Christian [Acreo Swedish ICT AB, Arvid Hedvalls backe 4, SE-411 33 Göteborg (Sweden); Wetterskog, Erik; Svedlindh, Peter [Department of Engineering Sciences, Uppsala University, Box 534, SE-751 21 Uppsala (Sweden); Lak, Aidin; Ludwig, Frank [Institute of Electrical Measurement and Fundamental Electrical Engineering, TU Braunschweig, D‐38106 Braunschweig Germany (Germany); IJzendoorn, Leo J. van [Department of Applied Physics, Eindhoven University of Technology, 5600 MB Eindhoven (Netherlands); Westphal, Fritz; Grüttner, Cordula [Micromod Partikeltechnologie GmbH, D ‐18119 Rostock (Germany); Gehrke, Nicole [nanoPET Pharma GmbH, D ‐10115 Berlin Germany (Germany); Gustafsson, Stefan; Olsson, Eva [Department of Applied Physics, Chalmers University of Technology, SE-412 96 Göteborg (Sweden); Johansson, Christer, E-mail: christer.johansson@acreo.se [Acreo Swedish ICT AB, Arvid Hedvalls backe 4, SE-411 33 Göteborg (Sweden)

    2015-04-15

    In this study we investigate the magnetic behavior of magnetic multi-core particles and the differences in the magnetic properties of multi-core and single-core nanoparticles and correlate the results with the nanostructure of the different particles as determined from transmission electron microscopy (TEM). We also investigate how the effective particle magnetic moment is coupled to the individual moments of the single-domain nanocrystals by using different measurement techniques: DC magnetometry, AC susceptometry, dynamic light scattering and TEM. We have studied two magnetic multi-core particle systems – BNF Starch from Micromod with a median particle diameter of 100 nm and FeraSpin R from nanoPET with a median particle diameter of 70 nm – and one single-core particle system – SHP25 from Ocean NanoTech with a median particle core diameter of 25 nm.

  17. Performance implications from sizing a VM on multi-core systems: A data analytic application's view

    Energy Technology Data Exchange (ETDEWEB)

    Lim, Seung-Hwan [ORNL; Horey, James L [ORNL; Begoli, Edmon [ORNL; Yao, Yanjun [University of Tennessee, Knoxville (UTK); Cao, Qing [University of Tennessee, Knoxville (UTK)

    2013-01-01

    In this paper, we present a quantitative performance analysis of data analytics applications running on multi-core virtual machines. Such environments form the core of cloud computing. In addition, data analytics applications, such as Cassandra and Hadoop, are becoming increasingly popular on cloud computing platforms. This convergence necessitates a better understanding of the performance and cost implications of such hybrid systems. For example, the very first step in hosting applications in virtualized environments requires the user to configure the number of virtual processors and the size of memory. To understand the performance implications of this step, we benchmarked three Yahoo Cloud Serving Benchmark (YCSB) workloads in a virtualized multi-core environment. Our measurements indicate that the performance of Cassandra on YCSB workloads does not depend heavily on the processing capacity of a system, while the size of the data set relative to allocated memory is critical to performance. We also identified a strong relationship between the running time of workloads and various hardware events (last-level cache loads, misses, and CPU migrations). From this analysis, we provide several suggestions to improve the performance of data analytics applications running in cloud computing environments.

  18. ParaStream: A parallel streaming Delaunay triangulation algorithm for LiDAR points on multicore architectures

    Science.gov (United States)

    Wu, Huayi; Guan, Xuefeng; Gong, Jianya

    2011-09-01

    This paper presents a robust parallel Delaunay triangulation algorithm called ParaStream for processing billions of points from nonoverlapped block LiDAR files. The algorithm targets ubiquitous multicore architectures. ParaStream integrates streaming computation with a traditional divide-and-conquer scheme, in which additional erase steps are implemented to reduce the runtime memory footprint. Furthermore, a kd-tree-based dynamic schedule strategy is also proposed to distribute triangulation and merging work onto the processor cores for improved load balance. ParaStream exploits most of the computing power of multicore platforms through parallel computing, demonstrating qualities of high data throughput as well as a low memory footprint. Experiments on a 2-Way-Quad-Core Intel Xeon platform show that ParaStream can triangulate approximately one billion LiDAR points (16.4 GB) in about 16 min with only 600 MB physical memory. The total speedup (including I/O time) is about 6.62 with 8 concurrent threads.
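The kd-tree-based schedule strategy, splitting the point set at the median along alternating axes so that each core receives a spatially compact and equally sized block, can be sketched as follows (a simplified illustration of the partitioning idea only, not ParaStream's triangulation or merging):

```python
# Recursive median split along alternating axes: one block per worker,
# each block spatially compact and of near-equal size (illustrative sketch).

def kd_partition(points, n_blocks, axis=0):
    """Split `points` (list of (x, y) tuples) into `n_blocks` blocks."""
    if n_blocks <= 1 or len(points) <= 1:
        return [points]
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    nxt = 1 - axis                       # alternate the split axis
    return (kd_partition(pts[:mid], n_blocks // 2, nxt) +
            kd_partition(pts[mid:], n_blocks - n_blocks // 2, nxt))

pts = [(x, y) for x in range(4) for y in range(4)]   # 16 sample points
blocks = kd_partition(pts, 4)                        # e.g. one block per core
```

Each block could then be triangulated by a separate thread, with merging along the shared split boundaries, which is the load-balancing role the kd-tree plays in the paper.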

  19. GoCxx: a tool to easily leverage C++ legacy code for multicore-friendly Go libraries and frameworks

    Science.gov (United States)

    Binet, Sébastien

    2012-12-01

    Current HENP libraries and frameworks were written before multicore systems became widely deployed and used. From this environment, a ‘single-thread’ processing model naturally emerged, but the implicit assumptions it encouraged are greatly impairing our ability to scale in a multicore/manycore world. Writing scalable code in C++ for multicore architectures, while doable, is no panacea. Sure, C++11 will improve on the current situation (by standardizing on std::thread, introducing lambda functions and defining a memory model), but it will do so at the price of further complicating an already quite sophisticated language. This level of sophistication has probably already strongly motivated analysis groups to migrate to CPython, hoping for its current limitations with respect to multicore scalability to be either lifted (Global Interpreter Lock removal) or overcome by the advent of a new Python VM better tailored for this kind of environment (PyPy, Jython, …). Could HENP migrate to a language with none of the deficiencies of C++ (build time, deployment, low-level tools for concurrency) and with the fast turn-around time, simplicity and ease of coding of Python? This paper will try to make the case for Go - a young open-source language with built-in facilities to easily express and expose concurrency - being such a language. We introduce GoCxx, a tool leveraging gcc-xml's output to automate the tedious work of creating Go wrappers for foreign languages, a critical task for any language wishing to leverage legacy and field-tested code. We conclude with the first results of applying GoCxx to real C++ code.

  20. Automatic Functionality Assignment to AUTOSAR Multicore Distributed Architectures

    DEFF Research Database (Denmark)

    Maticu, Florin; Pop, Paul; Axbrink, Christian

    2016-01-01

    of better performance, cost, size, fault-tolerance and power consumption. In this paper we present an approach for the automatic software functionality assignment to multicore distributed architectures. We consider that the systems use the AUTomotive Open System ARchitecture (AUTOSAR). The functionality...... is modeled as a set of software components composed of subtasks, called runnables, in AUTOSAR terminology. We have proposed a Simulated Annealing metaheuristic optimization that decides: the (i) mapping of software components to multicore ECUs, (ii) the assignment of runnables to the ECU cores, (iii...

  1. Heterogeneous Multicore Parallel Programming for Graphics Processing Units

    Directory of Open Access Journals (Sweden)

    Francois Bodin

    2009-01-01

    Full Text Available Hybrid parallel multicore architectures based on graphics processing units (GPUs) can provide tremendous computing power. Current NVIDIA and AMD Graphics Product Group hardware display a peak performance of hundreds of gigaflops. However, exploiting GPUs from existing applications is a difficult task that requires non-portable rewriting of the code. In this paper, we present HMPP, a Heterogeneous Multicore Parallel Programming workbench with compilers, developed by CAPS entreprise, that allows the integration of heterogeneous hardware accelerators in an unintrusive manner while preserving legacy code.

  2. Fast user-level inter-thread communication, synchronisation

    OpenAIRE

    Kevin J. Vella; Lin, Li; 1st Workshop in Information and Communication Technology (WICT 2008)

    2008-01-01

    This project concerns the design and implementation of user-level inter-thread synchronisation and communication algorithms. A number of these algorithms have been implemented on the SMASH user-level thread scheduler for symmetric multiprocessors and multicore processors. All inter-thread communication primitives considered have two implementations: the lock-based implementation and the lock-free implementation. The performance of concurrent programs using these user-level primitives are meas...
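The lock-based style of primitive this project compares can be sketched in Python; this is a hypothetical illustration of a bounded inter-thread channel, not the SMASH scheduler's code, and the lock-free variants (which rely on atomic compare-and-swap instructions) cannot be expressed in pure Python.

```python
import threading
from collections import deque

class Channel:
    """A minimal lock-based inter-thread communication channel:
    a bounded buffer guarded by one lock and two condition variables."""
    def __init__(self, capacity=16):
        self._buf = deque()
        self._cap = capacity
        self._lock = threading.Lock()
        self._not_empty = threading.Condition(self._lock)
        self._not_full = threading.Condition(self._lock)

    def send(self, item):
        with self._not_full:
            while len(self._buf) >= self._cap:   # block while buffer is full
                self._not_full.wait()
            self._buf.append(item)
            self._not_empty.notify()             # wake one waiting receiver

    def recv(self):
        with self._not_empty:
            while not self._buf:                 # block while buffer is empty
                self._not_empty.wait()
            item = self._buf.popleft()
            self._not_full.notify()              # wake one waiting sender
            return item
```

A lock-free version would replace the lock/condition pair with atomic operations on the buffer indices; measuring the two against each other is exactly the kind of comparison the abstract describes.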

  3. Design of Fast Charging for Lithium Batteries Based on an STM32 Processor

    Institute of Scientific and Technical Information of China (English)

    张洪涛; 彭潇丽

    2012-01-01

    This paper proposes a design in which an STM32 processor-based intelligent management system and a PFC (power factor correction) charging circuit are used to charge lithium-ion batteries. The MATLAB dynamic simulation tool was used to simulate the PFC control technique, and the simulation achieved the expected results. C-language programming was used to implement the intelligent management functions of the STM32 processor. The results show that the design is feasible.

  4. Embedded Processor Oriented Compiler Infrastructure

    Directory of Open Access Journals (Sweden)

    DJUKIC, M.

    2014-08-01

    Full Text Available In recent years, research on special compiler techniques and algorithms for embedded processors has broadened the knowledge of how to achieve better compiler performance for irregular processor architectures. However, industrial-strength compilers, besides the ability to generate efficient code, must also be robust, understandable, maintainable, and extensible. This raises the need for a compiler infrastructure that provides means for convenient implementation of embedded-processor-oriented compiler techniques. The Cirrus Logic Coyote 32 DSP is an example that shows how traditional compiler infrastructure is not able to cope with the problem. That is why a new compiler infrastructure was developed for this processor, based on research in the field of embedded system software tools and experience in the development of industrial-strength compilers. The new infrastructure is described in this paper. Compiler-generated code quality is compared with code generated by the previous compiler for the same processor architecture.

  5. A Reconfigurable Arithmetic Processor

    Science.gov (United States)

    1989-04-01

    FRIEDRICH HEGEL, Philosophy of History (1832). 1.1 The I/O Bandwidth Problem: The problem in building fast arithmetic chips is not building fast arithmetic... [remainder of the scanned abstract is illegible OCR residue]

  6. Communication Efficient Multi-processor FFT

    Science.gov (United States)

    Lennart Johnsson, S.; Jacquemin, Michel; Krawitz, Robert L.

    1992-10-01

    Computing the fast Fourier transform on a distributed memory architecture by a direct pipelined radix-2, a bi-section, or a multisection algorithm, all yield the same communications requirement, if communication for all FFT stages can be performed concurrently, the input data is in normal order, and the data allocation is consecutive. With a cyclic data allocation, or bit-reversed input data and a consecutive allocation, multi-sectioning offers a reduced communications requirement by approximately a factor of two. For a consecutive data allocation and normal input order, a decimation-in-time FFT requires that P/N + d - 2 twiddle factors be stored for P elements distributed evenly over N processors, with the axis that is subject to transformation distributed over 2^d processors. No communication of twiddle factors is required. The same storage requirements hold for a decimation-in-frequency FFT, bit-reversed input order, and consecutive data allocation. The opposite combination of FFT type and data ordering requires a factor of log2 N more storage for N processors. The peak performance for a Connection Machine system CM-200 implementation is 12.9 Gflops/s in 32-bit precision, and 10.7 Gflops/s in 64-bit precision for unordered transforms local to each processor. The corresponding execution rates for ordered transforms are 11.1 Gflops/s and 8.5 Gflops/s, respectively. For distributed one- and two-dimensional transforms the peak performance for unordered transforms exceeds 5 Gflops/s in 32-bit precision and 3 Gflops/s in 64-bit precision. Three-dimensional transforms execute at a slightly lower rate. Distributed ordered transforms execute at a rate of about 1/2 to 2/3 of the unordered transforms.

  7. Parallel processing architecture for H.264 deblocking filter on multi-core platforms

    Science.gov (United States)

    Prasad, Durga P.; Sonachalam, Sekar; Kunchamwar, Mangesh K.; Gunupudi, Nageswara Rao

    2012-03-01

    Massively parallel computing (multi-core) chips offer outstanding new solutions that satisfy the increasing demand for high-resolution, high-quality video compression technologies such as H.264. Such solutions provide not only exceptional quality but also efficiency, low power, and low latency, previously unattainable in software-based designs. While custom hardware and Application Specific Integrated Circuit (ASIC) technologies may achieve low-latency, low-power, real-time performance in some consumer devices, many applications require a flexible and scalable software-defined solution. The deblocking filter in an H.264 encoder/decoder poses difficult implementation challenges because of heavy data dependencies and the conditional nature of the computations. Deblocking filter implementations tend to be fixed and difficult to reconfigure for different needs. The ability to scale up for higher quality requirements, such as 10-bit pixel depth or a 4:2:2 chroma format, often reduces the throughput of a parallel architecture designed for a lower feature set. A scalable architecture for deblocking filtering, created with a massively parallel processor based solution, means that the same encoder or decoder can be deployed in a variety of applications, at different video resolutions, for different power requirements, and at higher bit depths and richer chroma subsampling patterns such as YUV 4:2:2 or 4:4:4. Low-power, software-defined encoders/decoders may be implemented using a massively parallel processor array, like that found in HyperX technology, with 100 or more cores and distributed memory. The large number of processor elements allows the silicon device to operate more efficiently than conventional DSP or CPU technology. This software programming model for massively parallel processors offers a flexible implementation and a power efficiency close to that of ASIC solutions.
This work describes a scalable parallel architecture for an H.264 compliant deblocking

  8. A programmable systolic trigger processor for FERA-bus data

    Science.gov (United States)

    Appelquist, G.; Hovander, B.; Sellden, B.; Bohm, C.

    1992-09-01

    A generic CAMAC-based trigger processor module for fast processing of large amounts of Analog-to-Digital Converter (ADC) data was designed. This module was realized using complex programmable gate arrays. The gate arrays were connected to memories and multipliers in such a way that different gate array configurations can cover a wide range of module applications. Using this module, it is possible to construct complex trigger processors. The module uses both the fast ECL FERA bus and the CAMAC bus for inputs and outputs. The latter is used for setup and control but may also be used for data output. Large numbers of ADCs can be served by a hierarchical arrangement of trigger processor modules which process ADC data with pipeline arithmetic and produce the final result at the apex of the pyramid. The trigger decision is transmitted to the data acquisition system via a logic signal, while numeric results may be extracted by the CAMAC controller. The trigger processor was developed for the proposed neutral particle search and was designed to serve as a second-level trigger processor. It was required to correct all ADC raw data for efficiency and pedestal, calculate the total calorimeter energy, obtain the optimal time-of-flight data, and calculate the particle mass. A suitable mass cut would then deliver the trigger decision.

  9. Co-processor implementation for fast face detection in a system-on-chip

    Institute of Scientific and Technical Information of China (English)

    焦继业; 穆荣; 郝跃

    2011-01-01

    An improved co-processor architecture suitable for parallel hardware implementation is proposed to perform multi-stage feature classification based on the AdaBoost algorithm. The co-processor consists of a fast integral-image access module, a module for computing the Haar-like feature values of the classifier stages, a DMA data-transfer module, and a co-processor interface module. The modules use pipelining and FIFO buffers to process data in parallel, accelerating the iterative face-detection process in a face-detection SoC. Embedding the co-processor in a real face-detection SoC adds only a small amount of area while significantly improving the speed of face detection. The face-detection SoC was verified on a CYCLONE-II EP2C70 FPGA. Experimental results show that, at a system operating frequency of 70 MHz, faces can be detected in color QVGA camera video at 10 frames per second.
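The constant-time Haar feature evaluation that the integral-image module accelerates can be sketched as follows; this is a generic illustration of the summed-area-table technique in software, not the SoC's hardware design.

```python
def integral_image(img):
    """Summed-area table: ii[y][x] = sum of img[0..y-1][0..x-1],
    padded with an extra zero row/column so lookups need no bounds checks."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum over the w x h rectangle at (x, y) using four table lookups --
    the constant-time primitive behind Haar-like feature evaluation."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]

def haar_two_rect(ii, x, y, w, h):
    """A two-rectangle Haar-like feature: left half minus right half."""
    return rect_sum(ii, x, y, w // 2, h) - rect_sum(ii, x + w // 2, y, w // 2, h)
```

Because every feature reduces to a handful of table reads, the classifier stages pipeline naturally, which is what the FIFO-buffered modules in the SoC exploit.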

  10. Automatic generation of application specific FPGA multicore accelerators

    DEFF Research Database (Denmark)

    Hindborg, Andreas Erik; Schleuniger, Pascal; Jensen, Nicklas Bo

    2014-01-01

    High performance computing systems make increasing use of hardware accelerators to improve performance and power properties. For large high-performance FPGAs to be successfully integrated in such computing systems, methods to raise the abstraction level of FPGA programming are required...... to identify optimal performance-energy trade-off points for a multicore-based FPGA accelerator.

  11. Improved Multi-Core Nested Depth-First Search

    NARCIS (Netherlands)

    Evangelista, Sami; Laarman, Alfons; Petrucci, Laure; Pol, van de Jaco; Ramesh, S.

    2012-01-01

    This paper presents CNDFS, a tight integration of two earlier multi-core nested depth-first search (NDFS) algorithms for LTL model checking. CNDFS combines the different strengths and avoids some weaknesses of its predecessors. We compare CNDFS to an earlier ad-hoc combination of those two algorithm

  12. Multi-core and/or symbolic model checking

    NARCIS (Netherlands)

    Dijk, van Tom; Laarman, Alfons; Pol, van de Jaco; Luettgen, G.; Merz, S.

    2012-01-01

    We review our progress in high-performance model checking. Our multi-core model checker is based on a scalable hash-table design and parallel random-walk traversal. Our symbolic model checker is based on Multiway Decision Diagrams and the saturation strategy. The LTSmin tool is based on the PINS arc

  13. Quantitative Application Data Flow Characterization for Heterogeneous Multicore Architectures

    NARCIS (Netherlands)

    Ostadzadeh, S.A.

    2012-01-01

    Recent trends show a steady increase in the utilization of heterogeneous multicore architectures in order to address the ever-growing need for computing performance. These emerging architectures pose specific challenges with regard to their programmability. In addition, they require efficient applic

  14. Runtime Support for Heterogeneous Multi-core Systems

    NARCIS (Netherlands)

    Sabeghi, M.

    2011-01-01

    Multi-core processing platforms are one of the major steps forward in offering high-performance computing platforms. The idea is to increase the performance by employing more processing elements to perform a job. However, this creates a challenge for both hardware developers who build such systems a

  15. Energy-efficient scheduling in multi-core servers

    NARCIS (Netherlands)

    Asghari, N.M.; Mandjes, M.; Walid, A.

    2014-01-01

    In this paper we develop techniques for analyzing and optimizing energy management in multi-core servers with speed scaling capabilities. Our framework incorporates the processor’s dynamic power, but it also accounts for other intricate and relevant power features such as the static (leakage) power

  16. Improved Multi-Core Nested Depth-First Search

    NARCIS (Netherlands)

    Evangelista, Sami; Laarman, Alfons; Petrucci, Laure; van de Pol, Jan Cornelis; Ramesh, S.

    2012-01-01

    This paper presents CNDFS, a tight integration of two earlier multi-core nested depth-first search (NDFS) algorithms for LTL model checking. CNDFS combines the different strengths and avoids some weaknesses of its predecessors. We compare CNDFS to an earlier ad-hoc combination of those two

  17. "Photonic lantern" spectral filters in multi-core fiber.

    Science.gov (United States)

    Birks, T A; Mangan, B J; Díez, A; Cruz, J L; Murphy, D F

    2012-06-18

    Fiber Bragg gratings are written across all 120 single-mode cores of a multi-core optical fiber. The fiber is interfaced to multimode ports by tapering it within a depressed-index glass jacket. The result is a compact multimode "photonic lantern" filter with astrophotonic applications. The tapered structure is also an effective mode scrambler.

  18. Multi-Core BDD Operations for Symbolic Reachability

    NARCIS (Netherlands)

    van Dijk, Tom; Laarman, Alfons; van de Pol, Jan Cornelis; Heljanko, K.; Knottenbelt, W.J.

    2012-01-01

    This paper presents scalable parallel BDD operations for modern multi-core hardware. We aim at increasing the performance of reachability analysis in the context of model checking. Existing approaches focus on performing multiple independent BDD operations rather than parallelizing the BDD

  19. Cache-based memory copy hardware accelerator for multicore systems

    NARCIS (Netherlands)

    Duarte, F.; Wong, S.

    2010-01-01

    In this paper, we present a new architecture of the cache-based memory copy hardware accelerator in a multicore system supporting message passing. The accelerator is able to accelerate memory data movements, in particular memory copies. We perform an analytical analysis based on open-queuing theory

  20. The associative memory system for the FTK processor at ATLAS

    CERN Document Server

    Magalotti, D; The ATLAS collaboration; Donati, S; Luciano, P; Piendibene, M; Giannetti, P; Lanza, A; Verzellesi, G; Sakellariou, Andreas; Billereau, W; Combe, J M

    2014-01-01

    In high energy physics experiments, the most interesting processes are very rare and hidden in an extremely large level of background. As the experiment complexity, accelerator backgrounds, and instantaneous luminosity increase, more effective and accurate data selection techniques are needed. The Fast TracKer processor (FTK) is a real time tracking processor designed for the ATLAS trigger upgrade. The FTK core is the Associative Memory system. It provides massive computing power to minimize the processing time of complex tracking algorithms executed online. This paper reports on the results and performance of a new prototype of Associative Memory system.

  1. Efficient Support for Matrix Computations on Heterogeneous Multi-core and Multi-GPU Architectures

    Energy Technology Data Exchange (ETDEWEB)

    Dong, Fengguang [Univ. of Tennessee, Knoxville, TN (United States); Tomov, Stanimire [Univ. of Tennessee, Knoxville, TN (United States); Dongarra, Jack [Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)

    2011-06-01

    We present a new methodology for utilizing all CPU cores and all GPUs on a heterogeneous multicore and multi-GPU system to support matrix computations efficiently. Our approach is able to achieve the objectives of a high degree of parallelism, minimized synchronization, minimized communication, and load balancing. Our main idea is to treat the heterogeneous system as a distributed-memory machine, and to use a heterogeneous 1-D block cyclic distribution to allocate data to the host system and GPUs to minimize communication. We have designed heterogeneous algorithms with two different tile sizes (one for CPU cores and the other for GPUs) to cope with processor heterogeneity. We propose an auto-tuning method to determine the best tile sizes to attain both high performance and load balancing. We have also implemented a new runtime system and applied it to the Cholesky and QR factorizations. Our experiments on a compute node with two Intel Westmere hexa-core CPUs and three Nvidia Fermi GPUs demonstrate good weak scalability, strong scalability, load balance, and efficiency of our approach.
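The 1-D block-cyclic device mapping described above can be sketched in a few lines; the device list, tile widths, and round-robin rule here are illustrative assumptions, not the authors' runtime system.

```python
def block_cyclic_owners(n_tiles, devices):
    """Assign tile columns to devices in round-robin (1-D block-cyclic)
    order, so each device owns every len(devices)-th tile column."""
    return {j: devices[j % len(devices)] for j in range(n_tiles)}

# Hypothetical node layout: one CPU pool plus three GPUs, echoing the
# two-tile-size idea (narrow tiles for CPU cores, wide tiles for GPUs).
DEVICES = ["cpu", "gpu0", "gpu1", "gpu2"]
TILE_WIDTH = {"cpu": 128, "gpu0": 1024, "gpu1": 1024, "gpu2": 1024}

owners = block_cyclic_owners(8, DEVICES)
# Tile columns 0..7 map to cpu, gpu0, gpu1, gpu2, cpu, gpu0, gpu1, gpu2.
```

The cyclic assignment keeps every device busy in each panel of the factorization, which is why it balances load without global scheduling.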

  2. Interferometric synthetic aperture microscopy implementation on a floating point multi-core digital signal processor

    Science.gov (United States)

    Ahmad, Adeel; Ali, Murtaza; South, Fredrick; Monroy, Guillermo L.; Adie, Steven G.; Shemonski, Nathan; Carney, P. Scott; Boppart, Stephen A.

    2013-03-01

    The transition of optical coherence tomography (OCT) technology from the lab environment towards the more challenging clinical and point-of-care settings is continuing at a rapid pace. On one hand this translation opens new opportunities and avenues for growth, while on the other hand it also presents a new set of challenges and constraints under which OCT systems have to operate. OCT systems in the clinical environment are not only required to be user friendly and easy to operate, but should also be portable and have a smaller form factor coupled with low cost and reduced power consumption. Digital signal processors (DSPs) are in a unique position to satisfy the computational requirements for OCT at a much lower cost and power consumption compared to existing platforms such as CPUs and graphics processing units (GPUs). In this work, we describe the implementation of optical coherence tomography (OCT) and interferometric synthetic aperture microscopy (ISAM) processing on a floating point multi-core DSP (C6678, Texas Instruments). ISAM is a computationally intensive data processing technique that is based on the re-sampling of the Fourier space of the data to yield spatially invariant transverse resolution in OCT. Preliminary results indicate that 2D ISAM processing at 70,000 A-lines/sec and OCT at 180,000 A-lines/sec can be achieved with the current implementation using available DSP hardware.

  3. Matrix Algebra for GPU and Multicore Architectures (MAGMA) for Large Petascale Systems

    Energy Technology Data Exchange (ETDEWEB)

    Dongarra, Jack J. [University Distinguished Professor; Tomov, Stanimire [Research Scientist

    2014-03-24

    The goal of the MAGMA project is to create a new generation of linear algebra libraries that achieve the fastest possible time to an accurate solution on hybrid Multicore+GPU-based systems, using all the processing power that future high-end systems can make available within given energy constraints. Our efforts at the University of Tennessee achieved the goals set in all of the five areas identified in the proposal: 1. Communication optimal algorithms; 2. Autotuning for GPU and hybrid processors; 3. Scheduling and memory management techniques for heterogeneity and scale; 4. Fault tolerance and robustness for large scale systems; 5. Building energy efficiency into software foundations. The University of Tennessee’s main contributions, as proposed, were the research and software development of new algorithms for hybrid multi/many-core CPUs and GPUs, as related to two-sided factorizations and complete eigenproblem solvers, hybrid BLAS, and energy efficiency for dense, as well as sparse, operations. Furthermore, as proposed, we investigated and experimented with various techniques targeting the five main areas outlined.

  4. CRBLASTER: A Fast Parallel-Processing Program for Cosmic Ray Rejection in Space-Based Observations

    Science.gov (United States)

    Mighell, K.

    Many astronomical image analysis tasks are based on algorithms that can be described as embarrassingly parallel - where the analysis of one subimage generally does not affect the analysis of another subimage. Yet few parallel-processing astrophysical image-analysis programs exist that can easily take full advantage of today's fast multi-core servers costing a few thousand dollars. One reason for the shortage of state-of-the-art parallel-processing astrophysical image-analysis codes is that writing parallel codes has been perceived to be difficult. I describe a new fast parallel-processing image-analysis program called CRBLASTER which does cosmic ray rejection using van Dokkum's L.A.Cosmic algorithm. CRBLASTER is written in C using the industry-standard Message Passing Interface library. Processing a single 800 x 800 Hubble Space Telescope Wide-Field Planetary Camera 2 (WFPC2) image takes 1.9 seconds using 4 processors on an Apple Xserve with two dual-core 3.0-GHz Intel Xeons; the efficiency of the program running with the 4 cores is 82%. The code has been designed to be used as a software framework for the easy development of parallel-processing image-analysis programs using embarrassingly parallel algorithms: all that needs to be done is to replace the core image-processing task (in this case the C function that performs the L.A.Cosmic algorithm) with an alternative image-analysis task based on a single-processor algorithm. I describe the design and implementation of the program and then discuss how it could be used to quickly perform time-critical analysis applications such as those involved with space surveillance, or complex calibration tasks as part of the pipeline processing of images from large focal plane arrays.
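The embarrassingly parallel structure described here can be sketched with a thread pool standing in for MPI ranks; `remove_spikes` is a toy stand-in for the L.A.Cosmic step, not CRBLASTER's actual code.

```python
from concurrent.futures import ThreadPoolExecutor

def remove_spikes(subimage, threshold=100):
    """Toy per-subimage task (the slot L.A.Cosmic fills in CRBLASTER):
    clip any pixel above `threshold` to 0."""
    return [[0 if px > threshold else px for px in row] for row in subimage]

def process_image(image, n_workers=4):
    """Split an image into horizontal strips, clean each strip
    independently (the embarrassingly parallel step), and
    reassemble the strips in their original order."""
    rows = len(image)
    step = (rows + n_workers - 1) // n_workers
    strips = [image[i:i + step] for i in range(0, rows, step)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        cleaned = list(pool.map(remove_spikes, strips))
    return [row for strip in cleaned for row in strip]

image = [[10, 500, 20], [30, 40, 999], [1, 2, 3], [4, 5, 6]]
cleaned = process_image(image)  # spikes 500 and 999 are clipped to 0
```

Swapping `remove_spikes` for any other single-processor routine is exactly the framework reuse the abstract describes; a production version would distribute strips over MPI ranks rather than threads.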

  5. A FAST ITERATIVE METHOD FOR SOLVING THE EIKONAL EQUATION ON TRIANGULATED SURFACES.

    Science.gov (United States)

    Fu, Zhisong; Jeong, Won-Ki; Pan, Yongsheng; Kirby, Robert M; Whitaker, Ross T

    2011-01-01

    This paper presents an efficient, fine-grained parallel algorithm for solving the Eikonal equation on triangular meshes. The Eikonal equation, and the broader class of Hamilton-Jacobi equations to which it belongs, have a wide range of applications from geometric optics and seismology to biological modeling and analysis of geometry and images. The ability to solve such equations accurately and efficiently provides new capabilities for exploring and visualizing parameter spaces and for solving inverse problems that rely on such equations in the forward model. Efficient solvers on state-of-the-art, parallel architectures require new algorithms that are not, in many cases, optimal, but are better suited to synchronous updates of the solution. In previous work [W. K. Jeong and R. T. Whitaker, SIAM J. Sci. Comput., 30 (2008), pp. 2512-2534], the authors proposed the fast iterative method (FIM) to efficiently solve the Eikonal equation on regular grids. In this paper we extend the fast iterative method to solve Eikonal equations efficiently on triangulated domains on the CPU and on parallel architectures, including graphics processors. We propose a new local update scheme that provides solutions of first-order accuracy for both architectures. We also propose a novel triangle-based update scheme and its corresponding data structure for efficient irregular data mapping to parallel single-instruction multiple-data (SIMD) processors. We provide detailed descriptions of the implementations on a single CPU, a multicore CPU with shared memory, and SIMD architectures with comparative results against state-of-the-art Eikonal solvers.
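A minimal version of the fast iterative method on a regular grid (the setting of the authors' earlier work) might look like the following; the active-list management is a simplified sketch, not the paper's triangulated or SIMD implementation.

```python
import math

def _update(u, i, j, h):
    """Godunov upwind solution of |grad u| = 1 at grid node (i, j)."""
    n = len(u)
    INF = float("inf")
    a = min(u[i - 1][j] if i > 0 else INF, u[i + 1][j] if i < n - 1 else INF)
    b = min(u[i][j - 1] if j > 0 else INF, u[i][j + 1] if j < n - 1 else INF)
    if a == INF or b == INF or abs(a - b) >= h:
        return min(a, b) + h                      # one-sided update
    return (a + b + math.sqrt(2 * h * h - (a - b) ** 2)) / 2

def fim_eikonal(n, sources, h=1.0, tol=1e-12):
    """Fast iterative method on an n x n grid with unit speed: keep an
    active list of nodes, recompute each with the upwind scheme, retire
    nodes whose value stops changing, and re-activate neighbors of any
    node whose value improved."""
    INF = float("inf")
    u = [[INF] * n for _ in range(n)]
    src = set(sources)
    for (i, j) in src:
        u[i][j] = 0.0
    nbrs = ((-1, 0), (1, 0), (0, -1), (0, 1))
    active = {(i + di, j + dj) for (i, j) in src for di, dj in nbrs
              if 0 <= i + di < n and 0 <= j + dj < n} - src
    while active:
        nxt = set()
        for (i, j) in active:
            new = _update(u, i, j, h)
            if u[i][j] - new > tol:               # improved: stay active
                u[i][j] = new
                nxt.add((i, j))
                for di, dj in nbrs:               # and wake the neighbors
                    p, q = i + di, j + dj
                    if 0 <= p < n and 0 <= q < n and (p, q) not in src:
                        nxt.add((p, q))
        active = nxt
    return u
```

Because every node in the active list can be updated independently within a sweep, this loop maps naturally onto the synchronous SIMD updates the paper targets.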

  6. Multicore Architecture-aware Scientific Applications

    Energy Technology Data Exchange (ETDEWEB)

    Srinivasa, Avinash [Iowa State Univ., Ames, IA (United States)

    2011-11-28

    Modern high performance systems are becoming increasingly complex and powerful due to advancements in processor and memory architecture. In order to keep up with this increasing complexity, applications have to be augmented with certain capabilities to fully exploit such systems. These may be at the application level, such as static or dynamic adaptations, or at the system level, like having strategies in place to override some of the default operating system policies, the main objective being to improve the computational performance of the application. The current work proposes two such capabilities with respect to multi-threaded scientific applications, in particular a large-scale physics application computing ab-initio nuclear structure. The first involves using a middleware tool to invoke dynamic adaptations in the application, so as to be able to adjust to changing computational resource availability at run-time. The second involves a strategy for effective placement of data in main memory, to optimize memory access latencies and bandwidth. These capabilities, when included, were found to have a significant impact on application performance, resulting in average speedups of as much as two to four times.

  7. Multicore Parallel Implementation of 2D-FFTBased on TMS320C6678 DSP

    Institute of Scientific and Technical Information of China (English)

    Wende Wu; Zhiyong Xu

    2015-01-01

    We put forward a multicore parallel scheme for the 2D FFT and implement it on a TMS320C6678 DSP, after researching the characteristics of different multicore DSP programming models and the two-dimensional FFT (2D-FFT). We bring the parallel computing capability of the multicore DSP into full play and improve the efficiency of the 2D-FFT. This has great reference value for implementing image processing algorithms based on the 2D-FFT.
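The standard row-column decomposition that makes a 2D FFT easy to parallelize can be sketched as follows; this is a generic illustration using a thread pool, not the TMS320C6678 implementation.

```python
import cmath
from concurrent.futures import ThreadPoolExecutor

def fft(x):
    """Radix-2 Cooley-Tukey FFT (length must be a power of two)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def fft2_parallel(a, workers=4):
    """Row-column 2D FFT: independent 1-D FFTs over all rows in parallel,
    then over all columns of the intermediate result."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        rows = list(pool.map(fft, a))
        cols = list(pool.map(fft, zip(*rows)))  # zip(*rows) yields columns
    return [list(r) for r in zip(*cols)]        # transpose back
```

On a multicore DSP the same structure applies, with cores taking row (then column) batches; the transpose between the two passes is the main inter-core data movement to optimize.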

  8. Revisiting Multiple Pattern Matching Algorithms for Multi-Core Architecture

    Institute of Scientific and Technical Information of China (English)

    Guang-Ming Tan; Ping Liu; Dong-Bo Bu; Yan-Bing Liu

    2011-01-01

    Due to the huge number of patterns to be searched, multiple pattern searching remains a challenge to several newly-arising applications like network intrusion detection. In this paper, we present an attempt to design efficient multiple pattern searching algorithms on multi-core architectures. We observe an important feature which indicates that the multiple pattern matching time mainly depends on the number and minimal length of patterns. The multi-core algorithm proposed in this paper leverages this feature to decompose the pattern set so that the parallel execution time is minimized. We formulate the problem as an optimal decomposition and scheduling of a pattern set, then propose a heuristic algorithm, which takes advantage of dynamic programming and greedy algorithmic techniques, to solve the optimization problem. Experimental results suggest that our decomposition approach can increase the searching speed by more than 200% on a 4-core AMD Barcelona system.
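The decompose-then-search idea can be sketched like this; the cost model (cost proportional to 1/pattern-length, since short patterns filter less) and the greedy balancing rule are illustrative assumptions standing in for the paper's dynamic-programming heuristic.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def decompose(patterns, n_cores):
    """Greedily assign each pattern to the currently cheapest core,
    estimating a pattern's search cost as 1/len(pattern)."""
    bins = [(0.0, i, []) for i in range(n_cores)]
    heapq.heapify(bins)
    for p in sorted(patterns, key=len):          # costliest (shortest) first
        cost, i, group = heapq.heappop(bins)
        group.append(p)
        heapq.heappush(bins, (cost + 1.0 / len(p), i, group))
    return [g for _, _, g in bins]

def search(text, patterns):
    """Naive scan: all (position, pattern) occurrences in `text`."""
    hits = []
    for p in patterns:
        start = text.find(p)
        while start != -1:
            hits.append((start, p))
            start = text.find(p, start + 1)
    return hits

def parallel_match(text, patterns, n_cores=4):
    """Search each pattern group on its own core; the slowest group
    determines the parallel execution time being minimized."""
    groups = decompose(patterns, n_cores)
    with ThreadPoolExecutor(max_workers=n_cores) as pool:
        results = pool.map(lambda g: search(text, g), groups)
    return sorted(h for r in results for h in r)
```

A production matcher would run an Aho-Corasick automaton per group instead of the naive scan, but the decomposition and balancing logic is the part the paper optimizes.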

  9. Improved Parallel Apriori Algorithm for Multi-cores

    Directory of Open Access Journals (Sweden)

    Swati Rustogi

    2017-04-01

    Full Text Available The Apriori algorithm is one of the most popular data mining techniques, used for mining hidden relationships in large data sets. With parallelism, a large data set can be mined in less time. Apart from costly distributed systems, a computer supporting a multi-core environment can be used to apply parallelism. In this paper an improved Apriori algorithm for multi-core environments is proposed. The main contributions of this paper are: (i) an efficient Apriori algorithm that applies data parallelism in a multi-core environment by reducing the time taken to count the frequency of candidate item sets; (ii) an evaluation of the proposed algorithm's speedup on multiple cores; (iii) a comparison with another such parallel algorithm, showing an improvement of more than 15% in preliminary experiments.
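The data-parallel candidate-counting step can be sketched as follows; the partitioning scheme and the use of a thread pool are illustrative assumptions, not the paper's implementation.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_partition(transactions, candidates):
    """Count, within one data partition, how many transactions contain
    each candidate itemset -- the hot loop of every Apriori pass."""
    c = Counter()
    for t in transactions:
        tset = set(t)
        for cand in candidates:
            if set(cand) <= tset:
                c[cand] += 1
    return c

def parallel_counts(transactions, candidates, n_cores=4):
    """Split the transaction list across cores, count candidate
    frequencies per partition, then merge the partial counts."""
    step = (len(transactions) + n_cores - 1) // n_cores
    parts = [transactions[i:i + step]
             for i in range(0, len(transactions), step)]
    with ThreadPoolExecutor(max_workers=n_cores) as pool:
        partials = pool.map(lambda p: count_partition(p, candidates), parts)
    total = Counter()
    for c in partials:
        total.update(c)                # merge is a cheap reduction step
    return total
```

Because counting is associative, the merge costs only one pass over the candidate set, so the counting phase scales with the number of cores.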

  10. Network Coding Parallelization Based on Matrix Operations for Multicore Architectures

    DEFF Research Database (Denmark)

    Wunderlich, Simon; Cabrera, Juan; Fitzek, Frank

    2015-01-01

    . Despite the fact that single core implementations show already comparable coding speeds with standard coding approaches, this paper pushes network coding to the next level by exploiting multicore architectures. The disruptive idea presented in the paper is to break with current software implementations...... and coding approaches and to adopt highly optimized dense matrix operations from the high performance computation field for network coding in order to increase the coding speed. The paper presents the novel coding approach for multicore architectures and shows coding speed gains on a commercial platform...... such as the Raspberry Pi2 with four cores in the order of up to one full magnitude. The speed increase gain is even higher than the number of cores of the Raspberry Pi2 since the newly introduced approach exploits the cache architecture way better than by-the-book matrix operations. Copyright © 2015 by the Institute...
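
    The matrix view of network coding that this work exploits can be sketched over GF(2) (a pure-Python illustration of the algebra only; the paper's point is to use highly optimized dense matrix routines for these operations): coded packets are the product of a coefficient matrix with the packet matrix, and a receiver decodes by Gaussian elimination on the augmented system.

```python
def matmul_gf2(A, B):
    # Coded packets = C * P over GF(2).
    return [[sum(a * b for a, b in zip(row, col)) % 2
             for col in zip(*B)] for row in A]

def solve_gf2(C, coded):
    # Gauss-Jordan elimination over GF(2) on the augmented matrix [C | coded].
    n = len(C)
    aug = [C[i] + coded[i] for i in range(n)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col])
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col and aug[r][col]:
                aug[r] = [(x + y) % 2 for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]   # the recovered packet matrix
```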

  11. Accelerating Atmospheric Modeling Through Emerging Multi-core Technologies

    OpenAIRE

    Linford, John Christian

    2010-01-01

    The new generations of multi-core chipset architectures achieve unprecedented levels of computational power while respecting physical and economical constraints. The cost of this power is bewildering program complexity. Atmospheric modeling is a grand-challenge problem that could make good use of these architectures if they were more accessible to the average programmer. To that end, software tools and programming methodologies that greatly simplify the acceleration of atmospheric modeling...

  12. A Real-Time Linux for Multicore Platforms

    Science.gov (United States)

    2013-12-20

    Multicore Platforms, 25th Euromicro Conference on Real-Time Systems, 09-JUL-13. Jeremy Erickson, James Anderson, Reducing Tardiness Under Global...; Baruah, "Improved tardiness bounds for Global EDF," Proceedings of the EuroMicro Conference on Real-Time Systems, Brussels, Belgium, IEEE Computer...missed. In a soft real-time application, some deadline tardiness is permissible. In the definition of "soft real-time" that we have focused on

  13. Supermodes in Coupled Multi-Core Waveguide Structures

    Science.gov (United States)

    2016-04-01

    gentle bends. Because modes travel at different group velocities, if three guided modes are assumed to be excited with amplitudes A, B, C and phase φa... mode multi-core waveguide arrays can be strongly affected by angle-dependent couplings, leading to different modal field profiles. Analytical... and 37 sites. We begin our analysis by assuming that, in all cases, each waveguide element is cylindrical (of radius a) and single-moded

  14. Multicore Performance of Block Algebraic Iterative Reconstruction Methods

    DEFF Research Database (Denmark)

    Sørensen, Hans Henrik B.; Hansen, Per Christian

    2014-01-01

    Algebraic iterative methods are routinely used for solving the ill-posed sparse linear systems arising in tomographic image reconstruction. Here we consider the algebraic reconstruction technique (ART) and the simultaneous iterative reconstruction techniques (SIRT), both of which rely...... a fixed relaxation parameter in each method, namely, the one that leads to the fastest semiconvergence. Computational results show that for multicore computers, the sequential approach is preferable....
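
    A minimal sketch of the ART (Kaczmarz) iteration with a fixed relaxation parameter, the setting studied here (pure Python with dense rows, for illustration only; production codes use sparse storage and, for SIRT, block-parallel updates):

```python
def art_sweep(A, b, x, relaxation=1.0):
    # One Kaczmarz sweep: project x onto the hyperplane of each row in turn.
    for row, bi in zip(A, b):
        norm_sq = sum(a * a for a in row)
        if norm_sq == 0:
            continue
        residual = bi - sum(a * xi for a, xi in zip(row, x))
        step = relaxation * residual / norm_sq
        x = [xi + step * a for xi, a in zip(x, row)]
    return x

def art(A, b, iterations=50, relaxation=1.0):
    x = [0.0] * len(A[0])
    for _ in range(iterations):
        x = art_sweep(A, b, x, relaxation)
    return x
```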

  15. Multi-core processing and scheduling performance in CMS

    CERN Document Server

    CERN. Geneva

    2012-01-01

    Commodity hardware is going many-core. We might soon not be able to satisfy the per-core job memory needs in the current single-core processing model in High Energy Physics. In addition, an ever-increasing number of independent and incoherent jobs running on the same physical hardware without sharing resources might significantly affect processing performance. It will be essential to utilize the multi-core architecture effectively. CMS has incorporated support for multi-core processing in the event processing framework and the workload management system. Multi-core processing jobs share common data in memory, such as the code libraries, detector geometry and conditions data, resulting in much lower memory usage than standard single-core independent jobs. Exploiting this new processing model requires a new model of computing resource allocation, departing from the standard single-core allocation for a job. The experiment job management system needs to have control over a larger quantum of resource since multi-...

  16. PERI - Auto-tuning Memory Intensive Kernels for Multicore

    Energy Technology Data Exchange (ETDEWEB)

    Bailey, David H; Williams, Samuel; Datta, Kaushik; Carter, Jonathan; Oliker, Leonid; Shalf, John; Yelick, Katherine; Bailey, David H

    2008-06-24

    We present an auto-tuning approach to optimize application performance on emerging multicore architectures. The methodology extends the idea of search-based performance optimizations, popular in linear algebra and FFT libraries, to application-specific computational kernels. Our work applies this strategy to Sparse Matrix Vector Multiplication (SpMV), the explicit heat equation PDE on a regular grid (Stencil), and a lattice Boltzmann application (LBMHD). We explore one of the broadest sets of multicore architectures in the HPC literature, including the Intel Xeon Clovertown, AMD Opteron Barcelona, Sun Victoria Falls, and the Sony-Toshiba-IBM (STI) Cell. Rather than hand-tuning each kernel for each system, we develop a code generator for each kernel that allows us to identify a highly optimized version for each platform, while amortizing the human programming effort. Results show that our auto-tuned kernel applications often achieve a better than 4X improvement compared with the original code. Additionally, we analyze a Roofline performance model for each platform to reveal hardware bottlenecks and software challenges for future multicore systems and applications.
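
    The search-based tuning loop itself is simple to sketch (the candidate kernels below are stand-ins that compute the same dot product with different chunk sizes; real auto-tuners generate variants over blockings, unrollings, and prefetch strategies per platform): each variant is timed and the fastest configuration is kept.

```python
import time

def make_kernel(chunk):
    # Stand-in for a generated kernel variant: same result, different loop blocking.
    def kernel(x, y):
        total = 0.0
        for i in range(0, len(x), chunk):
            for j in range(i, min(i + chunk, len(x))):
                total += x[j] * y[j]
        return total
    return kernel

def autotune(x, y, chunk_sizes):
    # Time every variant on this machine and keep the fastest configuration.
    best, best_time = None, float("inf")
    for chunk in chunk_sizes:
        kernel = make_kernel(chunk)
        start = time.perf_counter()
        kernel(x, y)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = chunk, elapsed
    return best
```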

  17. Polytopol computing for multi-core and distributed systems

    Science.gov (United States)

    Spaanenburg, Henk; Spaanenburg, Lambert; Ranefors, Johan

    2009-05-01

    Multi-core computing provides new challenges to software engineering. The paper addresses such issues in the general setting of polytopol computing, that takes multi-core problems in such widely differing areas as ambient intelligence sensor networks and cloud computing into account. It argues that the essence lies in a suitable allocation of free moving tasks. Where hardware is ubiquitous and pervasive, the network is virtualized into a connection of software snippets judiciously injected to such hardware that a system function looks as one again. The concept of polytopol computing provides a further formalization in terms of the partitioning of labor between collector and sensor nodes. Collectors provide functions such as a knowledge integrator, awareness collector, situation displayer/reporter, communicator of clues and an inquiry-interface provider. Sensors provide functions such as anomaly detection (only communicating singularities, not continuous observation), they are generally powered or self-powered, amorphous (not on a grid) with generation-and-attrition, field re-programmable, and sensor plug-and-play-able. Together the collector and the sensor are part of the skeleton injector mechanism, added to every node, and give the network the ability to organize itself into some of many topologies. Finally we will discuss a number of applications and indicate how a multi-core architecture supports the security aspects of the skeleton injector.

  18. Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms

    Energy Technology Data Exchange (ETDEWEB)

    Williams, Samuel; Carter, Jonathan; Oliker, Leonid; Shalf, John; Yelick, Katherine

    2008-02-01

    We present an auto-tuning approach to optimize application performance on emerging multicore architectures. The methodology extends the idea of search-based performance optimizations, popular in linear algebra and FFT libraries, to application-specific computational kernels. Our work applies this strategy to a lattice Boltzmann application (LBMHD) that historically has made poor use of scalar microprocessors due to its complex data structures and memory access patterns. We explore one of the broadest sets of multicore architectures in the HPC literature, including the Intel Clovertown, AMD Opteron X2, Sun Niagara2, STI Cell, as well as the single-core Intel Itanium2. Rather than hand-tuning LBMHD for each system, we develop a code generator that allows us to identify a highly optimized version for each platform, while amortizing the human programming effort. Results show that our auto-tuned LBMHD application achieves up to a 14x improvement compared with the original code. Additionally, we present detailed analysis of each optimization, which reveals surprising hardware bottlenecks and software challenges for future multicore systems and applications.

  19. Composite Pseudo Associative Cache with Victim Cache for Mobile Processors

    Directory of Open Access Journals (Sweden)

    Lakshmi D. Bobbala

    2010-01-01

    Full Text Available Problem statement: Multi-core trends are becoming dominant, creating sophisticated and complicated cache structures. One of the easiest ways to design cache memory for increasing performance is to double the cache size. The big cache size is directly related to the area and power consumption. Especially in mobile processors, a simple increase of the cache size may significantly affect chip area and power. Without increasing the size of the cache, we propose a novel method to improve the overall performance. Approach: We proposed a composite cache mechanism for L1 and L2 caches to maximize cache performance within a given cache size. This technique could be used without increasing cache size or set associativity, by emphasizing primary way utilization and pseudo-associativity. We also added a victim cache to the composite pseudo-associative cache for further improvement. Results: Based on our experiments with the sampled SPEC CPU2006 workload, the proposed cache mechanism showed a remarkable reduction in cache misses without affecting the size. Conclusion/Recommendation: The variation of performance improvement depends on benchmark, cache size and set associativity, but the proposed scheme shows more sensitivity to a cache size increase than to a set associativity increase.
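
    The victim-cache idea can be illustrated with a toy simulator (an illustrative sketch, not the proposed composite pseudo-associative mechanism itself): a direct-mapped array is backed by a small fully associative victim buffer that is probed on a miss, so recently evicted lines can be recovered without enlarging the cache.

```python
from collections import OrderedDict

class VictimCache:
    """Direct-mapped cache with a small fully associative victim buffer."""

    def __init__(self, n_sets, victim_entries):
        self.n_sets = n_sets
        self.victim_entries = victim_entries
        self.lines = {}               # set index -> cached block address
        self.victim = OrderedDict()   # recently evicted block addresses (LRU)
        self.hits = self.misses = 0

    def _evict_to_victim(self, index):
        if index in self.lines:
            self.victim[self.lines.pop(index)] = True
            if len(self.victim) > self.victim_entries:
                self.victim.popitem(last=False)   # drop least recently evicted

    def access(self, address):
        index = address % self.n_sets
        if self.lines.get(index) == address:
            self.hits += 1
            return "hit"
        if address in self.victim:
            del self.victim[address]              # swap back into the main array
            self._evict_to_victim(index)
            self.lines[index] = address
            self.hits += 1
            return "victim-hit"
        self.misses += 1
        self._evict_to_victim(index)
        self.lines[index] = address
        return "miss"
```

    In the usage below, two addresses that conflict in the same set would normally ping-pong as misses; the victim buffer turns the second reference to the evicted line into a hit.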

  20. Composite Pseudo Associative Cache with Victim Cache for Mobile Processors

    Directory of Open Access Journals (Sweden)

    Lakshmi D. Bobbala

    2011-01-01

    Full Text Available Problem statement: Multi-core trends are becoming dominant, creating sophisticated and complicated cache structures. One of the easiest ways to design cache memory for increasing performance is to double the cache size. The big cache size is directly related to the area and power consumption. Especially in mobile processors, a simple increase of the cache size may significantly affect chip area and power. Without increasing the size of the cache, we propose a novel method to improve the overall performance. Approach: We proposed a composite cache mechanism for L1 and L2 caches to maximize cache performance within a given cache size. This technique could be used without increasing cache size or set associativity, by emphasizing primary way utilization and pseudo-associativity. We also added a victim cache to the composite pseudo-associative cache for further improvement. Results: Based on our experiments with the sampled SPEC CPU2006 workload, the proposed cache mechanism showed a remarkable reduction in cache misses without affecting the size. Conclusion/Recommendation: The variation of performance improvement depends on benchmark, cache size and set associativity, but the proposed scheme shows more sensitivity to a cache size increase than to a set associativity increase.

  1. Libera Electron Beam Position Processor

    CERN Document Server

    Ursic, Rok

    2005-01-01

    Libera is a product family delivering unprecedented possibilities for either building powerful single-station solutions or architecting complex feedback systems in the field of accelerator instrumentation and controls. This paper presents the functionality and field performance of its first member, the electron beam position processor. It offers superior performance, with multiple measurement channels simultaneously delivering position measurements in digital format with MHz, kHz, and Hz bandwidths. This all-in-one product, facilitating pulsed and CW measurements, is much more than simply a high-performance beam position measuring device delivering micrometer-level reproducibility with sub-micrometer resolution. Rich connectivity options and innate processing power make it a powerful feedback building block. By interconnecting multiple Libera electron beam position processors one can build a low-latency, high-throughput orbit feedback system without adding additional hardware. Libera electron beam position processor ...

  2. Java Processor Optimized for RTSJ

    Directory of Open Access Journals (Sweden)

    Tu Shiliang

    2007-01-01

    Full Text Available Due to the preeminent work on the real-time specification for Java (RTSJ), Java is increasingly expected to become the leading programming language for real-time systems. To provide a Java platform suitable for real-time applications, a Java processor that can execute Java bytecode directly is proposed in this paper. It provides efficient hardware support for some mechanisms specified in the RTSJ and offers a simpler programming model by ameliorating the scoped memory of the RTSJ. The worst-case execution time (WCET) of the bytecodes implemented in this processor is predictable by employing the optimization method proposed in our previous work, in which all processing that interferes with predictability is handled before bytecode execution. A further advantage of this method is that it makes the implementation of the processor simpler and suited to a low-cost FPGA chip.

  3. Making CSB + -Trees Processor Conscious

    DEFF Research Database (Denmark)

    Samuel, Michael; Pedersen, Anders Uhl; Bonnet, Philippe

    2005-01-01

    Cache-conscious indexes, such as CSB+-tree, are sensitive to the underlying processor architecture. In this paper, we focus on how to adapt the CSB+-tree so that it performs well on a range of different processor architectures. Previous work has focused on the impact of node size on the performance...... of the CSB+-tree. We argue that it is necessary to consider a larger group of parameters in order to adapt CSB+-tree to processor architectures as different as Pentium and Itanium. We identify this group of parameters and study how it impacts the performance of CSB+-tree on Itanium 2. Finally, we propose...... a systematic method for adapting CSB+-tree to new platforms. This work is a first step towards integrating CSB+-tree in MySQL’s heap storage manager....

  4. Reconfigurable Communication Processor:A New Approach for Network Processor

    Institute of Scientific and Technical Information of China (English)

    孙华; 陈青山; 张文渊

    2003-01-01

    As the traditional RISC + ASIC/ASSP approach to network processor design cannot meet today's requirements, this paper describes an alternative approach, the Reconfigurable Processing Architecture, which boosts performance to ASIC level while preserving the programmability of a traditional RISC-based system. This paper covers both the hardware architecture and the software development environment architecture.

  5. Fast Automatic Segmentation of White Matter Streamlines Based on a Multi-Subject Bundle Atlas.

    Science.gov (United States)

    Labra, Nicole; Guevara, Pamela; Duclap, Delphine; Houenou, Josselin; Poupon, Cyril; Mangin, Jean-François; Figueroa, Miguel

    2017-01-01

    This paper presents an algorithm for fast segmentation of white matter bundles from massive dMRI tractography datasets using a multisubject atlas. We use a distance metric to compare streamlines in a subject dataset to labeled centroids in the atlas, and label them using a per-bundle configurable threshold. In order to reduce segmentation time, the algorithm first preprocesses the data using a simplified distance metric to rapidly discard candidate streamlines in multiple stages, while guaranteeing that no false negatives are produced. The smaller set of remaining streamlines is then segmented using the original metric, thus eliminating any false positives from the preprocessing stage. As a result, a single-thread implementation of the algorithm can segment a dataset of almost 9 million streamlines in less than 6 minutes. Moreover, parallel versions of our algorithm for multicore processors and graphics processing units further reduce the segmentation time to less than 22 seconds and to 5 seconds, respectively. This performance enables the use of the algorithm in truly interactive applications for visualization, analysis, and segmentation of large white matter tractography datasets.
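
    The no-false-negative prefiltering can be sketched with simplified metrics (the distance functions and threshold here are illustrative, not the paper's per-bundle configuration): because the endpoint distance can never exceed the maximum pointwise distance, discarding streamlines on the cheap endpoint check loses no true matches, and the exact metric is only evaluated on the survivors.

```python
import math

def exact_metric(s, centroid):
    # Expensive metric: maximum pointwise distance to the bundle centroid.
    return max(math.dist(p, q) for p, q in zip(s, centroid))

def endpoint_bound(s, centroid):
    # Cheap lower bound: only the two endpoints are compared, so it can
    # never exceed the exact metric above.
    return max(math.dist(s[0], centroid[0]), math.dist(s[-1], centroid[-1]))

def segment(streamlines, centroid, threshold):
    # Stage 1: safely discard candidates on the cheap bound (no false negatives).
    survivors = [s for s in streamlines
                 if endpoint_bound(s, centroid) <= threshold]
    # Stage 2: exact metric removes any false positives from stage 1.
    return [s for s in survivors if exact_metric(s, centroid) <= threshold]
```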

  6. Parallelized Kalman-Filter-Based Reconstruction of Particle Tracks on Many-Core Processors and GPUs

    Science.gov (United States)

    Cerati, Giuseppe; Elmer, Peter; Krutelyov, Slava; Lantz, Steven; Lefebvre, Matthieu; Masciovecchio, Mario; McDermott, Kevin; Riley, Daniel; Tadel, Matevž; Wittich, Peter; Würthwein, Frank; Yagil, Avi

    2017-08-01

    For over a decade now, physical and energy constraints have limited clock speed improvements in commodity microprocessors. Instead, chipmakers have been pushed into producing lower-power, multi-core processors such as Graphical Processing Units (GPU), ARM CPUs, and Intel MICs. Broad-based efforts from manufacturers and developers have been devoted to making these processors user-friendly enough to perform general computations. However, extracting performance from a larger number of cores, as well as specialized vector or SIMD units, requires special care in algorithm design and code optimization. One of the most computationally challenging problems in high-energy particle experiments is finding and fitting the charged-particle tracks during event reconstruction. This is expected to become by far the dominant problem at the High-Luminosity Large Hadron Collider (HL-LHC), for example. Today the most common track finding methods are those based on the Kalman filter. Experience with Kalman techniques on real tracking detector systems has shown that they are robust and provide high physics performance. This is why they are currently in use at the LHC, both in the trigger and offline. Previously we reported on the significant parallel speedups that resulted from our investigations to adapt Kalman filters to track fitting and track building on Intel Xeon and Xeon Phi. Here, we discuss our progress toward understanding these processors and the new developments to port the Kalman filter to NVIDIA GPUs.
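
    For readers unfamiliar with the method, a minimal one-dimensional constant-velocity Kalman filter shows the predict/update cycle that track fitting repeats per detector hit (a didactic sketch with assumed noise parameters, far simpler than the experiments' multi-dimensional track states):

```python
def kalman_track(measurements, dt=1.0, q=1e-3, r=0.1):
    # State is (position, velocity); only position is measured.
    # q and r are assumed process and measurement noise variances.
    x = [measurements[0], 0.0]
    P = [[1.0, 0.0], [0.0, 1.0]]               # state covariance
    for z in measurements[1:]:
        # Predict with F = [[1, dt], [0, 1]]; process noise q on the diagonal.
        x = [x[0] + dt * x[1], x[1]]
        P = [[P[0][0] + dt*(P[1][0] + P[0][1]) + dt*dt*P[1][1] + q,
              P[0][1] + dt*P[1][1]],
             [P[1][0] + dt*P[1][1], P[1][1] + q]]
        # Update with H = [1, 0]: gain K = P H^T / (H P H^T + r).
        s = P[0][0] + r
        K = [P[0][0] / s, P[1][0] / s]
        residual = z - x[0]
        x = [x[0] + K[0]*residual, x[1] + K[1]*residual]
        P = [[(1 - K[0])*P[0][0], (1 - K[0])*P[0][1]],
             [P[1][0] - K[1]*P[0][0], P[1][1] - K[1]*P[0][1]]]
    return x
```

    The parallelization challenge described in the paper comes from running many such filters concurrently over candidate tracks while keeping SIMD lanes and cores busy.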

  7. Parallelized Kalman-Filter-Based Reconstruction of Particle Tracks on Many-Core Processors and GPUs

    Directory of Open Access Journals (Sweden)

    Cerati Giuseppe

    2017-01-01

    Full Text Available For over a decade now, physical and energy constraints have limited clock speed improvements in commodity microprocessors. Instead, chipmakers have been pushed into producing lower-power, multi-core processors such as Graphical Processing Units (GPU), ARM CPUs, and Intel MICs. Broad-based efforts from manufacturers and developers have been devoted to making these processors user-friendly enough to perform general computations. However, extracting performance from a larger number of cores, as well as specialized vector or SIMD units, requires special care in algorithm design and code optimization. One of the most computationally challenging problems in high-energy particle experiments is finding and fitting the charged-particle tracks during event reconstruction. This is expected to become by far the dominant problem at the High-Luminosity Large Hadron Collider (HL-LHC), for example. Today the most common track finding methods are those based on the Kalman filter. Experience with Kalman techniques on real tracking detector systems has shown that they are robust and provide high physics performance. This is why they are currently in use at the LHC, both in the trigger and offline. Previously we reported on the significant parallel speedups that resulted from our investigations to adapt Kalman filters to track fitting and track building on Intel Xeon and Xeon Phi. Here, we discuss our progress toward understanding these processors and the new developments to port the Kalman filter to NVIDIA GPUs.

  8. ASSP Advanced Sensor Signal Processor.

    Science.gov (United States)

    1984-06-01

    transfer data and commands. When a processor receives the required data (image) and/or command, that data will be operated on autonomously. The... RAM is provided by two separately controlled DMA address generator chips (Am2940). Each of these DMA chips creates an 8-bit address. One DMA chip gene...

  9. LIBS data analysis using a predictor-corrector based digital signal processor algorithm

    Science.gov (United States)

    Sanders, Alex; Griffin, Steven T.; Robinson, Aaron

    2012-06-01

    There are many accepted sensor technologies for generating spectra for material classification. Once the spectra are generated, communication bandwidth limitations favor local material classification with its attendant reduction in data transfer rates and power consumption. Sensor technologies such as Cavity Ring-Down Spectroscopy (CRDS) and Laser-Induced Breakdown Spectroscopy (LIBS) require effective material classifiers. A result of recent efforts has been emphasis on Partial Least Squares - Discriminant Analysis (PLS-DA) and Principal Component Analysis (PCA). Implementation of these via general-purpose computers is difficult in small portable sensor configurations. This paper addresses the creation of a low-mass, low-power, robust hardware spectra classifier for a limited set of predetermined materials in an atmospheric matrix. Crucial to this is the incorporation of PCA or PLS-DA classifiers into a predictor-corrector style implementation. The system configuration guarantees rapid convergence. Software running on multi-core Digital Signal Processors (DSPs) simulates a streamlined plasma physics model estimator, reducing Analog-to-Digital (ADC) power requirements. This paper presents the results of a predictor-corrector model implemented on a low-power multi-core DSP to perform substance classification. This configuration emphasizes the hardware system and software design via a predictor-corrector model that simultaneously decreases the sample rate while performing the classification.

  10. Cassava processors' awareness of occupational and environmental ...

    African Journals Online (AJOL)

    Cassava processors' awareness of occupational and environmental hazards ... Majority of the respondents also complained of lack of water (78.4%), lack of ... so as to reduce the problems faced by cassava processors during processing.

  11. Parallel Sparse Linear System and Eigenvalue Problem Solvers: From Multicore to Petascale Computing

    Science.gov (United States)

    2015-06-01

    problems that achieve high performance on a single multicore node and on clusters of many multicore nodes. Further, we demonstrate the superior robustness and parallel scalability of our solvers compared to other publicly available parallel solvers for these two fundamental... LU- and algebraic multigrid-preconditioned Krylov subspace methods. This has been demonstrated in previous annual reports of this

  12. An object-oriented bulk synchronous parallel library for multicore programming

    NARCIS (Netherlands)

    Yzelman, A.N.; Bisseling, R.H.

    2012-01-01

    We show that the bulk synchronous parallel (BSP) model, originally designed for distributed-memory systems, is also applicable for shared-memory multicore systems and, furthermore, that BSP libraries are useful in scientific computing on these systems. A proof-of-concept MulticoreBSP library has
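
    The BSP structure on shared memory can be sketched with standard-library threads (an illustration of the model only, not the MulticoreBSP API): each thread runs a sequence of supersteps, and a barrier at the end of each superstep makes all prior writes visible before the next one begins.

```python
import threading

def bsp_run(n_threads, superstep_fns):
    barrier = threading.Barrier(n_threads)
    shared = [0] * n_threads          # one slot per "processor"

    def worker(pid):
        for fn in superstep_fns:
            fn(pid, shared)           # local computation + writes
            barrier.wait()            # end of superstep: communication completes

    threads = [threading.Thread(target=worker, args=(pid,))
               for pid in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared
```

    For example, a first superstep in which every thread publishes a value and a second in which thread 0 sums them is race-free precisely because of the intervening barrier.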

  13. Novel Crosstalk Measurement Method for Multi-Core Fiber Fan-In/Fan-Out Devices

    DEFF Research Database (Denmark)

    Ye, Feihong; Ono, Hirotaka; Abe, Yoshiteru;

    2016-01-01

    We propose a new crosstalk measurement method for multi-core fiber fan-in/fan-out devices utilizing the Fresnel reflection. Compared with the traditional method using core-to-core coupling between a multi-core fiber and a single-mode fiber, the proposed method has the advantages of high reliabili...

  14. Design Principles for Synthesizable Processor Cores

    DEFF Research Database (Denmark)

    Schleuniger, Pascal; McKee, Sally A.; Karlsson, Sven

    2012-01-01

    As FPGAs get more competitive, synthesizable processor cores become an attractive choice for embedded computing. Currently popular commercial processor cores do not fully exploit current FPGA architectures. In this paper, we propose general design principles to increase instruction throughput...... on FPGA-based processor cores: first, superpipelining enables higher-frequency system clocks, and second, predicated instructions circumvent costly pipeline stalls due to branches. To evaluate their effects, we develop Tinuso, a processor architecture optimized for FPGA implementation. We demonstrate...

  15. 40 CFR 791.45 - Processors.

    Science.gov (United States)

    2010-07-01

    ... 40 Protection of Environment 31 2010-07-01 2010-07-01 true Processors. 791.45 Section 791.45 Protection of Environment ENVIRONMENTAL PROTECTION AGENCY (CONTINUED) TOXIC SUBSTANCES CONTROL ACT (CONTINUED) DATA REIMBURSEMENT Basis for Proposed Order § 791.45 Processors. (a) Generally, processors will be...

  16. Stencil Computation Optimization and Auto-tuning on State-of-the-Art Multicore Architectures

    Energy Technology Data Exchange (ETDEWEB)

    Datta, Kaushik; Murphy, Mark; Volkov, Vasily; Williams, Samuel; Carter, Jonathan; Oliker, Leonid; Patterson, David; Shalf, John; Yelick, Katherine

    2008-08-22

    Understanding the most efficient design and utilization of emerging multicore systems is one of the most challenging questions faced by the mainstream and scientific computing industries in several decades. Our work explores multicore stencil (nearest-neighbor) computations -- a class of algorithms at the heart of many structured grid codes, including PDE solvers. We develop a number of effective optimization strategies, and build an auto-tuning environment that searches over our optimizations and their parameters to minimize runtime, while maximizing performance portability. To evaluate the effectiveness of these strategies we explore the broadest set of multicore architectures in the current HPC literature, including the Intel Clovertown, AMD Barcelona, Sun Victoria Falls, IBM QS22 PowerXCell 8i, and NVIDIA GTX280. Overall, our auto-tuning optimization methodology results in the fastest multicore stencil performance to date. Finally, we present several key insights into the architectural trade-offs of emerging multicore designs and their implications on scientific algorithm development.

  17. Addressing the challenges of standalone multi-core simulations in molecular dynamics

    Science.gov (United States)

    Ocaya, R. O.; Terblans, J. J.

    2017-07-01

    Computational modelling in material science involves mathematical abstractions of force fields between particles with the aim to postulate, develop and understand materials by simulation. The aggregated pairwise interactions of the material's particles lead to a deduction of its macroscopic behaviours. For practically meaningful macroscopic scales, a large amount of data is generated, leading to vast execution times. Simulation times of hours, days or weeks for moderately sized problems are not uncommon. The reduction of simulation times, improved result accuracy and the associated software and hardware engineering challenges are the main motivations for much of the ongoing research in the computational sciences. This contribution is concerned mainly with simulations that can be done on a "standalone" computer based on Message Passing Interfaces (MPI), parallel code running on hardware platforms with wide specifications, such as single/multi-processor, multi-core machines with minimal reconfiguration for upward scaling of computational power. The widely available, documented and standardized MPI library provides this functionality through the MPI_Comm_size(), MPI_Comm_rank() and MPI_Reduce() functions. A survey of the literature shows that relatively little is written with respect to the efficient extraction of the inherent computational power in a cluster. In this work, we discuss the main avenues available to tap into this extra power without compromising computational accuracy. We also present methods to overcome the high inertia encountered in single-node-based computational molecular dynamics. We begin by surveying the current state of the art and discuss what it takes to achieve parallelism, efficiency and enhanced computational accuracy through program threads and message passing interfaces. Several code illustrations are given. The pros and cons of writing raw code as opposed to using heuristic, third-party code are also discussed.
The growing trend
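
    The size/rank/reduce pattern named above can be emulated with the standard library so that the sketch runs without an MPI installation (function names here are illustrative; with real MPI the same shape uses MPI_Comm_size, MPI_Comm_rank and MPI_Reduce): each "rank" sums a strided slice of the data, and the partial results are combined at a root.

```python
from concurrent.futures import ThreadPoolExecutor

def local_sum(data, rank, size):
    # Mirrors MPI_Comm_rank/MPI_Comm_size: each rank takes a strided slice.
    return sum(data[rank::size])

def reduce_sum(data, size=4):
    with ThreadPoolExecutor(max_workers=size) as pool:
        futures = [pool.submit(local_sum, data, r, size) for r in range(size)]
        # Plays the role of MPI_Reduce with MPI_SUM at rank 0.
        return sum(f.result() for f in futures)
```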

  18. Digital signal processor and processing method for GPS receivers

    Science.gov (United States)

    Thomas, Jr., Jess B. (Inventor)

    1989-01-01

    A digital signal processor and processing method therefor for use in receivers of the NAVSTAR/GLOBAL POSITIONING SYSTEM (GPS) employs a digital carrier down-converter, digital code correlator and digital tracking processor. The digital carrier down-converter and code correlator consists of an all-digital, minimum bit implementation that utilizes digital chip and phase advancers, providing exceptional control and accuracy in feedback phase and in feedback delay. Roundoff and commensurability errors can be reduced to extremely small values (e.g., less than 100 nanochips and 100 nanocycles roundoff errors and 0.1 millichip and 1 millicycle commensurability errors). The digital tracking processor bases the fast feedback for phase and for group delay in the C/A, P.sub.1, and P.sub.2 channels on the L.sub.1 C/A carrier phase thereby maintaining lock at lower signal-to-noise ratios, reducing errors in feedback delays, reducing the frequency of cycle slips and in some cases obviating the need for quadrature processing in the P channels. Simple and reliable methods are employed for data bit synchronization, data bit removal and cycle counting. Improved precision in averaged output delay values is provided by carrier-aided data-compression techniques. The signal processor employs purely digital operations in the sense that exactly the same carrier phase and group delay measurements are obtained, to the last decimal place, every time the same sampled data (i.e., exactly the same bits) are processed.
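
    The carrier down-conversion step can be sketched as follows (a simplified illustration of the principle only, not the patented minimum-bit implementation): the sampled signal is mixed with a numerically controlled oscillator and accumulated, and the phase of the accumulated sum estimates the carrier phase.

```python
import cmath
import math

def down_convert(samples, carrier_freq, sample_rate):
    # Mix each sample with the NCO and accumulate (low-pass by summation).
    acc = 0j
    for n, s in enumerate(samples):
        nco = cmath.exp(-2j * math.pi * carrier_freq * n / sample_rate)
        acc += s * nco
    return acc

def carrier_phase(samples, carrier_freq, sample_rate):
    # Phase of the accumulator estimates the carrier phase offset.
    return cmath.phase(down_convert(samples, carrier_freq, sample_rate))
```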

  19. Clinical value of core length in contemporary multicore prostate biopsy.

    Science.gov (United States)

    Lee, Sangchul; Jeong, Seong Jin; Hwang, Sung Il; Hong, Sung Kyu; Lee, Hak Jong; Byun, Seok Soo; Choe, Gheeyoung; Lee, Sang Eun

    2015-01-01

    There is little data about the clinical value of core length for prostate biopsy (PBx). We investigated the clinical values of various clinicopathological biopsy-related parameters, including core length, in the contemporary multi-core PBx. Medical records of 5,243 consecutive patients who received PBx at our institution were reviewed. Among them, 3,479 patients with prostate-specific antigen (PSA) ≤ 10 ng/ml level who received transrectal ultrasound (TRUS)-guided multi (≥ 12)-core PBx at our institution were analyzed for prostate cancer (PCa). Gleason score upgrading (GSU) was analyzed in 339 patients who were diagnosed with low-risk PCa and received radical prostatectomy. Multivariate logistic regression analyses for PCa detection and prediction of GSU were performed. The mean age and PSA of the entire cohort were 63.5 years and 5.4 ng/ml, respectively. The overall cancer detection rate was 28.5%. There was no statistical difference in core length between patients diagnosed with PCa and those without PCa (16.1 ± 1.8 vs 16.1 ± 1.9 mm, P = 0.945). The core length was also not significantly different (16.4 ± 1.7 vs 16.4 ± 1.6 mm, P = 0.889) between the GSU group and the non-GSU group. Multivariate logistic regression analyses demonstrated that the core length of PBx did not affect PCa detection in TRUS-guided multi-core PBx (P = 0.923) and was not prognostic for GSU in patients with low-risk PCa (P = 0.356). In patients undergoing contemporary multi-core PBx, core length may not have a significant impact on PCa detection or on GSU following radical prostatectomy in the low-risk PCa group.

  20. Clinical value of core length in contemporary multicore prostate biopsy.

    Directory of Open Access Journals (Sweden)

    Sangchul Lee

    Full Text Available There is little data about the clinical value of core length for prostate biopsy (PBx). We investigated the clinical values of various clinicopathological biopsy-related parameters, including core length, in contemporary multi-core PBx. Medical records of 5,243 consecutive patients who received PBx at our institution were reviewed. Among them, 3,479 patients with a prostate-specific antigen (PSA) level ≤ 10 ng/ml who received transrectal ultrasound (TRUS)-guided multi (≥ 12)-core PBx at our institution were analyzed for prostate cancer (PCa). Gleason score upgrading (GSU) was analyzed in 339 patients who were diagnosed with low-risk PCa and received radical prostatectomy. Multivariate logistic regression analyses for PCa detection and prediction of GSU were performed. The mean age and PSA of the entire cohort were 63.5 years and 5.4 ng/ml, respectively. The overall cancer detection rate was 28.5%. There was no statistical difference in core length between patients diagnosed with PCa and those without PCa (16.1 ± 1.8 vs 16.1 ± 1.9 mm, P = 0.945). The core length was also not significantly different (16.4 ± 1.7 vs 16.4 ± 1.6 mm, P = 0.889) between the GSU and non-GSU groups. Multivariate logistic regression analyses demonstrated that the core length of PBx did not affect PCa detection in TRUS-guided multi-core PBx (P = 0.923) and was not prognostic for GSU in patients with low-risk PCa (P = 0.356). In patients undergoing contemporary multi-core PBx, core length may not have a significant impact on PCa detection or on GSU following radical prostatectomy in the low-risk PCa group.

  1. Light dynamics in nonlinear trimers and twisted multicore fibers

    CERN Document Server

    Castro-Castro, Claudia; Srinivasan, Gowri; Aceves, Alejandro B; Kevrekidis, Panayotis G

    2016-01-01

    Novel photonic structures such as multi-core fibers and graphene based arrays present unique opportunities to manipulate and control the propagation of light. Here we discuss nonlinear dynamics for structures with a few (2 to 6) elements for which linear and nonlinear properties can be tuned. Specifically we show how nonlinearity, coupling, and parity-time PT symmetric gain/loss relate to existence, stability and in general, dynamical properties of nonlinear optical modes. The main emphasis of our presentation will be on systems with few degrees of freedom, most notably couplers, trimers and generalizations thereof to systems with 6 nodes.

  2. Coupling losses in perfluorinated multi-core polymer optical fibers.

    Science.gov (United States)

    Durana, Gaizka; Aldabaldetreku, Gotzon; Zubia, Joseba; Arrue, Jon; Tanaka, Chikafumi

    2008-05-26

    The aim of the present paper is to provide a comprehensive analysis of coupling losses in perfluorinated (PF) multi-core polymer optical fibers (MC-POFs), which consist of groups of 127 graded-index cores. In our analysis we take into account geometrical, longitudinal, transverse, and angular misalignments. We perform several experimental measurements and computer simulations in order to calculate the coupling losses for a PF MC-POF prototype. Based on these results, we propose several hints of practical interest to the manufacturer which would allow an appropriate connector design in order to handle conveniently the coupling losses incurred when connectorizing two PF MC-POFs.

  3. Communications systems and methods for subsea processors

    Science.gov (United States)

    Gutierrez, Jose; Pereira, Luis

    2016-04-26

    A subsea processor may be located near the seabed of a drilling site and used to coordinate operations of underwater drilling components. The subsea processor may be enclosed in a single interchangeable unit that fits a receptor on an underwater drilling component, such as a blow-out preventer (BOP). The subsea processor may issue commands to control the BOP and receive measurements from sensors located throughout the BOP. A shared communications bus may interconnect the subsea processor and underwater components and the subsea processor and a surface or onshore network. The shared communications bus may be operated according to a time division multiple access (TDMA) scheme.
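
    The TDMA scheme mentioned above can be sketched as a simple slot assignment: each bus member may transmit only during its own slot of the repeating frame, so transmissions never collide. The node names and frame length below are hypothetical; this is a minimal Python sketch, not the patented protocol.

```python
def build_schedule(nodes):
    """Assign each node one slot per TDMA frame, in listing order."""
    return {node: slot for slot, node in enumerate(nodes)}

def may_transmit(node, time, schedule):
    """A node may transmit only when the frame's current slot is its own."""
    return schedule[node] == time % len(schedule)

# Hypothetical members of the shared communications bus.
nodes = ["subsea-processor", "bop-sensors", "surface-link"]
schedule = build_schedule(nodes)

# At every instant exactly one node owns the slot, so transmissions cannot collide.
for t in range(6):
    active = [n for n in nodes if may_transmit(n, t, schedule)]
    assert active == [nodes[t % 3]]
```
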

  4. Invasive tightly coupled processor arrays

    CERN Document Server

    LARI, VAHID

    2016-01-01

    This book introduces new massively parallel computer (MPSoC) architectures called invasive tightly coupled processor arrays (TCPAs). It proposes strategies, architecture designs, and programming interfaces for invasive TCPAs that allow invading and subsequently executing loop programs with strict requirements or guarantees of non-functional execution qualities such as performance, power consumption, and reliability. For the first time, such a configurable processor array architecture consisting of locally interconnected VLIW processing elements can be claimed by programs, either in full or in part, using the principle of invasive computing. Invasive TCPAs provide unprecedented energy efficiency for the parallel execution of nested loop programs by avoiding the global memory accesses required on GPUs, and may even support loops with complex dependencies, such as loop-carried dependencies, that are not amenable to parallel execution on GPUs. For this purpose, the book proposes different invasion strategies for claiming a desire...

  5. Application of SystemC modeling in multi-core processor design

    Institute of Scientific and Technical Information of China (English)

    许汉荆; 陈杰; 刘建; 敖天勇; 奚杰

    2009-01-01

    "Costar IV" ("同芯Ⅳ") is a heterogeneous multi-core processor developed by the CoMMSOC Lab of the Institute of Microelectronics, Chinese Academy of Sciences (IMECAS). This paper describes the successful application of the Electronic System Level (ESL) design methodology to the SOC design of this multi-core processor: the key MIPS processor unit was modeled with SystemC, and hardware/software co-design and verification were carried out with tools such as Visual Studio and ModelSim. Practice has shown that SystemC modeling effectively increases the parallelism of development, shortens the development cycle, provides detailed reference data for verification and performance optimization, and simplifies debugging.

  6. An Experimental Digital Image Processor

    Science.gov (United States)

    Cok, Ronald S.

    1986-12-01

    A prototype digital image processor for enhancing photographic images has been built in the Research Laboratories at Kodak. This image processor implements a particular version of each of the following algorithms: photographic grain and noise removal, edge sharpening, multidimensional image-segmentation, image-tone reproduction adjustment, and image-color saturation adjustment. All processing, except for segmentation and analysis, is performed by massively parallel and pipelined special-purpose hardware. This hardware runs at 10 MHz and can be adjusted to handle any size digital image. The segmentation circuits run at 30 MHz. The segmentation data are used by three single-board computers for calculating the tonescale adjustment curves. The system, as a whole, has the capability of completely processing 10 million three-color pixels per second. The grain removal and edge enhancement algorithms represent the largest part of the pipelined hardware, operating at over 8 billion integer operations per second. The edge enhancement is performed by unsharp masking, and the grain removal is done using a collapsed Walsh-Hadamard transform filtering technique (U.S. Patent No. 4549212). These two algorithms can be realized using four basic processing elements, some of which have been implemented as VLSI semicustom integrated circuits. These circuits implement the algorithms with a high degree of efficiency, modularity, and testability. The digital processor is controlled by a Digital Equipment Corporation (DEC) PDP 11 minicomputer and can be interfaced to electronic printing and/or electronic scanning devices. The processor has been used to process over a thousand diagnostic images.
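
    The unsharp-masking step named in the record follows out = in + gain * (in - blur(in)): the low-pass residual is added back to steepen edges. A minimal 1-D Python sketch; the 3-tap box blur and the gain value are invented for illustration, not the Kodak hardware's filter.

```python
def box_blur(signal):
    """3-tap box blur with edge clamping -- a stand-in for the low-pass stage."""
    n = len(signal)
    return [(signal[max(i - 1, 0)] + signal[i] + signal[min(i + 1, n - 1)]) / 3
            for i in range(n)]

def unsharp_mask(signal, gain=1.0):
    """Sharpen by adding back the high-frequency residual: out = in + gain*(in - blur(in))."""
    return [s + gain * (s - b) for s, b in zip(signal, box_blur(signal))]

edge = [0, 0, 0, 10, 10, 10]   # an ideal edge
sharpened = unsharp_mask(edge)
# The edge is steepened, with the characteristic under/overshoot on either side.
print([round(v, 2) for v in sharpened])  # -> [0.0, 0.0, -3.33, 13.33, 10.0, 10.0]
```
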

  7. Parallel Computation on Multicore Processors Using Explicit Form of the Finite Element Method and C++ Standard Libraries

    Directory of Open Access Journals (Sweden)

    Rek Václav

    2016-11-01

    Full Text Available In this paper, modifications of existing sequential code written in the C or C++ programming language for the calculation of various kinds of structures using the explicit form of the Finite Element Method (Dynamic Relaxation Method, Explicit Dynamics) in the NEXX system are introduced. The NEXX system is the core of the engineering software NEXIS, Scia Engineer, RFEM and RENEX. It supports multithreaded execution, which can now be implemented at the level of the native C++ programming language using standard libraries. Thanks to the high degree of abstraction that the contemporary C++ programming language provides, a library created in this way can be generalized for other uses of parallelism in computational mechanics.

  8. OpenCL Performance Evaluation on Modern Multicore CPUs

    Directory of Open Access Journals (Sweden)

    Joo Hwan Lee

    2015-01-01

    Full Text Available Utilizing heterogeneous platforms for computation has become a general trend, making the portability issue important. OpenCL (Open Computing Language serves this purpose by enabling portable execution on heterogeneous architectures. However, unpredictable performance variation on different platforms has become a burden for programmers who write OpenCL applications. This is especially true for conventional multicore CPUs, since the performance of general OpenCL applications on CPUs lags behind the performance of their counterparts written in the conventional parallel programming model for CPUs. In this paper, we evaluate the performance of OpenCL applications on out-of-order multicore CPUs from the architectural perspective. We evaluate OpenCL applications on various aspects, including API overhead, scheduling overhead, instruction-level parallelism, address space, data location, data locality, and vectorization, comparing OpenCL to conventional parallel programming models for CPUs. Our evaluation indicates unique performance characteristics of OpenCL applications and also provides insight into the optimization metrics for better performance on CPUs.

  9. Neural networks within multi-core optic fibers.

    Science.gov (United States)

    Cohen, Eyal; Malka, Dror; Shemer, Amir; Shahmoon, Asaf; Zalevsky, Zeev; London, Michael

    2016-07-07

    Hardware implementation of artificial neural networks facilitates real-time parallel processing of massive data sets. Optical neural networks offer low-volume 3D connectivity together with large bandwidth and minimal heat production in contrast to electronic implementation. Here, we present a conceptual design for in-fiber optical neural networks. Neurons and synapses are realized as individual silica cores in a multi-core fiber. Optical signals are transferred transversely between cores by means of optical coupling. Pump driven amplification in erbium-doped cores mimics synaptic interactions. We simulated three-layered feed-forward neural networks and explored their capabilities. Simulations suggest that networks can differentiate between given inputs depending on specific configurations of amplification; this implies classification and learning capabilities. Finally, we tested experimentally our basic neuronal elements using fibers, couplers, and amplifiers, and demonstrated that this configuration implements a neuron-like function. Therefore, devices similar to our proposed multi-core fiber could potentially serve as building blocks for future large-scale small-volume optical artificial neural networks.
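
    The simulated feed-forward behavior can be sketched by treating pump-driven amplification as multiplicative gains feeding a saturating response. This is a hedged Python sketch only: the gain configurations are invented, and tanh stands in for the physical response of an erbium-doped core; the point is that the same input yields different outputs under different amplification patterns, which is the classification capability the authors describe.

```python
import math

def layer(inputs, gains):
    """Each output core sums amplified contributions from all input cores,
    then passes through a saturating response (here tanh)."""
    return [math.tanh(sum(g * s for g, s in zip(row, inputs))) for row in gains]

def feed_forward(inputs, gains_per_layer):
    """Propagate signals through successive layers of coupled, amplifying cores."""
    for gains in gains_per_layer:
        inputs = layer(inputs, gains)
    return inputs

# Two hypothetical pump/gain configurations; the same input yields outputs of
# opposite sign, i.e. the amplification pattern determines the classification.
config_a = [[[2.0, 0.0], [0.0, 2.0]], [[3.0, -3.0]]]
config_b = [[[0.0, 2.0], [2.0, 0.0]], [[3.0, -3.0]]]
x = [1.0, 0.0]
print(feed_forward(x, config_a)[0], feed_forward(x, config_b)[0])
```
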

  10. Neural networks within multi-core optic fibers

    Science.gov (United States)

    Cohen, Eyal; Malka, Dror; Shemer, Amir; Shahmoon, Asaf; Zalevsky, Zeev; London, Michael

    2016-07-01

    Hardware implementation of artificial neural networks facilitates real-time parallel processing of massive data sets. Optical neural networks offer low-volume 3D connectivity together with large bandwidth and minimal heat production in contrast to electronic implementation. Here, we present a conceptual design for in-fiber optical neural networks. Neurons and synapses are realized as individual silica cores in a multi-core fiber. Optical signals are transferred transversely between cores by means of optical coupling. Pump driven amplification in erbium-doped cores mimics synaptic interactions. We simulated three-layered feed-forward neural networks and explored their capabilities. Simulations suggest that networks can differentiate between given inputs depending on specific configurations of amplification; this implies classification and learning capabilities. Finally, we tested experimentally our basic neuronal elements using fibers, couplers, and amplifiers, and demonstrated that this configuration implements a neuron-like function. Therefore, devices similar to our proposed multi-core fiber could potentially serve as building blocks for future large-scale small-volume optical artificial neural networks.

  11. ASIC Design of Floating-Point FFT Processor

    Institute of Scientific and Technical Information of China (English)

    陈禾; 赵忠武

    2004-01-01

    An application specific integrated circuit (ASIC) design of a 1024-point floating-point fast Fourier transform (FFT) processor is presented. It satisfies the requirement for high-accuracy FFT results in related fields. Several novel design techniques for the floating-point adder and multiplier are introduced in detail to enhance the speed of the system while decreasing power consumption. The hardware area is effectively reduced as an improved butterfly processor is developed. There is a substantial increase in the performance of the design since a pipelined architecture is adopted, and very large scale integration (VLSI) implementation is easy to realize due to the design's regularity. A validation result using a field programmable gate array (FPGA) is shown at the end. When the system clock is set to 50 MHz, 204.8 μs is needed to complete the FFT computation.
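
    The butterfly at the core of such an FFT processor combines each pair of half-size results with a twiddle factor. The fixed-point/floating-point hardware details are not modeled here; this is a minimal radix-2 decimation-in-time sketch in Python, cross-checked against a direct DFT.

```python
import cmath

def fft(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle           # butterfly upper output
        out[k + n // 2] = even[k] - twiddle  # butterfly lower output
    return out

def dft(x):
    """Direct O(n^2) DFT for cross-checking the fast algorithm."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

x = [1, 2, 3, 4, 0, 0, 0, 0]
assert all(abs(a - b) < 1e-9 for a, b in zip(fft(x), dft(x)))
```

    The 1024-point hardware unrolls exactly these butterflies into a pipeline instead of recursing.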

  12. Parallel information transfer in a multinode quantum information processor.

    Science.gov (United States)

    Borneman, T W; Granade, C E; Cory, D G

    2012-04-06

    We describe a method for coupling disjoint quantum bits (qubits) in different local processing nodes of a distributed node quantum information processor. An effective channel for information transfer between nodes is obtained by moving the system into an interaction frame where all pairs of cross-node qubits are effectively coupled via an exchange interaction between actuator elements of each node. All control is achieved via actuator-only modulation, leading to fast implementations of a universal set of internode quantum gates. The method is expected to be nearly independent of actuator decoherence and may be made insensitive to experimental variations of system parameters by appropriate design of control sequences. We show, in particular, how the induced cross-node coupling channel may be used to swap the complete quantum states of the local processors in parallel.

  13. Parallel computation of a dam-break flow model using OpenMP on a multi-core computer

    Science.gov (United States)

    Zhang, Shanghong; Xia, Zhongxi; Yuan, Rui; Jiang, Xiaoming

    2014-05-01

    High-performance calculations are of great importance to the simulation of dam-break events, as discontinuous solutions and computational speed are key factors in dam-break flow modeling. In this study, Roe's approximate Riemann solution of the finite volume method is adopted to solve the interface flux of grid cells and accurately simulate the discontinuous flow, and shared memory technology (OpenMP) is used to realize parallel computing. Because an explicit discrete technique is used to solve the governing equations, and there is no correlation between grid calculations within a single time step, the parallel dam-break model can be easily realized by adding OpenMP directives to the loop structure of the grid calculations. The performance of the model is analyzed using six computing cores and four different grid division schemes for the Pangtoupao flood storage area in China. The results show that the parallel computing improves precision and increases the simulation speed of the dam-break flow; the simulation of a 320 h flood process can be completed within 1.6 h on a 16-core computer, achieving a speedup factor of 8.64×. Further analysis reveals that the models involving a larger number of calculations exhibit greater efficiency and a higher rate of acceleration. At the same time, the model has good extendibility, as the speedup increases with the number of processor cores. The parallel model based on OpenMP can make full use of multi-core processors, making it possible to simulate dam-break flows in large-scale watersheds on a single computer.
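
    The key property exploited above is that an explicit scheme has no dependence between cell updates within a time step, so the per-cell loop parallelizes directly. A toy sketch of that independence: the parallel and serial loops give identical results. Python threads stand in for OpenMP here, and the diffusion-like update rule is invented for illustration (it is not the Roe flux).

```python
from concurrent.futures import ThreadPoolExecutor

def step_cell(i, h):
    """Update of one cell from the *current* state only; cells within a time step
    are independent, which is what makes the loop parallelizable."""
    left, right = h[max(i - 1, 0)], h[min(i + 1, len(h) - 1)]
    return h[i] + 0.25 * (left - 2.0 * h[i] + right)

def step_serial(h):
    return [step_cell(i, h) for i in range(len(h))]

def step_parallel(h, workers=4):
    """Same update distributed over a thread pool, one task per cell."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda i: step_cell(i, h), range(len(h))))

h = [0.0] * 8 + [1.0] * 8   # toy initial water-depth step, as after a dam break
assert step_parallel(h) == step_serial(h)   # same result for any thread count
```

    In the paper's C/Fortran setting, the same structure is obtained by placing an OpenMP worksharing directive on the cell loop.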

  14. A large effective area multi-core fiber with an optimized cladding thickness

    OpenAIRE

    2011-01-01

    The cladding thickness of trench-assisted multi-core fibers was theoretically and experimentally investigated in terms of excess losses of outer cores. No significant micro-bending loss increase was observed on multi-core fibers with the cladding thickness of about 30 μm. The tolerance for the micro-bending loss of a multi-core fiber is larger than that of the single core fiber. However, the cladding thickness will be limited by the occurrence of the excess loss on outer cores. The reduction ...

  15. A Methodology for Optimizing Multithreaded System Scalability on Multi-cores

    CERN Document Server

    Gunther, Neil J; Parvu, Stefan

    2011-01-01

    We show how to quantify scalability with the Universal Scalability Law (USL) by applying it to performance measurements of memcached, J2EE, and Weblogic on multi-core platforms. Since commercial multicores are essentially black-boxes, the accessible performance gains are primarily available at the application level. We also demonstrate how our methodology can identify the most significant performance tuning opportunities to optimize application scalability, as well as providing an easy means for exploring other aspects of the multi-core system design space.
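
    The USL itself is a one-line formula: C(N) = N / (1 + σ(N−1) + κN(N−1)), where σ captures contention and κ coherency delay. A sketch with hypothetical coefficients (as might be fitted to throughput measurements), showing the capacity peak the methodology looks for:

```python
def usl_capacity(n, sigma, kappa):
    """Universal Scalability Law: relative capacity
    C(N) = N / (1 + sigma*(N-1) + kappa*N*(N-1)),
    where sigma models contention and kappa models coherency delay."""
    return n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

# Hypothetical coefficients, standing in for values fitted to measurements.
sigma, kappa = 0.05, 0.002

# Beyond some concurrency the coherency term dominates and capacity declines.
best_n = max(range(1, 65), key=lambda n: usl_capacity(n, sigma, kappa))
print(best_n)  # -> 22, near the analytic optimum sqrt((1 - sigma) / kappa)
```

    Fitting σ and κ to measured throughput, then reading off this peak, is what identifies the most significant tuning opportunities.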

  16. On Designing Multicore-Aware Simulators for Systems Biology Endowed with OnLine Statistics

    Directory of Open Access Journals (Sweden)

    Marco Aldinucci

    2014-01-01

    Full Text Available This paper discusses enabling methodologies for the design of a fully parallel, online, interactive tool aimed at supporting bioinformatics scientists. In particular, the features of these methodologies, supported by the FastFlow parallel programming framework, are shown on a simulation tool that performs the modeling, tuning, and sensitivity analysis of stochastic biological models. A stochastic simulation needs thousands of independent simulation trajectories, which turn into big data that should be analysed by statistical and data mining tools. In the considered approach the two stages are pipelined in such a way that the simulation stage streams out the partial results of all simulation trajectories to the analysis stage, which immediately produces a partial result. The simulation-analysis workflow is validated for performance and effectiveness of the online analysis in capturing biological systems behavior on a multicore platform and representative proof-of-concept biological systems. The exploited methodologies include pattern-based parallel programming and data streaming, which provide key features to software designers such as performance portability and efficient in-memory (big) data management and movement. Two paradigmatic classes of biological systems exhibiting multistable and oscillatory behavior are used as a testbed.
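
    The pipelined simulation-to-analysis idea can be sketched with a generator feeding an online statistic, so each trajectory's result is folded into the analysis as it arrives rather than stored. Python generators and Welford's online algorithm stand in for the FastFlow stages here; the Gaussian observable is invented for illustration.

```python
import random

def simulate(n_trajectories, seed=0):
    """Stage 1: stream one summary value per stochastic trajectory as it completes."""
    rng = random.Random(seed)
    for _ in range(n_trajectories):
        yield rng.gauss(10.0, 2.0)   # stand-in for one trajectory's observable

def online_stats(stream):
    """Stage 2: Welford's online algorithm -- mean/variance updated per arrival,
    so no trajectory needs to be kept in memory."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    return mean, m2 / (count - 1)

mean, var = online_stats(simulate(10_000))
print(round(mean, 1), round(var, 1))   # close to the true values 10.0 and 4.0
```

    In FastFlow the two stages would run concurrently on different cores; the streaming structure is the same.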

  17. On designing multicore-aware simulators for systems biology endowed with OnLine statistics.

    Science.gov (United States)

    Aldinucci, Marco; Calcagno, Cristina; Coppo, Mario; Damiani, Ferruccio; Drocco, Maurizio; Sciacca, Eva; Spinella, Salvatore; Torquati, Massimo; Troina, Angelo

    2014-01-01

    This paper discusses enabling methodologies for the design of a fully parallel, online, interactive tool aimed at supporting bioinformatics scientists. In particular, the features of these methodologies, supported by the FastFlow parallel programming framework, are shown on a simulation tool that performs the modeling, tuning, and sensitivity analysis of stochastic biological models. A stochastic simulation needs thousands of independent simulation trajectories, which turn into big data that should be analysed by statistical and data mining tools. In the considered approach the two stages are pipelined in such a way that the simulation stage streams out the partial results of all simulation trajectories to the analysis stage, which immediately produces a partial result. The simulation-analysis workflow is validated for performance and effectiveness of the online analysis in capturing biological systems behavior on a multicore platform and representative proof-of-concept biological systems. The exploited methodologies include pattern-based parallel programming and data streaming, which provide key features to software designers such as performance portability and efficient in-memory (big) data management and movement. Two paradigmatic classes of biological systems exhibiting multistable and oscillatory behavior are used as a testbed.

  18. Fast semivariogram computation using FPGA architectures

    Science.gov (United States)

    Lagadapati, Yamuna; Shirvaikar, Mukul; Dong, Xuanliang

    2015-02-01

    Computational speedup is measured with respect to a Matlab implementation on a personal computer with an Intel i7 multi-core processor. Preliminary simulation results indicate that a significant speed advantage can be attained by the FPGA architectures, making the algorithm viable for implementation in medical devices.
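
    The empirical semivariogram such architectures accelerate is γ(h) = (1 / 2N(h)) Σ (z_i − z_{i+h})², averaged over the N(h) sample pairs at lag h. A reference Python sketch on an invented 1-D transect (the FPGA pipelining itself is not modeled):

```python
def semivariance(z, lag):
    """Empirical semivariogram at integer lag h:
    gamma(h) = (1 / (2 * N(h))) * sum of (z[i] - z[i+h])**2 over the N(h) pairs."""
    diffs = [(z[i] - z[i + lag]) ** 2 for i in range(len(z) - lag)]
    return sum(diffs) / (2 * len(diffs))

# Hypothetical 1-D transect of pixel values; nearby samples are more alike,
# so semivariance grows with lag.
z = [2.0, 2.1, 2.4, 3.0, 3.1, 3.6, 4.0, 4.2]
for h in (1, 2, 3):
    print(h, round(semivariance(z, h), 3))
```

    The pair differences at a given lag are mutually independent, which is exactly what makes the computation amenable to the parallel FPGA pipelines the paper describes.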

  19. Functional Verification of Enhanced RISC Processor

    OpenAIRE

    SHANKER NILANGI; SOWMYA L

    2013-01-01

    This paper presents the design and verification of a 32-bit enhanced RISC processor core with floating-point computations integrated within the core, designed to reduce cost and complexity. The 3-stage pipelined 32-bit RISC processor is based on the ARM7 processor architecture, with a single-precision floating-point multiplier and a floating-point adder/subtractor for floating-point operations, and a 32 x 32 Booth multiplier added to the integer core of the ARM7. The binary representati...

  20. Digital Signal Processor For GPS Receivers

    Science.gov (United States)

    Thomas, J. B.; Meehan, T. K.; Srinivasan, J. M.

    1989-01-01

    Three innovative components combined to produce all-digital signal processor with superior characteristics: outstanding accuracy, high-dynamics tracking, versatile integration times, lower loss-of-lock signal strengths, and infrequent cycle slips. Three components are digital chip advancer, digital carrier downconverter and code correlator, and digital tracking processor. All-digital signal processor intended for use in receivers of Global Positioning System (GPS) for geodesy, geodynamics, high-dynamics tracking, and ionospheric calibration.

  1. Design Principles for Synthesizable Processor Cores

    DEFF Research Database (Denmark)

    Schleuniger, Pascal; McKee, Sally A.; Karlsson, Sven

    2012-01-01

    As FPGAs get more competitive, synthesizable processor cores become an attractive choice for embedded computing. Currently popular commercial processor cores do not fully exploit current FPGA architectures. In this paper, we propose general design principles to increase instruction throughput... We show through the use of micro-benchmarks that our principles guide the design of a processor core that improves performance by an average of 38% over a similar Xilinx MicroBlaze configuration.

  2. Multicore: Fallout From a Computing Evolution (LBNL Summer Lecture Series)

    Energy Technology Data Exchange (ETDEWEB)

    Yelick, Kathy [Director, NERSC]

    2008-07-22

    Summer Lecture Series 2008: Parallel computing used to be reserved for big science and engineering projects, but in two years that's all changed. Even laptops and hand-helds use parallel processors. Unfortunately, the software hasn't kept pace. Kathy Yelick, Director of the National Energy Research Scientific Computing Center at Berkeley Lab, describes the resulting chaos and the computing community's efforts to develop exciting applications that take advantage of tens or hundreds of processors on a single chip.

  3. The case for a generic implant processor.

    Science.gov (United States)

    Strydis, Christos; Gaydadjiev, Georgi N

    2008-01-01

    A more structured and streamlined design of implants is nowadays possible. In this paper we focus on implant processors located at the heart of implantable systems. We present a real and representative biomedical-application scenario where such a new processor can be employed. Based on a suitably selected processor simulator, various operational aspects of the application are monitored. Findings on performance, cache behavior, branch prediction, power consumption, energy expenditure and instruction mixes are presented and analyzed. The suitability of such an implant processor is assessed and directions for future work are given.

  4. Compilation Techniques Specific for a Hardware Cryptography-Embedded Multimedia Mobile Processor

    Directory of Open Access Journals (Sweden)

    Masa-aki FUKASE

    2007-12-01

    Full Text Available The development of single chip VLSI processors is the key technology of ever growing pervasive computing to answer overall demands for usability, mobility, speed, security, etc. We have so far developed a hardware cryptography-embedded multimedia mobile processor architecture, HCgorilla. Since HCgorilla integrates a wide range of techniques from architectures to applications and languages, a one-sided design approach is not always useful. HCgorilla needs a more complicated strategy, that is, hardware/software (H/S) codesign. Thus, we exploit the software support of HCgorilla, composed of a Java interface and parallelizing compilers. They are assumed to be installed in servers in order to reduce the load and increase the performance of HCgorilla-embedded clients. Since compilers are the essence of software's responsibility, we focus in this article on our recent results on the design, specifications, and prototyping of parallelizing compilers for HCgorilla. The parallelizing compilers are composed of a multicore compiler and a LIW compiler. They are specified to abstract parallelism from executable serial codes or the Java interface output, and to output codes executable in parallel by HCgorilla. The prototype compilers are written in Java. An evaluation using an arithmetic test program shows the reasonableness of the prototype compilers compared with hand compilation.

  6. Development and validation of a two-dimensional fast-response flood estimation model

    Energy Technology Data Exchange (ETDEWEB)

    Judi, David R [Los Alamos National Laboratory]; Mcpherson, Timothy N [Los Alamos National Laboratory]; Burian, Steven J [UNIV OF UTAH]

    2009-01-01

    A finite difference formulation of the shallow water equations using an upwind differencing method was developed, maintaining computational efficiency and accuracy such that it can be used as a fast-response flood estimation tool. The model was validated using both laboratory controlled experiments and an actual dam breach. Through the laboratory experiments, the model was shown to give good estimations of depth and velocity when compared to the measured data, as well as when compared to a more complex two-dimensional model. Additionally, the model was compared to high water mark data obtained from the failure of the Taum Sauk dam. The simulated inundation extent agreed well with the observed extent, with the most notable differences resulting from the inability to model sediment transport. The results of these validation studies show that a relatively simple numerical scheme used to solve the complete shallow water equations can be used to accurately estimate flood inundation. Future work will focus on further reducing the computation time needed to provide flood inundation estimates for fast-response analyses. This will be accomplished through the efficient use of multi-core, multi-processor computers coupled with an efficient domain-tracking algorithm, as well as an understanding of the impacts of grid resolution on model results.
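
    The upwind-differencing idea can be illustrated on the simplest hyperbolic model, 1-D linear advection, rather than the full shallow water system: each cell takes its information from the upstream side only. The Courant number and step profile below are invented for the demo.

```python
def upwind_step(u, speed, dx, dt):
    """One explicit first-order upwind step for the advection equation
    u_t + speed * u_x = 0 (speed > 0); stable for Courant number c <= 1."""
    c = speed * dt / dx
    return [u[0]] + [u[i] - c * (u[i] - u[i - 1]) for i in range(1, len(u))]

u = [1.0] * 5 + [0.0] * 10   # step profile, like an advancing flood front
for _ in range(5):
    u = upwind_step(u, speed=1.0, dx=1.0, dt=1.0)
print(u)  # with c = 1 the scheme shifts the front exactly 5 cells downstream
```

    Taking the difference from the upstream neighbor is what lets the scheme propagate a discontinuous front without spurious oscillations, the same property needed for flood estimation.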

  7. Alternative Water Processor Test Development

    Science.gov (United States)

    Pickering, Karen D.; Mitchell, Julie; Vega, Leticia; Adam, Niklas; Flynn, Michael; Wheeler, Ray; Lunn, Griffin; Jackson, Andrew

    2012-01-01

    The Next Generation Life Support Project is developing an Alternative Water Processor (AWP) as a candidate water recovery system for long duration exploration missions. The AWP consists of a biological water processor (BWP) integrated with a forward osmosis secondary treatment system (FOST). The basis of the BWP is a membrane aerated biological reactor (MABR), developed in concert with Texas Tech University. Bacteria located within the MABR metabolize organic material in wastewater, converting approximately 90% of the total organic carbon to carbon dioxide. In addition, bacteria convert a portion of the ammonia-nitrogen present in the wastewater to nitrogen gas, through a combination of nitrification and denitrification. The effluent from the BWP system is low in organic contaminants, but high in total dissolved solids. The FOST system, integrated downstream of the BWP, removes dissolved solids through a combination of concentration-driven forward osmosis and pressure-driven reverse osmosis. The integrated system is expected to produce water with a total organic carbon less than 50 mg/l and dissolved solids that meet potable water requirements for spaceflight. This paper describes the test definition, the design of the BWP and FOST subsystems, and plans for integrated testing.
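
    The stated numbers imply a simple mass balance: with roughly 90% of total organic carbon converted to carbon dioxide in the BWP, effluent TOC is about one tenth of influent TOC. A small sketch of that check; the influent value is hypothetical.

```python
def bwp_effluent_toc(influent_toc, conversion=0.90):
    """TOC (mg/l) remaining after the BWP converts `conversion` of the
    organic carbon to carbon dioxide."""
    return influent_toc * (1.0 - conversion)

influent = 400.0   # hypothetical wastewater TOC, mg/l
effluent = bwp_effluent_toc(influent)
print(round(effluent, 1), effluent < 50.0)  # ~40 mg/l, under the 50 mg/l target
```
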

  8. Alternative Water Processor Test Development

    Science.gov (United States)

    Pickering, Karen D.; Mitchell, Julie L.; Adam, Niklas M.; Barta, Daniel; Meyer, Caitlin E.; Pensinger, Stuart; Vega, Leticia M.; Callahan, Michael R.; Flynn, Michael; Wheeler, Ray; hide

    2013-01-01

    The Next Generation Life Support Project is developing an Alternative Water Processor (AWP) as a candidate water recovery system for long duration exploration missions. The AWP consists of a biological water processor (BWP) integrated with a forward osmosis secondary treatment system (FOST). The basis of the BWP is a membrane aerated biological reactor (MABR), developed in concert with Texas Tech University. Bacteria located within the MABR metabolize organic material in wastewater, converting approximately 90% of the total organic carbon to carbon dioxide. In addition, bacteria convert a portion of the ammonia-nitrogen present in the wastewater to nitrogen gas, through a combination of nitrification and denitrification. The effluent from the BWP system is low in organic contaminants, but high in total dissolved solids. The FOST system, integrated downstream of the BWP, removes dissolved solids through a combination of concentration-driven forward osmosis and pressure-driven reverse osmosis. The integrated system is expected to produce water with a total organic carbon less than 50 mg/l and dissolved solids that meet potable water requirements for spaceflight. This paper describes the test definition, the design of the BWP and FOST subsystems, and plans for integrated testing.

  9. Automata-based Optimization of Interaction Protocols for Scalable Multicore Platforms (Technical Report)

    NARCIS (Netherlands)

    Jongmans, S.-S.T.Q.; Halle, S.; Arbab, F.

    2014-01-01

    Multicore platforms offer the opportunity for utilizing massively parallel resources. However, programming them is challenging. We need good compilers that optimize commonly occurring synchronization/interaction patterns. To facilitate optimization, a programming language must convey what needs to be…

  10. A large effective area multi-core fiber with an optimized cladding thickness.

    Science.gov (United States)

    Takenaga, Katsuhiro; Arakawa, Yoko; Sasaki, Yusuke; Tanigawa, Shoji; Matsuo, Shoichiro; Saitoh, Kunimasa; Koshiba, Masanori

    2011-12-12

    The cladding thickness of trench-assisted multi-core fibers was theoretically and experimentally investigated in terms of excess losses of the outer cores. No significant micro-bending loss increase was observed on multi-core fibers with a cladding thickness of about 30 µm. The tolerance for micro-bending loss of a multi-core fiber is larger than that of a single-core fiber. However, the cladding thickness will be limited by the occurrence of excess loss on the outer cores. The reduction of cladding thickness is probably limited to around 40 µm in terms of the excess loss. A multi-core fiber with an effective area of 110 µm² at 1.55 µm and a 181-µm cladding diameter was realized without any excess loss.

  11. Fast Low Power ADC with Integrated Digital Data Processor Project

    Data.gov (United States)

    National Aeronautics and Space Administration — Innovative data measurement/acquisition systems are needed to support future Earth System Science measurements of the Earth's atmosphere and surface. An adequate...

  12. Efficient Thread Mapping for Heterogeneous Multicore IoT Systems

    Directory of Open Access Journals (Sweden)

    Thomas Mezmur Birhanu

    2017-01-01

    Full Text Available This paper proposes a thread scheduling mechanism primed for heterogeneously configured multicore systems. Our approach considers CPU utilization when mapping running threads to the core that can deliver the needed capacity. The paper also introduces a mapping algorithm that maps threads to cores in O(N log M) time, where N is the number of cores and M is the number of core types. We also introduce a method of profiling heterogeneous architectures based on the discrepancy between the performances of individual cores. Our heterogeneity-aware scheduler sped up processing by 52.62% and saved 2.22% of power compared to the CFS scheduler, the default in Linux systems.
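The abstract does not give the mapping algorithm itself; the sketch below shows one plausible shape, where a binary search over the M sorted core types supplies the log M factor in the stated complexity. The greedy rule, names, and capacity values are all hypothetical illustrations, not the paper's algorithm.

```python
import bisect

def map_threads(threads, core_types):
    """Map each thread (CPU demand in [0, 1]) to the least-capable core
    type that can still satisfy it. The per-thread binary search over
    the sorted core types costs O(log M)."""
    caps = sorted(core_types)                      # relative core capacities
    mapping = {}
    for tid, demand in threads.items():
        i = bisect.bisect_left(caps, demand)       # first type with cap >= demand
        mapping[tid] = caps[min(i, len(caps) - 1)] # fall back to the biggest core
    return mapping

mapping = map_threads({"t1": 0.2, "t2": 0.7, "t3": 0.95},
                      core_types=[0.25, 0.5, 1.0])
```

Here the light thread lands on the small core type while the heavier two are assigned to the big cores, avoiding over-provisioning.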

  13. I/O Strategies for Multicore Processing in ATLAS

    CERN Document Server

    van Gemmeren, P; The ATLAS collaboration; Calafiura, P; Lavrijsen, W; Malon, D; Tsulaia, V

    2012-01-01

    A critical component of any multicore/manycore application architecture is the handling of input and output. Even in the simplest of models, design decisions interact both in obvious and in subtle ways with persistence strategies. When multiple workers handle I/O independently using distinct instances of a serial I/O framework, for example, it may happen that because of the way data from consecutive events are compressed together, there may be serious inefficiencies, with workers redundantly reading the same buffers, or multiple instances thereof. With shared reader strategies, caching and buffer management by the persistence infrastructure and by the control framework may have decisive performance implications for a variety of design choices. Providing the next event may seem straightforward when all event data are contiguously stored in a block, but there may be performance penalties to such strategies when only a subset of a given event's data are needed; conversely, when event data are partitioned by type...

  14. Comparing and Optimising Parallel Haskell Implementations for Multicore Machines

    DEFF Research Database (Denmark)

    Berthold, Jost; Marlow, Simon; Hammond, Kevin

    2009-01-01

    The GpH implementation investigated here uses a physically-shared heap, which should be well-suited to multicore architectures. In contrast, the Eden implementation adopts an approach that has been designed for use on distributed-memory parallel machines: a system of multiple, independent heaps (one per core), with inter-core communication handled by message-passing rather than through shared heap cells. We report two main results. Firstly, we report on the effect of a number of optimisations that we applied to the shared-memory GpH implementation in order to address some performance issues that were revealed by our testing: for example, we implemented a work-stealing approach to task allocation. Our optimisations improved the performance of the shared-heap GpH implementation by as much as 30% on eight cores. Secondly, the shared heap approach is, rather surprisingly, not superior to a distributed heap…
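The work-stealing optimisation mentioned in the abstract can be sketched outside the GpH runtime. The toy scheduler below is a generic illustration (hypothetical class and task set), not the GHC implementation: each worker pops tasks from its own deque and steals from the back of a victim's deque when it runs dry.

```python
from collections import deque
import random

class WorkStealingScheduler:
    """Toy work-stealing scheduler: each worker owns a deque, pops from
    the front locally, and steals from the back of a random victim."""
    def __init__(self, n_workers, tasks):
        self.deques = [deque() for _ in range(n_workers)]
        for i, t in enumerate(tasks):            # round-robin initial spread
            self.deques[i % n_workers].append(t)

    def next_task(self, worker):
        if self.deques[worker]:
            return self.deques[worker].popleft()  # local pop (owner end)
        victims = [i for i, d in enumerate(self.deques) if i != worker and d]
        if victims:
            return self.deques[random.choice(victims)].pop()  # steal (thief end)
        return None

sched = WorkStealingScheduler(4, range(10))
done = []
while (t := sched.next_task(0)) is not None:     # worker 0 drains everything
    done.append(t)
```

Stealing from the opposite end of the victim's deque reduces contention with the owner, which is the usual rationale for this design.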

  15. The Photonic TIGER: a multicore fiber-fed spectrograph

    CERN Document Server

    Leon-Saval, Sergio G; Bland-Hawthorn, Joss

    2012-01-01

    We present a proof of concept compact diffraction limited high-resolution fiber-fed spectrograph by using a 2D multicore array input. This high resolution spectrograph is fed by a 2D pseudo-slit, the Photonic TIGER, a hexagonal array of near-diffraction limited single-mode cores. We study the feasibility of this new platform related to the core array separation and rotation with respect to the dispersion axis. A 7 core compact Photonic TIGER fiber-fed spectrograph with a resolving power of around R~31000 and 8 nm bandwidth in the IR centered on 1550 nm is demonstrated. We also describe possible architectures based on this concept for building small scale compact diffraction limited Integral Field Spectrographs (IFS).

  16. Scalable Multi-core Architectures Design Methodologies and Tools

    CERN Document Server

    Jantsch, Axel

    2012-01-01

    As Moore’s law continues to unfold, two important trends have recently emerged. First, the growth of chip capacity is translated into a corresponding increase in the number of cores. Second, the parallelization of computation and 3D integration technologies lead to distributed memory architectures. This book provides a current snapshot of industrial and academic research, conducted as part of the European FP7 MOSART project, addressing urgent challenges in many-core architectures and application mapping. It addresses the architectural design of many-core chips, memory and data management, power management, and design and programming methodologies. It also describes how new techniques have been applied in various industrial case studies. Describes trends towards distributed memory architectures and distributed power management; Integrates Network on Chip with distributed, shared memory architectures; Demonstrates novel design methodologies and frameworks for multi-core design space exploration; Shows how midll...

  17. Workload-aware VM Scheduling on Multicore Systems

    Directory of Open Access Journals (Sweden)

    Insoon Jo

    2011-11-01

    Full Text Available In virtualized environments, performance interference between virtual machines (VMs) is a key challenge. In order to mitigate resource contention, efficient VM scheduling is essential. In this paper, we propose a workload-aware VM scheduler for multi-core systems, which finds a system-wide mapping of VMs to physical cores. Our work aims not only at minimizing the number of used hosts, but also at maximizing system throughput. To achieve the first goal, our scheduler dynamically adjusts the set of used hosts. To achieve the second goal, it maps each VM to a physical core where the core and its host most sufficiently meet the resource requirements of the VM. Evaluation demonstrates that our scheduling ensures efficient use of data center resources.
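The first goal above (minimizing the number of used hosts) is a bin-packing problem. A first-fit-decreasing heuristic is one common way to approach it and is sketched below as an assumption for illustration; it is not the scheduler described in the paper, and the VM names and capacities are hypothetical.

```python
def place_vms(vm_demands, host_capacity):
    """First-fit-decreasing placement: sort VMs by demand and put each on
    the first host with enough remaining capacity, opening a new host
    only when none fits."""
    hosts = []                                   # remaining capacity per host
    placement = {}
    for vm, demand in sorted(vm_demands.items(), key=lambda kv: -kv[1]):
        for i, free in enumerate(hosts):
            if free >= demand:
                hosts[i] -= demand
                placement[vm] = i
                break
        else:                                    # no existing host fits
            hosts.append(host_capacity - demand)
            placement[vm] = len(hosts) - 1
    return placement, len(hosts)

placement, n_hosts = place_vms({"a": 0.6, "b": 0.5, "c": 0.4, "d": 0.3},
                               host_capacity=1.0)
```

Four VMs with 1.8 units of total demand pack onto two unit-capacity hosts here, whereas naive arrival-order placement could need three.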

  18. Single-shot polarimetry imaging of multicore fiber.

    Science.gov (United States)

    Sivankutty, Siddharth; Andresen, Esben Ravn; Bouwmans, Géraud; Brown, Thomas G; Alonso, Miguel A; Rigneault, Hervé

    2016-05-01

    We report an experimental test of single-shot polarimetry applied to the problem of real-time monitoring of the output polarization states in each core within a multicore fiber bundle. The technique uses a stress-engineered optical element, together with an analyzer, and provides a point spread function whose shape unambiguously reveals the polarization state of a point source. We implement this technique to monitor, simultaneously and in real time, the output polarization states of up to 180 single-mode fiber cores in both conventional and polarization-maintaining fiber bundles. We demonstrate also that the technique can be used to fully characterize the polarization properties of each individual fiber core, including eigen-polarization states, phase delay, and diattenuation.

  19. FPGA wavelet processor design using language for instruction-set architectures (LISA)

    Science.gov (United States)

    Meyer-Bäse, Uwe; Vera, Alonzo; Rao, Suhasini; Lenk, Karl; Pattichis, Marios

    2007-04-01

    The design of a microprocessor is a long, tedious, and error-prone task, typically consisting of four phases: architecture exploration; software design (assembler, linker, loader, profiler); architecture implementation (RTL generation for FPGA or cell-based ASIC); and verification. The Language for instruction-set architectures (LISA) allows a microprocessor to be modeled not only from the instruction-set view but also from an architecture description including pipelining behavior, which provides design and development tool consistency across all levels of the design. To explore the capability of the LISA processor design platform, a.k.a. CoWare Processor Designer, we present in this paper three microprocessor designs that implement an 8/8 wavelet transform processor of the kind used in today's FBI fingerprint compression scheme. We have designed a 3-stage pipelined 16-bit RISC processor (NanoBlaze). Although RISC μPs are usually considered "fast" processors due to design concepts like constant instruction word size, deep pipelines and many general-purpose registers, it turns out that DSP operations consume considerable processing time in a RISC processor. In a second step we have used design principles from programmable digital signal processors (PDSPs) to improve the throughput of the DWT processor. A multiply-accumulate operation along with an indirect addressing operation were the key to achieving higher throughput. A further improvement is possible with today's FPGA technology. Today's FPGAs offer a large number of embedded array multipliers, and it is now feasible to design a "true" vector processor (TVP). A multiplication of two vectors can be done in just one clock cycle with our TVP, a complete scalar product in two clock cycles. Code profiling and Xilinx FPGA ISE synthesis results are provided that demonstrate the essential improvement that a TVP has compared with traditional RISC or PDSP designs.

  20. I/O Strategies for Multicore Processing in ATLAS

    Science.gov (United States)

    van Gemmeren, P.; Binet, S.; Calafiura, P.; Lavrijsen, W.; Malon, D.; Tsulaia, V.

    2012-12-01

    A critical component of any multicore/manycore application architecture is the handling of input and output. Even in the simplest of models, design decisions interact both in obvious and in subtle ways with persistence strategies. When multiple workers handle I/O independently using distinct instances of a serial I/O framework, for example, it may happen that because of the way data from consecutive events are compressed together, there may be serious inefficiencies, with workers redundantly reading the same buffers, or multiple instances thereof. With shared reader strategies, caching and buffer management by the persistence infrastructure and by the control framework may have decisive performance implications for a variety of design choices. Providing the next event may seem straightforward when all event data are contiguously stored in a block, but there may be performance penalties to such strategies when only a subset of a given event's data are needed; conversely, when event data are partitioned by type in persistent storage, providing the next event becomes more complicated, requiring marshalling of data from many I/O buffers. Output strategies pose similarly subtle problems, with complications that may lead to significant serialization and the possibility of serial bottlenecks, either during writing or in post-processing, e.g., during data stream merging. In this paper we describe the I/O components of AthenaMP, the multicore implementation of the ATLAS control framework, and the considerations that have led to the current design, with attention to how these I/O components interact with ATLAS persistent data organization and infrastructure.
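The shared-reader strategy discussed in this abstract can be sketched in miniature. In the sketch below, threads and an in-process queue stand in for AthenaMP's worker processes and its actual inter-process communication, and the integer "events" and doubling "reconstruction" step are purely hypothetical; the point is only that each event is read once and handed to whichever worker asks next, instead of every worker redundantly re-reading the same buffers.

```python
import queue
import threading

def shared_reader(events, q, n_workers):
    """Single shared reader: each event is read (and, in a real system,
    decompressed) exactly once, then dispatched to the workers."""
    for ev in events:
        q.put(ev)
    for _ in range(n_workers):
        q.put(None)                 # one poison pill per worker

def worker(q, out):
    while (ev := q.get()) is not None:
        out.put(ev * 2)             # stand-in for per-event reconstruction

q, out = queue.Queue(), queue.Queue()
threads = [threading.Thread(target=worker, args=(q, out)) for _ in range(3)]
for t in threads:
    t.start()
shared_reader(range(12), q, n_workers=3)
for t in threads:
    t.join()
processed = sorted(out.get() for _ in range(12))
```

A real event store would add the caching and buffer management the abstract mentions; the queue here simply models the "provide the next event" interface.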

  1. Photonic bandgap fiber lasers and multicore fiber lasers for next generation high power lasers

    DEFF Research Database (Denmark)

    Shirakawa, A.; Chen, M.; Suzuki, Y.

    2014-01-01

    Photonic bandgap fiber lasers are realizing new laser spectra and nonlinearity mitigation that a conventional fiber laser cannot. Multicore fiber lasers are a promising tool for power scaling by coherent beam combination. © 2014 OSA.

  2. Nonlinear switching in multicore versus multimode waveguide junctions for mode-locked laser applications.

    Science.gov (United States)

    Nazemosadat, Elham; Mafi, Arash

    2013-12-16

    The main differences in nonlinear switching behavior between multicore versus multimode waveguide couplers are highlighted. By gradually decreasing the separation between the two cores of a dual-core waveguide and interpolating from a multicore to a multimode scenario, the role of the linear coupling, self-phase modulation, cross-phase modulation, and four-wave mixing terms are explored, and the key reasons are identified behind higher switching power requirements and lower switching quality in multimode nonlinear couplers.

  3. Multi-core Fibers in Submarine Networks for High-Capacity Undersea Transmission Systems

    DEFF Research Database (Denmark)

    Nooruzzaman, Md; Morioka, Toshio

    2017-01-01

    Application of multi-core fibers in undersea networks for high-capacity submarine transmission systems is studied. It is demonstrated how different architectures of submerged branching unit affect network component counts in long-haul undersea transmission systems.

  4. Low-Power and High Speed 128-Point Pipeline FFT/IFFT Processor for OFDM Applications

    Directory of Open Access Journals (Sweden)

    D. Rajaveerappa

    2012-03-01

    Full Text Available This paper presents a low-power and high-speed 128-point pipelined Fast Fourier Transform (FFT) and Inverse Fast Fourier Transform (IFFT) processor for OFDM. The modified architecture also provides a ROM module and variable-length support from 128~2048 points for FFT/IFFT for OFDM applications such as digital audio broadcasting (DAB), digital video broadcasting-terrestrial (DVB-T), asymmetric digital subscriber loop (ADSL) and very-high-speed digital subscriber loop (VDSL). The 128-point architecture consists of an optimized pipeline implementation based on a Radix-2 butterfly processing element. To reduce power consumption and chip area, special current-mode SRAMs are adopted to replace the shift registers in the delay lines. In low-power operation, when the supply voltage is scaled down to 2.3 V, the processor consumes 176 mW while running at 17.8 MHz.
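The radix-2 butterfly at the heart of such a pipeline can be shown in software form. The textbook decimation-in-time FFT below computes the same butterflies that a 128-point pipeline unrolls across hardware stages; it is a generic sketch, not the paper's architecture, and the 8-point input is a hypothetical example.

```python
import cmath

def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT. Each level of the
    recursion corresponds to one butterfly stage of a pipelined FFT."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor
        out[k] = even[k] + tw                            # butterfly: sum
        out[k + n // 2] = even[k] - tw                   # butterfly: difference
    return out

spectrum = fft_radix2([1, 1, 1, 1, 0, 0, 0, 0])
```

An IFFT can reuse the same structure with conjugated twiddle factors and a 1/N scaling, which is why FFT/IFFT processors share one datapath.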

  5. The Associative Memory Boards for the FTK Processor at ATLAS

    CERN Document Server

    Calabro, D; The ATLAS collaboration; Citraro, S; Donati, S; Giannetti, P; Lanza, A; Luciano, P; Magalotti, D; Piendibene, M

    2013-01-01

    The Associative Memory (AM) system, the main part of the FastTracker (FTK) processor, is designed to perform pattern matching using the information of the silicon tracking detectors. It finds track candidates at low resolution that are seeds for the following step performing precise track fitting. The system has to support challenging data traffic, handled by a group of modern low-cost FPGAs, the Xilinx Spartan6 chips, which have Low-Power Gigabit Transceivers (GTP). Each GTP transceiver is a combined transmitter and receiver capable of operating at data rates up to 3.2 Gb/s.

  6. On-board neural processor design for intelligent multisensor microspacecraft

    Science.gov (United States)

    Fang, Wai-Chi; Sheu, Bing J.; Wall, James

    1996-03-01

    A compact VLSI neural processor based on the Optimization Cellular Neural Network (OCNN) has been under development to provide a wide range of support for an intelligent remote sensing microspacecraft which requires both high-bandwidth communication and high-performance computing for on-board data analysis, thematic data reduction, synergy of multiple types of sensors, and other advanced smart-sensor functions. The OCNN is developed with emphasis on its capability to find globally optimal solutions by using a hardware annealing method. The hardware annealing function is embedded in the network. It is a parallel version of fast mean-field annealing in analog networks, and is highly efficient in finding globally optimal solutions for cellular neural networks. The OCNN is designed to perform programmable functions for fine-grained processing with annealing control to enhance the output quality. The OCNN architecture is a programmable multi-dimensional array of neurons, each locally connected to its neighboring neurons. Major design features of the OCNN neural processor include massively parallel neural processing, hardware annealing capability, a winner-take-all mechanism, digitally programmable synaptic weights, and a multisensor parallel interface. The current-mode VLSI design feasibility of a compact OCNN neural processor is demonstrated by a prototype 5 x 5 neuroprocessor array chip in a 2-micrometer CMOS technology. The OCNN operation theory, architecture, design and implementation, prototype chip, and system applications have been investigated in detail and are presented in this paper.

  7. Ultrafast Fourier-transform parallel processor

    Energy Technology Data Exchange (ETDEWEB)

    Greenberg, W.L.

    1980-04-01

    A new, flexible, parallel-processing architecture is developed for a high-speed, high-precision Fourier transform processor. The processor is intended for use in 2-D signal processing including spatial filtering, matched filtering and image reconstruction from projections.

  8. Adapting implicit methods to parallel processors

    Energy Technology Data Exchange (ETDEWEB)

    Reeves, L.; McMillin, B.; Okunbor, D.; Riggins, D. [Univ. of Missouri, Rolla, MO (United States)

    1994-12-31

    When numerically solving many types of partial differential equations, it is advantageous to use implicit methods because of their better stability and more flexible parameter choice (e.g., larger time steps). However, since implicit methods usually require simultaneous knowledge of the entire computational domain, these methods are difficult to implement directly on distributed-memory parallel processors. This leads to infrequent use of implicit methods on parallel/distributed systems. The usual implementation of implicit methods is inefficient due to the nature of parallel systems, where it is common to take the computational domain and distribute the grid points over the processors so as to maintain a relatively even workload per processor. This creates a problem at the locations in the domain where adjacent points are not on the same processor. In order for the values at these points to be calculated, messages have to be exchanged between the corresponding processors. Without special adaptation, this will result in idle processors during part of the computation, and as the number of idle processors increases, the effective speed improvement from using a parallel processor decreases.
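The boundary-exchange problem described above can be sketched with a toy 1D Laplace solver: the domain is split into parts ("processors"), and before every sweep each part receives its neighbours' edge values. Plain arrays and a Jacobi sweep stand in for real message passing and for one iteration of an implicit solve; the domain split, boundary values, and iteration count are all illustrative assumptions.

```python
import numpy as np

def jacobi_decomposed(u_parts, steps):
    """Jacobi sweeps on a decomposed 1D domain with an explicit halo
    exchange before every sweep - the message-exchange step the
    abstract identifies as the parallel bottleneck."""
    for _ in range(steps):
        # halo exchange: each part receives its neighbours' edge values
        halos = []
        for i, part in enumerate(u_parts):
            left = u_parts[i - 1][-1] if i > 0 else 0.0               # u(0) = 0
            right = u_parts[i + 1][0] if i < len(u_parts) - 1 else 1.0  # u(L) = 1
            halos.append((left, right))
        # local sweep on each part using the received halo values
        new_parts = []
        for part, (left, right) in zip(u_parts, halos):
            ext = np.concatenate([[left], part, [right]])
            new_parts.append(0.5 * (ext[:-2] + ext[2:]))
        u_parts = new_parts
    return u_parts

parts = jacobi_decomposed([np.zeros(4), np.zeros(4)], steps=2000)
solution = np.concatenate(parts)
```

The solver converges to the linear profile of the 1D Laplace equation; on a real machine each halo line would be an actual message, and a processor sits idle whenever it waits for one.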

  9. The TM3270 Media-processor

    NARCIS (Netherlands)

    van de Waerdt, J.W.

    2006-01-01

    In this thesis, we present the TM3270 VLIW media-processor, the latest of the TriMedia processors, and describe the innovations with respect to its predecessor, the TM3260. We describe enhancements to the load/store unit design, such as a new data prefetching technique, and architectural enhancements.

  10. Multi-output programmable quantum processor

    OpenAIRE

    Yu, Yafei; Feng, Jian; Zhan, Mingsheng

    2002-01-01

    By combining telecloning and the programmable quantum gate array presented by Nielsen and Chuang [Phys. Rev. Lett. 79:321 (1997)], we propose a programmable quantum processor which can be programmed to implement a restricted set of operations with several identical data outputs. The outputs are approximately-transformed versions of the input data. The processor succeeds with a certain probability.

  11. 7 CFR 1215.14 - Processor.

    Science.gov (United States)

    2010-01-01

    ... AND ORDERS; MISCELLANEOUS COMMODITIES), DEPARTMENT OF AGRICULTURE POPCORN PROMOTION, RESEARCH, AND CONSUMER INFORMATION Popcorn Promotion, Research, and Consumer Information Order Definitions § 1215.14 Processor. Processor means a person engaged in the preparation of unpopped popcorn for the market who...

  12. The TM3270 Media-processor

    NARCIS (Netherlands)

    van de Waerdt, J.W.

    2006-01-01

    In this thesis, we present the TM3270 VLIW media-processor, the latest of the TriMedia processors, and describe the innovations with respect to its predecessor, the TM3260. We describe enhancements to the load/store unit design, such as a new data prefetching technique, and architectural enhancements.

  13. Advanced Multiple Processor Configuration Study. Final Report.

    Science.gov (United States)

    Clymer, S. J.

    This summary of a study on multiple processor configurations includes the objectives, background, approach, and results of research undertaken to provide the Air Force with a generalized model of computer processor combinations for use in the evaluation of proposed flight training simulator computational designs. An analysis of a real-time flight…

  14. The Case for a Generic Implant Processor

    NARCIS (Netherlands)

    Strydis, C.; Gaydadjiev, G.N.

    2008-01-01

    A more structured and streamlined design of implants is nowadays possible. In this paper we focus on implant processors located in the heart of implantable systems. We present a real and representative biomedical-application scenario where such a new processor can be employed. Based on a suitably se

  15. An Empirical Evaluation of XQuery Processors

    NARCIS (Netherlands)

    Manegold, S.

    2008-01-01

    This paper presents an extensive and detailed experimental evaluation of XQuery processors. The study consists of running five publicly available XQuery benchmarks --- the Michigan benchmark (MBench), XBench, XMach-1, XMark and X007 --- on six XQuery processors, three stand-alone (file-based) XQuery

  16. The Case for a Generic Implant Processor

    NARCIS (Netherlands)

    Strydis, C.; Gaydadjiev, G.N.

    2008-01-01

    A more structured and streamlined design of implants is nowadays possible. In this paper we focus on implant processors located in the heart of implantable systems. We present a real and representative biomedical-application scenario where such a new processor can be employed. Based on a suitably

  17. Towards a Process Algebra for Shared Processors

    DEFF Research Database (Denmark)

    Buchholtz, Mikael; Andersen, Jacob; Løvengreen, Hans Henrik

    2002-01-01

    We present initial work on a timed process algebra that models sharing of processor resources allowing preemption at arbitrary points in time. This enables us to model both the functional and the timely behaviour of concurrent processes executed on a single processor. We give a refinement relation...

  18. Verilog Implementation of 32-Bit CISC Processor

    Directory of Open Access Journals (Sweden)

    P.Kanaka Sirisha

    2016-04-01

    Full Text Available The project deals with the design of a 32-bit CISC processor and the modeling of its components using the Verilog language. The entire processor uses a 32-bit bus to deal with all the registers and the memories. This processor implements various arithmetic, logical, and data transfer operations using variable-length instructions, which is the core property of the CISC architecture. The processor also supports various addressing modes to perform a 32-bit instruction. Our processor uses the Harvard architecture (i.e., separate program and data memories) and hence has different buses to negotiate with the program memory and data memory individually. This feature enhances the speed of our processor. Hence it has two different program counters to point to the memory locations of the program memory and data memory. Our processor has 'instruction queuing', which saves the time needed to fetch instructions and hence increases the speed of operation. An 'interrupt service routine' is provided in our processor to let it address interrupts.

  19. Automated and Assistive Tools for Accelerated Code migration of Scientific Computing on to Heterogeneous MultiCore Systems

    Science.gov (United States)

    2017-04-13

    AFRL-AFOSR-UK-TR-2017-0029: Automated and Assistive Tools for Accelerated Code migration of Scientific Computing on to Heterogeneous MultiCore Systems (contract FA8655-12-1-2021, grant 12-2021, program element 61102F). The approach was based on the OmpSs programming model and the performance tools that constitute two strategic…

  20. Affinity-Based Network Interfaces for Efficient Communication on Multicore Architectures

    Institute of Scientific and Technical Information of China (English)

    Andrés Ortiz; Julio Ortega; Antonio F.Díaz; Alberto Prieto

    2013-01-01

    Improving network interface performance is demanded by applications with high communication requirements (for example, some multimedia, real-time, and high-performance computing applications), and by the availability of network links providing multiple gigabits per second of bandwidth, which could require many processor cycles for communication tasks. Multicore architectures, the current trend in microprocessor development to cope with the difficulties of further increasing clock frequencies and microarchitecture efficiencies, provide new opportunities to exploit the parallelism available in the nodes for designing efficient communication architectures. Nevertheless, although present OS network stacks include multiple threads that make it possible to execute network tasks concurrently in the kernel, implementations of packet-based or connection-based parallelism are not trivial, as they have to take into account issues related to the cost of synchronization in the access to shared resources and the efficient use of caches. Therefore, a common trend in many recent studies on this topic is to assign network interrupts and the corresponding protocol and network application processing to the same core, as with this affinity scheduling it would be possible to reduce contention for shared resources and cache misses. In this paper we propose and analyze several configurations to distribute the network interface among the different cores available in the server. These alternatives have been devised according to the affinity of the corresponding communication tasks with the location (proximity to the memories where the different data structures are stored) and characteristics of the processing core. As this approach uses several cores to accelerate the communication path of a given connection, it can be seen as complementary to those that consider several cores to simultaneously process packets belonging to either the same or different connections.
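The affinity-scheduling idea of co-locating a connection's interrupt, protocol, and application processing can be sketched as a pure placement computation. Everything below (the `assign_affinity` helper, the two-node topology, the connection names) is hypothetical illustration, not the paper's configurations.

```python
def assign_affinity(connections, topology):
    """Place each connection's interrupt, protocol, and application
    processing on cores of the same node, so they share caches and sit
    near the memory holding the connection's data structures."""
    assignment = {}
    nodes = list(topology.items())               # node -> list of core ids
    for i, conn in enumerate(connections):
        node, cores = nodes[i % len(nodes)]      # spread connections over nodes
        assignment[conn] = {
            "irq": cores[0],                     # NIC interrupt handling
            "protocol": cores[0],                # kernel network-stack work
            "app": cores[1 % len(cores)],        # application thread, same node
        }
    return assignment

plan = assign_affinity(["conn0", "conn1"],
                       topology={"node0": [0, 1], "node1": [2, 3]})
```

On Linux, such a plan could then be applied with `os.sched_setaffinity` for the application threads and the IRQ affinity interfaces for the interrupts; the sketch stops at computing the placement.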

  1. Neurovision processor for designing intelligent sensors

    Science.gov (United States)

    Gupta, Madan M.; Knopf, George K.

    1992-03-01

    A programmable multi-task neuro-vision processor, called the Positive-Negative (PN) neural processor, is proposed as a plausible hardware mechanism for constructing robust multi-task vision sensors. The computational operations performed by the PN neural processor are loosely based on the neural activity fields exhibited by certain nervous tissue layers situated in the brain. The neuro-vision processor can be programmed to generate diverse dynamic behavior that may be used for spatio-temporal stabilization (STS), short-term visual memory (STVM), spatio-temporal filtering (STF) and pulse frequency modulation (PFM). A multi-functional vision sensor that performs a variety of information processing operations on time-varying two-dimensional sensory images can be constructed from a parallel and hierarchical structure of numerous individually programmed PN neural processors.

  2. Cooperative Computing Techniques for a Deeply Fused and Heterogeneous Many-Core Processor Architecture

    Institute of Scientific and Technical Information of China (English)

    郑方; 李宏亮; 吕晖; 过锋; 许晓红; 谢向辉

    2015-01-01

    Due to advances in semiconductor techniques, many-core processors have been widely used in high performance computing. However, many applications still cannot be carried out efficiently due to the memory wall, which has become a bottleneck in many-core processors. In this paper, we present a novel heterogeneous many-core processor architecture named deeply fused many-core (DFMC) for high performance computing systems. DFMC integrates management processing elements (MPEs) and computing processing elements (CPEs), which are heterogeneous processor cores for different application features with a unified ISA (instruction set architecture), a unified execution model, and shared memory that supports cache coherence. The DFMC processor can alleviate the memory wall problem by combining a series of cooperative computing techniques of CPEs, such as multi-pattern data stream transfer, an efficient register-level communication mechanism, and a fast hardware synchronization technique. These techniques are able to improve on-chip data reuse and optimize memory access performance. This paper illustrates an implementation of a full system prototype based on FPGA with four MPEs and 256 CPEs. Our experimental results show that the effect of the cooperative computing techniques of CPEs is significant, with DGEMM (double-precision matrix multiplication) achieving an efficiency of 94%, FFT (fast Fourier transform) obtaining a performance of 207 GFLOPS and FDTD (finite-difference time-domain) obtaining a performance of 27 GFLOPS.

  3. Computing trends using graphic processor in high energy physics

    CERN Document Server

    Niculescu, Mihai

    2011-01-01

    One of the main challenges in High Energy Physics is to analyze quickly the large amount of experimental and simulated data. At LHC-CERN one p-p event is approximately 1 MB in size. The time taken to analyze the data and obtain fast results depends on high computational power. The main advantage of GPU (Graphics Processing Unit) programming over traditional CPU programming is that graphics cards bring a lot of computing power at a very low price. Today a huge number of applications (scientific, financial, etc.) are being ported to or developed for GPUs, including Monte Carlo tools and data analysis tools for High Energy Physics. In this paper, we present the current status and trends in HEP computing using GPUs.

  4. Design of Heterogeneous Multi-core SoC Based on System Generator

    Institute of Scientific and Technical Information of China (English)

    杨宏来; 黄旻

    2012-01-01

    Heterogeneous multi-core processors can assign different types of tasks to different types of processor cores for parallel processing, and can therefore provide a flexible and efficient mechanism for diverse application requirements. This paper proposes a design method for SoC-oriented heterogeneous multi-core systems with which image processing algorithms can be realized conveniently and efficiently. The basic methods of image degradation and restoration are described, a basic model of the algorithm is given, and system-level modeling and simulation are performed with the digital signal processing development tool System Generator. The coprocessor Pcore for image degradation and restoration is then generated automatically via the EDK Processor, and is combined with the Xilinx MicroBlaze soft core to build a heterogeneous multi-core SoC. The method eliminates most of the work of writing HDL manually and simplifies the design process.

  5. Application of LBM on Multi-Core Parallel Programming Model

    Institute of Scientific and Technical Information of China (English)

    李彬彬; 李青

    2011-01-01

    The LBGK (Lattice Bhatnagar-Gross-Krook) model is not only a new development in the theory and application of the LBM (Lattice Boltzmann Method) but also a novel numerical method well suited to massively parallel processing. Through its thread management, the MTI (Multi-Thread Interface) library provides two main methods for parallel coding on multicore processors: data parallelism based on cache blocking, and task scheduling with work stealing. MTI makes full use of the resources of multicore processors and offers a convenient and efficient interface for development in multicore environments, greatly reducing the burden on developers. An LBGK model for pattern formation is realized with MTI, and the numerical results show that MTI is efficient and easy to use.

  6. Making CSB+-Tree Processor Conscious

    DEFF Research Database (Denmark)

    Samuel, Michael; Pedersen, Anders Uhl; Bonnet, Philippe

    2005-01-01

    Cache-conscious indexes, such as CSB+-tree, are sensitive to the underlying processor architecture. In this paper, we focus on how to adapt the CSB+-tree so that it performs well on a range of different processor architectures. Previous work has focused on the impact of node size on the performance...... of the CSB+-tree. We argue that it is necessary to consider a larger group of parameters in order to adapt CSB+-tree to processor architectures as different as Pentium and Itanium. We identify this group of parameters and study how it impacts the performance of CSB+-tree on Itanium 2. Finally, we propose...

  7. Processor arrays with asynchronous TDM optical buses

    Science.gov (United States)

    Li, Y.; Zheng, S. Q.

    1997-04-01

    We propose a pipelined asynchronous time division multiplexing optical bus. Such a bus can use one of two hardwired priority schemes, the linear priority scheme and the round-robin priority scheme. Our simulation results show that the performance of our proposed buses is significantly better than that of known pipelined synchronous time division multiplexing optical buses. We also propose a class of processor arrays connected by pipelined asynchronous time division multiplexing optical buses. We claim that our proposed processor arrays not only have better performance but also better scalability than the existing processor arrays connected by pipelined synchronous time division multiplexing optical buses.

  8. Next generation Associative Memory devices for the FTK tracking processor of the ATLAS experiment

    CERN Document Server

    Andreani, A; The ATLAS collaboration; Beccherle, B; Beretta, M; Citterio, M; Crescioli, F; Colombo, A; Giannetti, P; Liberali, V; Shojaii, J; Stabile, A

    2013-01-01

    The AMchip is a VLSI device that implements the associative memory function, a special content addressable memory specifically designed for high energy physics applications and first used in the CDF experiment at the Tevatron. The 4th generation of AMchip has been developed for the core pattern recognition stage of the Fast TracKer (FTK) processor: a hardware processor for online reconstruction of particle trajectories at the ATLAS experiment at the LHC. We present the architecture, design considerations, power consumption and performance measurements of the 4th generation of AMchip. We also present the design innovations toward the 5th generation and the first prototype results.

  9. Feature detection and SLAM on embedded processors for micro-robot navigation

    Science.gov (United States)

    Robinette, Paul; Collins, Thomas R.

    2013-05-01

    We have developed software that allows a micro-robot to localize itself at a 1 Hz rate using only onboard hardware. The Surveyor SRV-1 robot and its Blackfin processors were used to perform FAST feature detection on images. Good features selected from these images were then described using the SURF descriptor algorithm. An onboard Gumstix then correlated the features reported by the two processors and used GTSAM to develop an estimate of robot localization and landmark positions. Localization errors in this system were on the same order of magnitude as the size of the robot itself, giving the robot the potential to autonomously operate in a real-world environment.

  10. The MIDAS processor. [Multivariate Interactive Digital Analysis System for multispectral scanner data

    Science.gov (United States)

    Kriegler, F. J.; Gordon, M. F.; Mclaughlin, R. H.; Marshall, R. E.

    1975-01-01

    The MIDAS (Multivariate Interactive Digital Analysis System) processor is a high-speed processor designed to process multispectral scanner data (from Landsat, EOS, aircraft, etc.) quickly and cost-effectively to meet the requirements of users of remote sensor data, especially from very large areas. MIDAS consists of a fast multipipeline preprocessor and classifier, an interactive color display and color printer, and a medium scale computer system for analysis and control. The system is designed to process data having as many as 16 spectral bands per picture element at rates of 200,000 picture elements per second into as many as 17 classes using a maximum likelihood decision rule.
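The maximum likelihood decision rule at the heart of the MIDAS classifier can be sketched for the Gaussian case; the class statistics below are illustrative placeholders, not MIDAS parameters:

```python
import numpy as np

def ml_classify(pixels, means, covs):
    """Assign each multispectral pixel vector to the class whose
    Gaussian log-likelihood is highest (maximum likelihood rule)."""
    scores = np.empty((pixels.shape[0], len(means)))
    for k, (mu, cov) in enumerate(zip(means, covs)):
        diff = pixels - mu
        # log N(x; mu, cov) up to a class-independent constant:
        # -0.5 * (log|cov| + (x - mu)^T cov^-1 (x - mu))
        mahal = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
        scores[:, k] = -0.5 * (np.log(np.linalg.det(cov)) + mahal)
    return np.argmax(scores, axis=1)
```

In a hardware pipeline such as MIDAS, the per-class score computation is what gets replicated across pipelines; here it is simply a loop over classes.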

  11. Quantization analysis of the real-time SAR digital image formation processor

    Energy Technology Data Exchange (ETDEWEB)

    Magotra, N.

    1988-12-01

    This report presents a quantization analysis of the digital image formation processor (IFP) of a linear-FM synthetic aperture radar (SAR). The IFP is configured as a patch processor and forms the final image by performing a two-dimensional Fast Fourier Transform (FFT). The quantization analysis examines the effects of using fixed precision arithmetic in the image formation process. Theoretical bounds for the worst-case errors introduced by using fixed point arithmetic and experimental results verifying the theoretical bounds are presented. 34 refs., 23 figs., 7 tabs.
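The effect the report analyses, error introduced by fixed-point arithmetic in an FFT-based processing chain, can be illustrated by quantizing the FFT input to a hypothetical fixed-point word length and comparing against the floating-point result (the 12-bit choice below is arbitrary, not the IFP's format):

```python
import numpy as np

def quantize(x, frac_bits):
    """Round to a fixed-point grid with `frac_bits` fractional bits."""
    scale = 2.0 ** frac_bits
    return np.round(x * scale) / scale

rng = np.random.default_rng(0)
signal = 0.25 * rng.standard_normal(256)    # keep |x| < 1 (Q-format range)
exact = np.fft.fft(signal)
approx = np.fft.fft(quantize(signal, 12))   # 12 fractional bits
worst_case_error = np.max(np.abs(exact - approx))
```

This only models input quantization; the report's bounds also cover rounding inside the FFT butterflies themselves.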

  12. Unified UDispatch: A User Dispatching Tool for Multicore Systems

    Institute of Scientific and Technical Information of China (English)

    Tang-Hsun Tu; Chih-Wen Hsueh

    2011-01-01

    In multicore environments, multithreading is often used to improve application performance. However, even in many simple applications, performance might degrade as the number of threads increases. Users usually attribute this phenomenon to the overhead of creating or terminating threads. In our observation, how the threads are dispatched to the multiple cores might have a more significant effect. We formally defined the problems of using threads as multithreading anomalies, and presented a novel user dispatching mechanism (UDispatch) which provides controllability in user space to improve application performance. Through modification of application source code with the UDispatch application programming interface (API), application performance can be improved significantly. However, since application source code might not be available or might be too complicated to modify, we provided an extension, called UDispatch+, to dispatch threads without any modification of application source code. In this paper, UDispatch and UDispatch+ are integrated and wrapped for more portability and introduced as a tool called Unified UDispatch (UUD), with more detailed experiments and description. It can dispatch application threads to specific cores at the discretion of users, with up to 171.8% performance improvement on a 4-core machine.
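The core idea, choosing explicitly which core each thread runs on instead of leaving it to the OS scheduler, can be sketched with a toy round-robin dispatch plan (an illustrative policy only; UDispatch's actual API and policies are richer):

```python
def plan_dispatch(n_threads, cores):
    """Map each thread index to a core id, round-robin.
    On Linux, a worker thread could then pin itself with
    os.sched_setaffinity(0, {plan[my_index]})."""
    return {t: cores[t % len(cores)] for t in range(n_threads)}

# Six threads over a hypothetical 4-core machine.
plan = plan_dispatch(6, [0, 1, 2, 3])
```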

  13. Variations on Multi-Core Nested Depth-First Search

    Directory of Open Access Journals (Sweden)

    Alfons Laarman

    2011-10-01

    Full Text Available Recently, two new parallel algorithms for on-the-fly model checking of LTL properties were presented at the same conference: Automated Technology for Verification and Analysis, 2011. Both approaches extend Swarmed NDFS, which runs several sequential NDFS instances in parallel. While parallel random search already speeds up detection of bugs, the workers must share some global information in order to speed up full verification of correct models. The two algorithms differ considerably in the global information shared between workers, and in the way they synchronize. Here, we provide a thorough experimental comparison between the two algorithms, by measuring the runtime of their implementations on a multi-core machine. Both algorithms were implemented in the same framework of the model checker LTSmin, using similar optimizations, and have been subjected to the full BEEM model database. Because both algorithms have complementary advantages, we constructed an algorithm that combines both ideas. This combination clearly has an improved speedup. We also compare the results with the alternative parallel algorithm for accepting cycle detection OWCTY-MAP. Finally, we study a simple statistical model for input models that do contain accepting cycles. The goal is to distinguish the speedup due to parallel random search from the speedup that can be attributed to clever work sharing schemes.
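For reference, the sequential NDFS that both parallel algorithms extend looks for an accepting cycle with a blue (outer) DFS and, in post-order from each accepting state, a red (inner) DFS; a minimal sketch on an explicit successor map (not the LTSmin implementation):

```python
def ndfs(succ, accepting, init):
    """Return True iff an accepting cycle is reachable from `init`."""
    blue, red = set(), set()

    def red_dfs(s, seed):
        if s == seed:
            return True  # closed a cycle through the accepting seed
        red.add(s)
        return any(t not in red and red_dfs(t, seed) for t in succ.get(s, ()))

    def blue_dfs(s):
        blue.add(s)
        if any(t not in blue and blue_dfs(t) for t in succ.get(s, ())):
            return True
        # Post-order: launch the red search from accepting states.
        return s in accepting and any(
            t not in red and red_dfs(t, s) for t in succ.get(s, ()))

    return blue_dfs(init)
```

The parallel variants differ precisely in how the `blue` and `red` sets are shared between workers.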

  14. Scalable Multicore Motion Planning Using Lock-Free Concurrency

    Science.gov (United States)

    Ichnowski, Jeffrey; Alterovitz, Ron

    2015-01-01

    We present PRRT (Parallel RRT) and PRRT* (Parallel RRT*), sampling-based methods for feasible and optimal motion planning designed for modern multicore CPUs. We parallelize RRT and RRT* such that all threads concurrently build a single motion planning tree. Parallelization in this manner requires that data structures, such as the nearest neighbor search tree and the motion planning tree, are safely shared across multiple threads. Rather than rely on traditional locks which can result in slowdowns due to lock contention, we introduce algorithms based on lock-free concurrency using atomic operations. We further improve scalability by using partition-based sampling (which shrinks each core’s working data set to improve cache efficiency) and parallel work-saving (in reducing the number of rewiring steps performed in PRRT*). Because PRRT and PRRT* are CPU-based, they can be directly integrated with existing libraries. We demonstrate that PRRT and PRRT* scale well as core counts increase, in some cases exhibiting superlinear speedup, for scenarios such as the Alpha Puzzle and Cubicles scenarios and the Aldebaran Nao robot performing a 2-handed task. PMID:26167135

  15. Annular-cladding erbium doped multicore fiber for SDM amplification.

    Science.gov (United States)

    Jin, Cang; Ung, Bora; Messaddeq, Younès; LaRochelle, Sophie

    2015-11-16

    We propose and numerically investigate annular-cladding erbium doped multicore fibers (AC-EDMCF) with either solid or air hole inner cladding to enhance the pump power efficiency in optical amplifiers for spatial division multiplexing (SDM) transmission links. We first propose an all-glass fiber in which a central inner cladding region with a depressed refractive index is introduced to confine the pump inside a ring-shaped region overlapping the multiple signal cores. Through numerical simulations, we determine signal core and annular pump cladding parameters respecting fabrication constraints. We also propose and examine a multi-spot injection scheme for launching the pump in the annular cladding. With this all-glass fiber with annular cladding, our results predict 10 dB increase in gain and 21% pump power savings compared to the standard double cladding design. We also investigate a fiber with an air hole inner cladding to further enhance the pump power confinement and minimize power leaking into the inner cladding. The results are compared to the all-glass AC-EDMCF.

  16. Mobile Computing Clouds Interactive Model and Algorithm Based On Multi-core Grids

    Directory of Open Access Journals (Sweden)

    Liu Lizhao

    2013-09-01

    Full Text Available Multi-core technology is a key technology of mobile cloud computing. With the booming development of cloud technology, the authors focus on the problem of how target code compiled by a mobile cloud terminal's multi-core compiler can make use of the cloud multi-core system structure while ensuring synchronization of cross-validated compilation data, and propose the concepts of indirect synchronization and direct synchronization of mobile cloud compilation entities. Using waveform information energy conversion, a method is given to calculate indirect and direct synchronization values according to the crossing experience and crossing time of compilation entities; a function relative level algorithm is constructed with the Hellinger distance, and a method for computing a comprehensive synchronization value is given. Through experimental statistics and analysis, taking the threshold limit value as the average and the self-synchronization value as the deviation, an update function for the indirect synchronization value is constructed; an inter-domain multi-core synchronization flow chart is given; and an inter-domain compilation data synchronization update experiment is carried out with more than 3000 mobile cloud multi-core compilation environments. Analysis of the compilation process and results shows the synchronization algorithm to be reasonable and effective.

  17. The optimization of improved mean shift object tracking in embedded multicore DSP parallel system

    Science.gov (United States)

    Tian, Li; Zhou, Fugen; Meng, Cai; Hu, Congliang

    2014-11-01

    This paper proposes a more robust and efficient Mean Shift object tracking algorithm which is optimized for embedded multicore DSP Parallel system. Firstly, the RGB image is transformed into HSV image which is robust in many aspects such as lighting changes. Then, the color histogram model is used in the back projection process to generate the color probability distribution. Secondly, the size and position of search window are initialized in the first frame, and Mean Shift algorithm calculates the center position of the target and adjusts the search window automatically both in size and location, according to the result of the previous frame. Finally, since the multicore DSP system is commonly adopted in the embedded application such as seeker and an optical scout system, we implement the proposed algorithm in the TI multicore DSP system to meet the need of large amount computation. For multicore parallel computing, the explicit IPC based multicore framework is designed which outperforms OpenMP standard. Moreover, the parallelisms of 8 functional units and cross path data fetch capability of C66 core are utilized to accelerate the computation of iteration in Mean Shift algorithm. The experimental results show that the algorithm has good performance in complex scenes such as deformation, scale change and occlusion, simultaneously the proposed optimization method can significantly reduce the computation time.
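The tracking loop described above, shift the search window to the centroid of the back-projected probability image and repeat until convergence, can be sketched as follows (fixed-size window and no scale adaptation, so a simplification of the paper's algorithm):

```python
import numpy as np

def mean_shift_track(prob, window, n_iter=10):
    """Shift a (row, col, height, width) window toward the centroid
    of the back-projected probability image `prob`."""
    r, c, h, w = window
    for _ in range(n_iter):
        patch = prob[r:r + h, c:c + w]
        total = patch.sum()
        if total == 0:
            break  # no target mass inside the window
        ys, xs = np.mgrid[0:h, 0:w]
        # Offset from window center to the probability-mass centroid.
        dy = (ys * patch).sum() / total - (h - 1) / 2
        dx = (xs * patch).sum() / total - (w - 1) / 2
        nr = min(max(int(round(r + dy)), 0), prob.shape[0] - h)
        nc = min(max(int(round(c + dx)), 0), prob.shape[1] - w)
        if (nr, nc) == (r, c):
            break  # converged
        r, c = nr, nc
    return r, c
```

Each iteration is independent per pixel inside the window, which is what makes the inner sums amenable to the multicore DSP parallelization the paper describes.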

  18. Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms

    Energy Technology Data Exchange (ETDEWEB)

    Williams, Samuel; Oliker, Leonid; Vuduc, Richard; Shalf, John; Yelick, Katherine; Demmel, James

    2008-10-16

    We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as every electronic device from cell phones to supercomputers confronts parallelism of unprecedented scale. To fully unleash the potential of these systems, the HPC community must develop multicore specific-optimization methodologies for important scientific computations. In this work, we examine sparse matrix-vector multiply (SpMV) - one of the most heavily used kernels in scientific computing - across a broad spectrum of multicore designs. Our experimental platform includes the homogeneous AMD quad-core, AMD dual-core, and Intel quad-core designs, the heterogeneous STI Cell, as well as one of the first scientific studies of the highly multithreaded Sun Victoria Falls (a Niagara2 SMP). We present several optimization strategies especially effective for the multicore environment, and demonstrate significant performance improvements compared to existing state-of-the-art serial and parallel SpMV implementations. Additionally, we present key insights into the architectural trade-offs of leading multicore design strategies, in the context of demanding memory-bound numerical algorithms.
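The baseline kernel being optimized is a row-wise multiply over a compressed sparse row (CSR) matrix; a plain Python sketch of its data-access pattern (real SpMV implementations use compiled code plus the blocking and prefetching strategies the paper studies):

```python
import numpy as np

def spmv_csr(n_rows, indptr, indices, data, x):
    """y = A @ x with A stored in CSR form: row i's nonzeros live in
    data[indptr[i]:indptr[i+1]] at columns indices[indptr[i]:indptr[i+1]]."""
    y = np.zeros(n_rows)
    for i in range(n_rows):
        # Each row touches only its stored nonzeros; x is accessed
        # irregularly through `indices`, the memory-bound part.
        for k in range(indptr[i], indptr[i + 1]):
            y[i] += data[k] * x[indices[k]]
    return y
```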

  19. Performance modeling and analysis of parallel Gaussian elimination on multi-core computers

    Directory of Open Access Journals (Sweden)

    Fadi N. Sibai

    2014-01-01

    Full Text Available Gaussian elimination is used in many applications and in particular in the solution of systems of linear equations. This paper presents mathematical performance models and analysis of four parallel Gaussian elimination methods (precisely the Original method and the new Meet in the Middle –MiM– algorithms and their variants with SIMD vectorization) on multi-core systems. Analytical performance models of the four methods are formulated and presented, followed by evaluations of these models with modern multi-core systems' operation latencies. Our results reveal that the four methods generally exhibit good performance scaling with increasing matrix size and number of cores. SIMD vectorization only makes a large difference in performance for low numbers of cores. For a large matrix size (n ⩾ 16 K), the performance difference between the MiM and Original methods falls from 16× with four cores to 4× with 16 K cores. The efficiencies of all four methods are low with 1 K cores or more, stressing a major problem of multi-core systems where the network-on-chip and memory latencies are too high in relation to basic arithmetic operations. Thus Gaussian elimination can greatly benefit from the resources of multi-core systems, but higher performance gains can be achieved if multi-core systems can be designed with lower memory operation, synchronization, and interconnect communication latencies, requirements of utmost importance and challenge in the exascale computing age.
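For concreteness, the sequential computation behind the modeled parallel variants is standard Gaussian elimination with partial pivoting; a compact sketch (the paper's methods reorganize this work across cores rather than change its result):

```python
import numpy as np

def gauss_solve(A, b):
    """Solve Ax = b by Gaussian elimination with partial pivoting."""
    A = A.astype(float).copy()
    b = b.astype(float).copy()
    n = len(b)
    for k in range(n):
        # Partial pivoting for numerical stability.
        p = k + np.argmax(np.abs(A[k:, k]))
        A[[k, p]], b[[k, p]] = A[[p, k]], b[[p, k]]
        for i in range(k + 1, n):
            m = A[i, k] / A[k, k]
            A[i, k:] -= m * A[k, k:]     # row updates: the parallel bulk
            b[i] -= m * b[k]
    # Back substitution.
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x
```

The inner row updates at each elimination step are independent, which is the parallelism the performance models quantify.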

  20. Optimization of sparse matrix-vector multiplication on emerging multicore platforms

    Energy Technology Data Exchange (ETDEWEB)

    Williams, S; Oliker, L; Vuduc, R; Shalf, J; Yelick, K; Demmel, J

    2007-04-16

    We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as every electronic device from cell phones to supercomputers confronts parallelism of unprecedented scale. To fully unleash the potential of these systems, the HPC community must develop multicore specific optimization methodologies for important scientific computations. In this work, we examine sparse matrix-vector multiply (SpMV)--one of the most heavily used kernels in scientific computing--across a broad spectrum of multicore designs. Our experimental platform includes the homogeneous AMD dual-core and Intel quad-core designs, as well as the highly multithreaded Sun Niagara and heterogeneous STI Cell. We present several optimization strategies especially effective for the multicore environment, and demonstrate significant performance improvements compared to existing state-of-the-art serial and parallel SpMV implementations. Additionally, we present key insights into the architectural tradeoffs of leading multicore design strategies, in the context of demanding memory-bound numerical algorithms.

  1. Concept of a Supervector Processor: A Vector Approach to Superscalar Processor, Design and Performance Analysis

    Directory of Open Access Journals (Sweden)

    Deepak Kumar, Ranjan Kumar Behera, K. S. Pandey

    2013-07-01

    Full Text Available Maximizing the available performance is always a goal in microprocessor design. In this paper a new technique is implemented which exploits the advantages of both superscalar and vector processing in a proposed processor called the Supervector processor. A vector processor operates on arrays of data called vectors and can greatly improve certain tasks such as numerical simulation and tasks that require heavy number crunching. On the other hand, a superscalar processor issues multiple instructions per cycle, which can enhance throughput. To exploit parallelism, multiple vector instructions are issued and executed per cycle in superscalar fashion. Case studies have been done on various benchmarks to compare the performance of the proposed supervector processor architecture with superscalar and vector processor architectures. The Trimaran framework has been used to evaluate the performance of the proposed supervector processor scheme.

  2. Photonics and Fiber Optics Processor Lab

    Data.gov (United States)

    Federal Laboratory Consortium — The Photonics and Fiber Optics Processor Lab develops, tests and evaluates high speed fiber optic network components as well as network protocols. In addition, this...

  3. Radiation Tolerant Software Defined Video Processor Project

    Data.gov (United States)

    National Aeronautics and Space Administration — MaXentric is proposing a radiation-tolerant Software Defined Video Processor, codenamed SDVP, for the problem of advanced motion imaging in the space environment....

  4. Processor-Dependent Malware... and codes

    CERN Document Server

    Desnos, Anthony; Filiol, Eric

    2010-01-01

    Malware usually targets computers according to their operating system. Thus we have Windows malware, Linux malware, and so on. In this paper, we consider a different approach and show on a technical basis how easily malware can recognize and target systems selectively, according to the onboard processor chip. This technique is very easy to build since it does not rely on deep analysis of the chip's logical gate architecture. Floating Point Arithmetic (FPA) looks promising for defining a set of tests to identify the processor or, more precisely, a subset of possible processors. We give results for different families of processors: AMD, Intel (Dual Core, Atom), Sparc, Digital Alpha, Cell, Atom ... As a conclusion, we propose two open problems that are new, to the authors' knowledge.

  5. Critical review of programmable media processor architectures

    Science.gov (United States)

    Berg, Stefan G.; Sun, Weiyun; Kim, Donglok; Kim, Yongmin

    1998-12-01

    In the past several years, there has been a surge of new programmable mediaprocessors introduced to provide an alternative solution to ASICs and dedicated hardware circuitries in the multimedia PC and embedded consumer electronics markets. These processors attempt to combine the programmability of multimedia-enhanced general purpose processors with the performance and low cost of dedicated hardware. We have reviewed five current multimedia architectures and evaluated their strengths and weaknesses.

  6. First Cluster Algorithm Special Purpose Processor

    Science.gov (United States)

    Talapov, A. L.; Andreichenko, V. B.; Dotsenko S., Vi.; Shchur, L. N.

    We describe the architecture of a special purpose processor built to realize in hardware the Wolff cluster algorithm, which is not hampered by critical slowing down. The processor simulates two-dimensional Ising-like spin systems. With minor changes the same very effective architecture, which can be defined as a Memory Machine, can be used to study phase transitions in a wide range of models in two or three dimensions.

  7. A New Echeloned Poisson Series Processor (EPSP)

    Science.gov (United States)

    Ivanova, Tamara

    2001-07-01

    A specialized Echeloned Poisson Series Processor (EPSP) is proposed. It is a typical software for the implementation of analytical algorithms of Celestial Mechanics. EPSP is designed for manipulating long polynomial-trigonometric series with literal divisors. The coefficients of these echeloned series are the rational or floating-point numbers. The Keplerian processor and analytical generator of special celestial mechanics functions based on the EPSP are also developed.

  8. Evaluating current processors performance and machines stability

    CERN Document Server

    Esposito, R; Tortone, G; Taurino, F M

    2003-01-01

    Accurately estimating the performance of currently available processors is becoming a key activity, particularly in the HENP environment, where high computing power is crucial. This document describes the methods and programs, open source or freeware, used to benchmark processors, memory and disk subsystems, and network connection architectures. These tools are also useful for stress testing new machines, before their acquisition or before their introduction into a production environment, where high uptimes are requested.

  9. A fast PC algorithm for high dimensional causal discovery with multi-core PCs

    OpenAIRE

    Le, Thuc Duy; Hoang, Tao; Li, Jiuyong; Liu, Lin; Liu, Huawen

    2015-01-01

    Discovering causal relationships from observational data is a crucial problem with applications in many research areas. The PC algorithm is the state-of-the-art constraint-based method for causal discovery. However, the runtime of the PC algorithm is, in the worst case, exponential in the number of nodes (variables), and thus it is inefficient when applied to high dimensional data, e.g. gene expression datasets. On another note, the advancement of computer hardware in the last decade ...
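One reason the PC algorithm parallelizes well on multi-core hardware is that the conditional-independence tests within each level are independent of one another. The order-0 (marginal) step of the skeleton search can be sketched as below, where a simple correlation threshold stands in for a proper statistical independence test:

```python
import numpy as np
from itertools import combinations

def skeleton_order0(data, threshold=0.1):
    """Order-0 step of the PC skeleton search: keep an edge (i, j)
    only if variables i and j appear marginally dependent.
    Each edge's test is independent, so the tests of one level
    can run concurrently across cores."""
    n_vars = data.shape[1]
    corr = np.corrcoef(data, rowvar=False)
    return {(i, j) for i, j in combinations(range(n_vars), 2)
            if abs(corr[i, j]) >= threshold}
```

Higher-order steps then condition on growing separating sets; they follow the same per-edge independence structure.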

  10. SMART AS A CRYPTOGRAPHIC PROCESSOR

    Directory of Open Access Journals (Sweden)

    Saroja Kanchi

    2016-05-01

    Full Text Available SMaRT is a 16-bit 2.5-address RISC-type single-cycle processor, which was recently designed and successfully mapped into an FPGA chip in our ECE department. In this paper, we use SMaRT to run the well-known encryption algorithm, the Data Encryption Standard. For information security purposes, encryption is a must in today's sophisticated and ever-increasing computer communications such as ATM machines and SIM cards. For comparison and evaluation purposes, we also map the same algorithm on the HC12, a same-size but CISC-type off-the-shelf microcontroller. Our results show that compared to the HC12, SMaRT code is only 14% longer in terms of the static number of instructions but about 10 times faster in terms of the number of clock cycles, and 7% smaller in terms of code size. Our results also show that 2.5-address instructions, a SMaRT selling point, amount to 45% of all R-type instructions, resulting in a significant improvement in the static number of instructions and hence in code size as well as performance. Additionally, we see that the SMaRT short-branch range is sufficiently wide in 90% of cases in the SMaRT code. Our results also reveal that SMaRT's novel concept of locality of reference in using the MSBs of the registers in non-subroutine branch instructions stays valid with a remarkable hit rate of 95%!

  11. Wavelet-Based Adaptive Solvers on Multi-core Architectures for the Simulation of Complex Systems

    Science.gov (United States)

    Rossinelli, Diego; Bergdorf, Michael; Hejazialhosseini, Babak; Koumoutsakos, Petros

    We build wavelet-based adaptive numerical methods for the simulation of advection dominated flows that develop multiple spatial scales, with an emphasis on fluid mechanics problems. Wavelet based adaptivity is inherently sequential and in this work we demonstrate that these numerical methods can be implemented in software that is capable of harnessing the capabilities of multi-core architectures while maintaining their computational efficiency. Recent designs in frameworks for multi-core software development allow us to rethink parallelism as task-based, where parallel tasks are specified and automatically mapped into physical threads. This way of exposing parallelism enables the parallelization of algorithms that were considered inherently sequential, such as wavelet-based adaptive simulations. In this paper we present a framework that combines wavelet-based adaptivity with the task-based parallelism. We demonstrate good scaling performance obtained by simulating diverse physical systems on different multi-core and SMP architectures using up to 16 cores.

  12. Capacity of Space-Division Multiplexing with Heterogeneous Multi-Core Fibers

    DEFF Research Database (Denmark)

    Ye, Feihong; Peucheret, Christophe; Morioka, Toshio

    2013-01-01

    The capacity of heterogeneous multi-core fibers is explored, taking into account intra-core nonlinearities and inter-core crosstalk. Over 10 Pb/s transmission capacity can be anticipated for a densely-packed 93-core fiber with a 220 μm cladding diameter.

  13. Multicore optical fibre and fibre-optic delay line based on it

    Science.gov (United States)

    Egorova, O. N.; Astapovich, M. S.; Belkin, M. E.; Semjonov, S. L.

    2016-12-01

    The first switchable fibre-optic delay line based on a 1300-m-long multicore optical fibre has been fabricated and investigated. We have obtained signal delay times of up to 45 μs at 6.43-μs intervals. Sequential signal propagation through the cores of the multicore optical fibre makes it possible to reduce the fibre length necessary for obtaining a predetermined delay time, which is important for reducing the weight and dimensions of devices based on the use of fibre-optic delay lines.

  14. Multiwatt octave-spanning supercontinuum generation in multicore photonic-crystal fiber.

    Science.gov (United States)

    Fang, Xiao-hui; Hu, Ming-lie; Huang, Li-li; Chai, Lu; Dai, Neng-li; Li, Jin-yan; Tashchilina, A Yu; Zheltikov, Aleksei M; Wang, Ching-yue

    2012-06-15

    High-power supercontinuum spanning more than an octave was generated using a high-power femtosecond fiber laser amplifier and a multicore nonlinear photonic crystal fiber (PCF). Long multicore PCFs (as long as 20 m in our experiments) are shown to enable supercontinuum generation in an isolated fundamental supermode, with the manifold of other PCF modes suppressed due to the strong evanescent-field coupling between the cores, providing a robust 5.4 W coherent supercontinuum output with high spatial and spectral quality within the wavelength range from 500 to 1700 nm.

  15. Resource management in shared-memory multi-core nodes

    OpenAIRE

    Ferreira, Tharso de Souza

    2010-01-01

    Resource management in multi-core processors has gained importance with the evolution of applications and architectures, but this management is very complex. For example, the same parallel application executed multiple times with the same input data on a single multi-core node can have highly variable execution times. There are multiple hardware and software factors that affect performance. The way in which the hardware resources (compute and memory) are assigned to the proc...

  16. Ultrasound phase rotation beamforming on multi-core DSP.

    Science.gov (United States)

    Ma, Jieming; Karadayi, Kerem; Ali, Murtaza; Kim, Yongmin

    2014-01-01

    Phase rotation beamforming (PRBF) is a commonly used digital receive beamforming technique. However, due to its high computational requirement, it has traditionally been supported by hardwired architectures, e.g., application-specific integrated circuits (ASICs) or, more recently, field-programmable gate arrays (FPGAs). In this study, we investigated the feasibility of supporting software-based PRBF on a multi-core DSP. To alleviate the high computing requirement, analog front-end (AFE) chips integrating quadrature demodulation in addition to analog-to-digital conversion were defined and used. With these new AFE chips, only delay alignment and phase rotation need to be performed by the DSP, substantially reducing the computational load. We implemented the delay alignment and phase rotation modules on a Texas Instruments C6678 DSP with 8 cores. We found it takes 200 μs to beamform 2048 samples from 64 channels using 2 cores. With 4 cores, 20 million samples can be beamformed in one second. Therefore, ADC frequencies up to 40 MHz with 2:1 decimation in the AFE chips, or up to 20 MHz with no decimation, can be supported as long as the ADC-to-DSP I/O requirement can be met. The remaining 4 cores can work on back-end processing tasks and applications, e.g., color Doppler or ultrasound elastography. One DSP being able to handle both beamforming and back-end processing could lead to low-power and low-cost ultrasound machines, benefiting ultrasound imaging in general and portable ultrasound machines in particular.
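
The two stages left to the DSP in this scheme, coarse delay alignment and fine phase rotation, can be sketched on quadrature-demodulated (complex baseband) samples. The function name and data layout below are illustrative, not TI's API:

```python
# Hypothetical sketch of phase rotation beamforming after quadrature
# demodulation: per channel, pick the delay-aligned sample (coarse) and
# rotate its phase (fine), then sum across channels.
import cmath

def beamform(channels, delays, phases):
    """channels: per-channel lists of complex baseband samples;
    delays: integer sample delays; phases: fine phase corrections (radians)."""
    n = min(len(ch) - d for ch, d in zip(channels, delays))
    out = []
    for i in range(n):
        acc = 0j
        for ch, d, ph in zip(channels, delays, phases):
            # Coarse delay alignment: index the delayed sample;
            # fine phase rotation: multiply by exp(j*phi).
            acc += ch[i + d] * cmath.exp(1j * ph)
        out.append(acc)
    return out

# Two channels carrying the same signal, the second delayed by one sample and
# phase-shifted by 0.5 rad; after correction the channels add coherently.
sig = [1 + 0j, 0 + 1j]
ch0 = sig
ch1 = [0j] + [x * cmath.exp(-0.5j) for x in sig]
out = beamform([ch0, ch1], delays=[0, 1], phases=[0.0, 0.5])
```

With the corrections applied, each output sample is the coherent sum of the two channels.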

  17. IDSP- INTERACTIVE DIGITAL SIGNAL PROCESSOR

    Science.gov (United States)

    Mish, W. H.

    1994-01-01

    The Interactive Digital Signal Processor, IDSP, consists of a set of time series analysis "operators" based on the various algorithms commonly used for digital signal analysis work. The processing of a digital time series to extract information is usually achieved by the application of a number of fairly standard operations. However, it is often desirable to "experiment" with various operations and combinations of operations to explore their effect on the results. IDSP is designed to provide an interactive and easy-to-use system for this type of digital time series analysis. The IDSP operators can be applied in any sensible order (even recursively), and can be applied to single time series or to simultaneous time series. IDSP is being used extensively to process data obtained from scientific instruments onboard spacecraft. It is also an excellent teaching tool for demonstrating the application of time series operators to artificially-generated signals. IDSP currently includes over 43 standard operators. Processing operators provide for Fourier transformation operations, design and application of digital filters, and Eigenvalue analysis. Additional support operators provide for data editing, display of information, graphical output, and batch operation. User-developed operators can be easily interfaced with the system to provide for expansion and experimentation. Each operator application generates one or more output files from an input file. The processing of a file can involve many operators in a complex application. IDSP maintains historical information as an integral part of each file so that the user can display the operator history of the file at any time during an interactive analysis. IDSP is written in VAX FORTRAN 77 for interactive or batch execution and has been implemented on a DEC VAX-11/780 operating under VMS. The IDSP system generates graphics output for a variety of graphics systems. The program requires the use of Versaplot and Template plotting

  18. Conversion of an 8-bit to a 16-bit Soft-core RISC Processor

    Directory of Open Access Journals (Sweden)

    Ahmad Jamal Salim

    2013-03-01

    The demand for 8-bit processors remains strong despite manufacturers' efforts to push higher-end microcontroller solutions to the mass market. A low-end processor offers a simple, low-cost and fast solution, especially for I/O application development in embedded systems. However, due to architectural constraints, complex calculations cannot be performed efficiently on an 8-bit processor. This paper presents a method for converting an 8-bit to a 16-bit Reduced Instruction Set Computer (RISC) processor on a soft-core reconfigurable platform, extending its capability to handle larger data sets and thus enabling calculation-intensive processing. While the conversion expands the data bus width to 16 bits, it maintains the simple architecture of the 8-bit design, and the expansion provides more room for improving the processor's performance. The modified architecture is successfully simulated in CPUSim together with its new instruction set architecture (ISA), and a Xilinx Virtex-6 platform is used to execute and verify it. Results show that the modified 16-bit RISC architecture requires only 17% more register slices in the Field Programmable Gate Array (FPGA) implementation, a slight increase compared to the original 8-bit RISC architecture. A test program containing instructions that handle 16-bit data is also simulated and verified. As the 16-bit architecture is described as a soft core, further modifications can be made to customize it for specific applications.

  19. PERFORMANCE EVALUATION OF OR1200 PROCESSOR WITH EVOLUTIONARY PARALLEL HPRC USING GEP

    Directory of Open Access Journals (Sweden)

    R. Maheswari

    2012-04-01

    In this era of fast computing, most embedded systems require more computing power to complete complex functions or tasks in less time. One way to achieve this is to boost processor performance, allowing the processor core to run faster. This paper presents a novel technique for increasing performance through parallel HPRC (High Performance Reconfigurable Computing) in the CPU/DSP (Digital Signal Processor) unit of the OR1200 (Open Reduced Instruction Set Computer (RISC) 1200), using Gene Expression Programming (GEP), an evolutionary programming model. OR1200 is a soft-core RISC processor from the Intellectual Property cores that can efficiently run any modern operating system. A parallel HPRC unit is placed inside the integer execution pipeline of the CPU/DSP core to increase performance. The GEP parallel HPRC is activated/deactivated by triggering the signals (i) HPRC_Gene_Start and (ii) HPRC_Gene_End. A Verilog HDL (Hardware Description Language) functional model of the GEP parallel HPRC is developed and synthesised using Xilinx ISE in the first part of the work, and the CoreMark processor benchmark is used to test the performance of the OR1200 soft core in the second part. The results show an overall speed-up of 20.59% from the GEP-based parallel HPRC in the execution unit of the OR1200.

  20. Study and application of a new kind of event-driven processor

    Institute of Scientific and Technical Information of China (English)

    韩琳; 潘登

    2012-01-01

    A new kind of processor, an event-driven multi-core processor, was studied to meet the requirements of low cost, high efficiency and high flexibility in modern electronics design. From a study of the processor's basic architecture, and a comparison between designs using the new processor and those using traditional controllers, it is concluded that the processor offers high performance, strong real-time behaviour and easy programming. A new design method, in which hardware is designed in a software-like way, is proposed to provide new ideas and a reference for electronic system design.

  1. FFT PROCESSOR IMPLEMENTATION & THROUGHPUT OPTIMIZATION USING DMA & C2H COMPILER

    Directory of Open Access Journals (Sweden)

    Varsha Adhangale

    2013-07-01

    The Discrete Fourier Transform (DFT) is an important transform in signal analysis and processing, but its O(N²) time complexity is unacceptable in many situations, so making the DFT fast and efficient is an important problem. Exploiting the structure of the DFT, the FFT reduces the time complexity to O(N log N). This paper presents an 8-point Fast Fourier Transform (FFT) processor built with Altera tools and devices: a Nios II soft processor on a DE0 board, the C2H Compiler, and DMA. The Nios II is a soft-core processor implemented on the FPGA available on the Altera DE0 board. The C2H Compiler is a powerful tool that generates hardware accelerators for software functions; it enhances design productivity by allowing a compiler to accelerate software algorithms in hardware, making it possible to quickly prototype hardware functional changes in C and explore hardware/software design trade-offs in an efficient, iterative process. Performance is also increased by accelerating only part of the software program in hardware. The C2H Compiler is well suited to improving computational bandwidth as well as memory throughput, and it provides a simpler way of computing complex multiplications while decreasing latency. DMA (Direct Memory Access) is another important mechanism applied to increase the efficiency of the implemented system. The performance of the implemented FFT processor is therefore observed under three different approaches, showing how the system can be useful in various signal processing applications.
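
As a reminder of why the FFT replaces the direct DFT, here is a minimal pure-Python radix-2 decimation-in-time FFT. This is illustrative only; the paper's processor realizes the computation with Nios II software and C2H-generated hardware:

```python
# Recursive radix-2 decimation-in-time FFT: O(N log N) instead of the
# O(N^2) direct DFT. len(x) must be a power of two (e.g. the 8-point case).
import cmath

def fft(x):
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    # Split into even- and odd-indexed halves and transform each.
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out
```

For example, an 8-point impulse `fft([1, 0, 0, 0, 0, 0, 0, 0])` yields a flat spectrum of ones, and a constant input concentrates all energy in the DC bin.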

  2. Computing effective properties of random heterogeneous materials on heterogeneous parallel processors

    Science.gov (United States)

    Leidi, Tiziano; Scocchi, Giulio; Grossi, Loris; Pusterla, Simone; D'Angelo, Claudio; Thiran, Jean-Philippe; Ortona, Alberto

    2012-11-01

    In recent decades, finite element (FE) techniques have been extensively used for predicting the effective properties of random heterogeneous materials. In the case of very complex microstructures, numerical methods offer some advantages over classical analytical approaches, and they allow the use of digital images obtained from real material samples (e.g., using computed tomography). On the other hand, a large number of elements is often necessary to properly describe complex microstructures, ultimately leading to extremely time-consuming computations and high memory requirements. With the final objective of reducing these limitations, we improved an existing freely available FE code for the computation of the effective conductivity (electrical and thermal) of microstructure digital models. To allow execution on hardware combining multi-core CPUs and a GPU, we first translated the original algorithm from Fortran to C and subdivided it into software components. Then, we enhanced the C version of the algorithm for parallel processing with heterogeneous processors. To maximize performance and limit resource consumption, we used a software architecture based on stream processing, event-driven scheduling, and dynamic load balancing. The parallel version of the algorithm has been validated using a simple microstructure consisting of a single sphere located at the centre of a cubic box, yielding consistent results. Finally, the code was used to calculate the effective thermal conductivity of a digital model of a real sample (a ceramic foam imaged using X-ray computed tomography). On a computer equipped with dual hexa-core Intel Xeon X5670 processors and an NVIDIA Tesla C2050, the parallel version shows near-linear speed-up when using only the CPU cores, and executes more than 20 times faster when the GPU is also used.

  3. High performance in silico virtual drug screening on many-core processors.

    Science.gov (United States)

    McIntosh-Smith, Simon; Price, James; Sessions, Richard B; Ibarra, Amaurys A

    2015-05-01

    Drug screening is an important part of the drug development pipeline for the pharmaceutical industry. Traditional, lab-based methods are increasingly being augmented with computational methods, ranging from simple molecular similarity searches through more complex pharmacophore matching to more computationally intensive approaches, such as molecular docking. The latter simulates the binding of drug molecules to their targets, typically protein molecules. In this work, we describe BUDE, the Bristol University Docking Engine, which has been ported to the OpenCL industry standard parallel programming language in order to exploit the performance of modern many-core processors. Our highly optimized OpenCL implementation of BUDE sustains 1.43 TFLOP/s on a single Nvidia GTX 680 GPU, or 46% of peak performance. BUDE also exploits OpenCL to deliver effective performance portability across a broad spectrum of different computer architectures from different vendors, including GPUs from Nvidia and AMD, Intel's Xeon Phi and multi-core CPUs with SIMD instruction sets.

  4. RTEMS SMP for LEON3/LEON4 Multi-Processor Devices

    Science.gov (United States)

    Cederman, Daniel; Hellstrom, Daniel; Sherrill, Joel; Bloom, Gedare; Patte, Mathieu; Zulianello, Marco

    2014-08-01

    When multi-core processors are used in the space industry, they are mostly used in AMP configurations. The cost of the increased complexity and difficulty of analyzing SMP systems has been deemed too high in comparison with the benefit of more processing power. One reason for this is the lack of easy-to-analyze operating systems capable of SMP configurations. In this paper we present a European Space Agency (ESA) activity aimed at bringing easily accessible SMP support to the GR712RC and ESA's future Next Generation Microprocessor (NGMP). This will be achieved by extending the RTEMS operating system with SMP capabilities and by providing parallel programming models and related libraries to exploit the intrinsic parallelism of space applications. The work will be validated by porting the single-core Gaia Video Processing Unit space application used in ESA's Gaia satellite project to RTEMS SMP running on the GR712RC and NGMP. The paper describes the ongoing effort and gives an overview of the challenges faced in extending a real-time OS to the SMP domain. The activity is funded by ESA under contract 4000108560/13/NL/JK. Gedare Bloom is supported in part by NSF CNS-0934725.

  5. Making CSB+-Tree Processor Conscious

    DEFF Research Database (Denmark)

    Samuel, Michael; Pedersen, Anders Uhl; Bonnet, Philippe

    2005-01-01

    Cache-conscious indexes, such as the CSB+-tree, are sensitive to the underlying processor architecture. In this paper, we focus on how to adapt the CSB+-tree so that it performs well on a range of different processor architectures. Previous work has focused on the impact of node size on the performance of the CSB+-tree. We argue that it is necessary to consider a larger group of parameters in order to adapt the CSB+-tree to processor architectures as different as Pentium and Itanium. We identify this group of parameters and study how it impacts the performance of the CSB+-tree on Itanium 2. Finally, we propose a systematic method for adapting the CSB+-tree to new platforms. This work is a first step towards integrating the CSB+-tree in MySQL's heap storage manager.

  6. Dynamic Load Balancing using Graphics Processors

    Directory of Open Access Journals (Sweden)

    R Mohan

    2014-04-01

    To get maximum performance on many-core graphics processors, it is important to balance the workload evenly so that all processing units contribute equally to the task at hand. This can be hard to achieve when the cost of a task is not known beforehand and when new sub-tasks are created dynamically during execution. Two load-balancing methods, static task assignment and dynamic work stealing using deques, are compared to see which is better suited to the highly parallel world of graphics processors. They have been evaluated on the task of computing the machine's move against a human player in the well-known Four-in-a-Row game. The experiments showed that synchronization can be very expensive, and that new methods which use graphics processor features wisely may be required.
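
The work-stealing scheme compared above can be sketched as follows. This is a simplified, single-threaded illustration (class and method names are mine); a real GPU implementation needs explicit synchronization between owner and thieves, which is exactly the cost the experiments measure:

```python
# Work-stealing sketch: each worker owns a deque of tasks. The owner pushes
# and pops new sub-tasks at one end (LIFO, good locality); idle workers steal
# the oldest task from the opposite end (FIFO).
from collections import deque

class Worker:
    def __init__(self, tasks=()):
        self.tasks = deque(tasks)

    def push(self, task):
        self.tasks.append(task)

    def pop(self):
        # Owner takes its most recently pushed sub-task.
        return self.tasks.pop() if self.tasks else None

    def steal_from(self, victim):
        # A thief takes the oldest task from the victim's other end.
        if victim.tasks:
            self.push(victim.tasks.popleft())
            return True
        return False

a = Worker([1, 2, 3])
b = Worker()
b.steal_from(a)  # b takes a's oldest task
```

After the steal, worker `b` holds task 1 while `a` keeps working LIFO through 3 and 2.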

  7. Intrusion Detection Architecture Utilizing Graphics Processors

    Directory of Open Access Journals (Sweden)

    Branislav Madoš

    2012-12-01

    With thriving technology and the great increase in the usage of computer networks, the risk of these networks coming under attack has increased. A number of techniques have been created and designed to help detect and/or prevent such attacks. One common technique is the use of Intrusion Detection Systems (IDS). Today, a number of open-source and commercial IDSs are available to match enterprise requirements, but the performance of these systems is still the main concern. This paper examines perceptions of intrusion detection architecture implementation resulting from the use of graphics processors. It discusses recent research activities, developments and problems of operating system security. Some exploratory evidence is presented that shows the capabilities of combining graphics processors with intrusion detection systems. The focus is on how knowledge gained through the inclusion of graphics processors has played out in the design of an intrusion detection architecture, seen as an opportunity to strengthen research expertise.

  8. ETHERNET PACKET PROCESSOR FOR SOC APPLICATION

    Directory of Open Access Journals (Sweden)

    Raja Jitendra Nayaka

    2012-07-01

    As demand for the Internet expands significantly in numbers of users, servers, IP addresses, switches and routers, the IP-based network architecture must evolve and change. The design of domain-specific processors that require high performance, low power and a high degree of programmability is the bottleneck in many processor-based applications. This paper describes the design of an Ethernet packet processor for system-on-chip (SoC) which performs all core packet-processing functions, including segmentation and reassembly, packetization, classification, and route and queue management, which speeds up switching/routing performance. Our design has been configured for use with multiple projects targeted at a commercial configurable logic device, and the system is designed to support 10/100/1000 links with a speed advantage. VHDL has been used to implement and simulate the required functions on an FPGA.

  9. Programmable DNA-mediated multitasking processor

    CERN Document Server

    Shu, Jian-Jun; Yong, Kian-Yan; Shao, Fangwei; Lee, Kee Jin

    2015-01-01

    Because of DNA's appealing features as a material, including its minuscule size, defined structural repeat and rigidity, programmable DNA-mediated processing is a promising computing paradigm, which employs DNA as an information-storage and processing substrate to tackle computational problems. The massive parallelism of DNA hybridization has transcendent potential to improve multitasking capabilities and yield a tremendous speed-up over conventional electronic processors with their stepwise signal cascades. As an example of this multitasking capability, we present an in vitro programmable DNA-mediated optimal-route-planning processor, a functional unit that could be embedded in contemporary navigation systems. The programmable DNA-mediated processor has several advantages over existing silicon-mediated methods, such as massive data storage and simultaneous processing using far less material than conventional silicon devices.

  10. SWIFT Privacy: Data Processor Becomes Data Controller

    Directory of Open Access Journals (Sweden)

    Edwin Jacobs

    2007-04-01

    Last month, SWIFT emphasised the urgent need for a solution to compliance with US Treasury subpoenas that provides legal certainty for the financial industry as well as for SWIFT. SWIFT will continue its activities to adhere to the Safe Harbor framework of European data privacy legislation. Safe Harbor is a framework negotiated by the EU and US in 2000 to provide a way for companies in Europe, with operations in the US, to conform to EU data privacy regulations. This seems to conclude a complex privacy case, widely covered by the US and European media. A fundamental question in this case was who is a data controller and who is a mere data processor. Both the Belgian and the European privacy authorities considered SWIFT, jointly with the banks, to be a data controller, whereas SWIFT had considered itself a mere data processor that processed financial data for banks. The difference between controller and processor has far-reaching consequences.

  11. The Associative Memory Boards for the FTK Processor at ATLAS

    CERN Document Server

    Calabro', D; The ATLAS collaboration; Citraro, S; Donati, S; Giannetti, P; Lanza, A; Luciano, P; Magalotti, D; Piendibena, M

    2013-01-01

    The Associative Memory (AM) system, the main part of the FastTracker (FTK) processor, is designed to perform pattern matching using the information of the silicon tracking detectors of the ATLAS experiment. It finds track candidates at low resolution that are seeds for the following step, which performs precise track fitting. The system has to support challenging data traffic, handled by a group of modern low-cost FPGAs, the Xilinx Artix-7 chips, which have Low-Power Gigabit Transceivers (GTPs). Each GTP is a combined transmitter and receiver capable of operating at data rates up to 7 Gb/s. The paper reports on the design and the initial tests of the most recent version of the AM system, based on the new AM chip design provided with serialized I/O. An estimate of the power consumption of the final system is also given, and the cooling system design is described. The first cooling test results are reported.

  12. The Associative Memory system for the FTK processor at ATLAS

    CERN Document Server

    Cipriani, R; The ATLAS collaboration; Donati, S; Giannetti, P; Lanza, A; Luciano, P; Magalotti, D; Piendibene, M

    2013-01-01

    Experiments at the LHC hadron collider search for extremely rare processes hidden in much larger background levels. As the experiment complexity, the accelerator backgrounds and the instantaneous luminosity increase, increasingly complex and exclusive selections are necessary. We present results and performances of a new prototype of the Associative Memory (AM) system, the core of the Fast Tracker processor (FTK). FTK is a real-time tracking device for the ATLAS experiment trigger upgrade. The AM system provides massive computing power to minimize the online execution time of complex tracking algorithms. The time-consuming pattern recognition problem, generally referred to as the "combinatorial challenge", is beaten by the AM technology, which exploits parallelism to the maximum level. The Associative Memory compares the event to pre-calculated "expectations" or "patterns" (pattern matching) at once and looks for candidate tracks called "roads". The problem is solved by the time the data are loaded into the AM devices. We report ...
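
The matching rule the AM chip evaluates can be sketched sequentially (hypothetical data layout, for illustration only): each pre-computed pattern lists one coarse hit per detector layer, and a pattern fires as a "road" when every one of its hits is present in the event. The real AM is a content-addressable memory that checks all patterns simultaneously; this loop only shows the logic.

```python
# Sequential sketch of AM-style pattern matching. The hardware performs the
# same comparison for all patterns in parallel (CAM); here we iterate.
def find_roads(patterns, event_hits):
    """patterns: list of tuples, one coarse hit id per detector layer;
    event_hits: per-layer sets of hit ids recorded in the event."""
    roads = []
    for pid, pattern in enumerate(patterns):
        # A pattern fires when every one of its per-layer hits is present.
        if all(hit in layer for hit, layer in zip(pattern, event_hits)):
            roads.append(pid)
    return roads

patterns = [(1, 4, 7), (1, 5, 7), (2, 4, 8)]
event = [{1, 2}, {4}, {7, 8}]
roads = find_roads(patterns, event)
```

Here patterns 0 and 2 match the event and become road candidates for the precise track-fitting stage.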

  13. The Associative Memory system for the FTK processor at ATLAS

    CERN Document Server

    Cipriani, R; The ATLAS collaboration; Donati, S; Giannetti, P; Lanza, A; Luciano, P; Magalotti, D; Piendibene, M

    2013-01-01

    Modern experiments search for extremely rare processes hidden in much larger background levels. As the experiment complexity, the accelerator backgrounds and the luminosity increase, we need increasingly complex and exclusive selections. We present results and performances of a new prototype of the Associative Memory system, the core of the Fast Tracker processor (FTK). FTK is a real-time tracking device for the ATLAS experiment trigger upgrade. The AM system provides massive computing power to minimize the online execution time of complex tracking algorithms. The time-consuming pattern recognition problem, generally referred to as the "combinatorial challenge", is beaten by the Associative Memory (AM) technology, which exploits parallelism to the maximum level: it compares the event to pre-calculated "expectations" or "patterns" (pattern matching) at once, looking for candidate tracks called "roads". The problem is solved by the time the data are loaded into the AM devices. We report on the tests of the integrate...

  15. Reconfigurable FFT Processor – A Broader Perspective Survey

    Directory of Open Access Journals (Sweden)

    V.Sarada

    2013-04-01

    FFT (Fast Fourier Transform) processing is one of the key procedures in popular orthogonal frequency division multiplexing (OFDM) based communication systems such as Digital Audio Broadcasting (DAB), Digital Video Broadcasting - Terrestrial (DVB-T), and Asymmetric Digital Subscriber Loop (ADSL). These application domains require FFTs of various sizes, from 64 to 8192 points. Implementing each FFT on a dedicated IP presents a great overhead in the silicon area of the chip, and supporting the FFT sizes of each new wireless telecommunication standard with dedicated hardware may increase time to market. These considerations make the FFT an ideal candidate for reconfigurable implementation. Efficient implementation of an FFT processor with small area, low power and high speed is very important. This survey paper presents a study of efficient algorithms and architectures for reconfigurable FFT design and observes common traits of the good contributions.

  16. Interleaved Core Assignment for Bidirectional Transmission in Multi-Core Fibers

    DEFF Research Database (Denmark)

    Ye, Feihong; Morioka, Toshio

    2013-01-01

    We study interleaved core assignment for bidirectional transmission in multi-core fibers. By combining it with a heterogeneous core structure in an 18-core fiber, the transmission distance is extended by 10 times compared to a homogeneous core structure with unidirectional transmission, achieving...

  17. Reconfiguration in FPGA-Based Multi-Core Platforms for Hard Real-Time Applications

    DEFF Research Database (Denmark)

    Pezzarossa, Luca; Schoeberl, Martin; Sparsø, Jens

    2016-01-01

    In general-purpose computing multi-core platforms, hardware accelerators and reconfiguration are means to improve performance, i.e., the average-case execution time of a software application. In hard real-time systems, such average-case speed-up is not in itself relevant; it is the worst-case execution time that matters. ... partial reconfiguration capabilities found in modern FPGAs.

  18. Multi-Core Emptiness Checking of Timed Büchi Automata using Inclusion Abstraction

    DEFF Research Database (Denmark)

    Laarman, Alfons; Olesen, Mads Chr.; Dalsgaard, Andreas

    2013-01-01

    ... is not preserved under this abstraction, but some other structural properties are preserved. Based on those, we propose a variation of the classical nested depth-first search (NDFS) algorithm that exploits subsumption. In addition, we extend the multi-core cndfs algorithm with subsumption, providing the first...

  19. Exploiting Vector and Multicore Parallelism for Recursive, Data- and Task-Parallel Programs

    Energy Technology Data Exchange (ETDEWEB)

    Ren, Bin; Krishnamoorthy, Sriram; Agrawal, Kunal; Kulkarni, Milind

    2017-01-26

    Modern hardware contains parallel execution resources that are well suited for data parallelism (vector units) and for task parallelism (multicores). However, most work on parallel scheduling focuses on one type of hardware or the other. In this work, we present a scheduling framework that allows a unified treatment of task and data parallelism. Our key insight is an abstraction, task blocks, that uniformly handles data-parallel iterations and task-parallel tasks, allowing them to be scheduled on vector units or executed independently on cores. Our framework allows us to define schedulers that can dynamically select between executing task blocks on vector units or on multicores. We show that these schedulers are asymptotically optimal, and deliver the maximum amount of parallelism available in computation trees. To evaluate our schedulers, we develop program transformations that can convert mixed data- and task-parallel programs into task-block-based programs. Using a prototype instantiation of our scheduling framework, we show that, on an 8-core system, we can simultaneously exploit vector and multicore parallelism to achieve 14×-108× speedup over sequential baselines.
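
The task-block idea can be sketched as follows (class and method names are mine, not the authors' API): one uniform unit of work either runs its items together, vector-style, or splits into independent units that separate cores could execute.

```python
# Hypothetical sketch of the "task block" abstraction: a function plus the
# items it applies to. A scheduler can run the block as one vectorized unit
# or split it into independent single-item blocks for multicore execution.
class TaskBlock:
    def __init__(self, fn, items):
        self.fn, self.items = fn, items

    def run_vectorized(self):
        # Stand-in for SIMD execution: process the whole block together.
        return [self.fn(x) for x in self.items]

    def split(self):
        # Stand-in for multicore execution: break into independent
        # single-item blocks that separate workers could run.
        return [TaskBlock(self.fn, [x]) for x in self.items]

block = TaskBlock(lambda x: x * x, [1, 2, 3])
```

Both execution strategies produce the same results, which is what lets a dynamic scheduler pick between them at run time.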

  20. MulticoreBSP for C : A high-performance library for shared-memory parallel programming

    NARCIS (Netherlands)

    Yzelman, A. N.; Bisseling, R. H.; Roose, D.; Meerbergen, K.

    2014-01-01

    The bulk synchronous parallel (BSP) model, as well as parallel programming interfaces based on BSP, classically target distributed-memory parallel architectures. In earlier work, Yzelman and Bisseling designed a MulticoreBSP for Java library specifically for shared-memory architectures. In the prese